Scrape unstructured data from PDF

Amith V S · Jul 10, 2018

Hi guys, I am stuck with something. I have a PDF with numerous pages, and the data on each page may vary from page to page. Each piece of data has a heading, which is the only fixed thing in the PDF; the position keeps varying. Scraping by position may work for some pages, but not for all of them. So I am looking for a general solution to scrape data when the position keeps varying.

Let's say I need to scrape the data between two points, A and B. The size of the data may vary; it may be one line or more.

Methods used: Relative Screen Scraping

bnastase · Jul 12, 2018

Hi Amith,

I suggest using regular expressions for this. You can find more info about them here.

KR

Amith V S · Jul 12, 2018

I converted the PDF to text and tried the substring method to get the data between two headings.
Substring code used: rdtext.IndexOf("heading A") - rdtext.IndexOf("heading B")

But this does not always work. Also, I am new to regex; is there any source I can use to get a basic idea?
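
For reference, the expression above only yields a number (the difference between the two positions, negative when heading A comes first), not the text itself. In .NET, Substring expects a start index and a length. Below is a minimal sketch of the index-based approach in Python, not the original workflow; the function name text_between and the sample string are made up for illustration. It starts after the end of heading A and searches for heading B only from that point on.

```python
def text_between(text, start_heading, end_heading):
    """Return the text between two headings, or None if either is missing."""
    start = text.find(start_heading)
    if start == -1:
        return None
    start += len(start_heading)  # skip past heading A itself
    end = text.find(end_heading, start)  # look for heading B only after heading A
    if end == -1:
        return None
    return text[start:end].strip()


# Hypothetical page text; the real input would be the text extracted from the PDF.
page = "heading A\nfirst line\nsecond line\nheading B\nother data"
print(text_between(page, "heading A", "heading B"))
```

Because the end position is searched for after the start position, the length can never be negative, and a missing heading is reported as None instead of raising an error.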

bnastase · Jul 12, 2018

This video is great for getting a basic idea of regex.
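
As a rough idea of what the regex approach could look like, here is a minimal sketch in Python. It is an assumption about how the problem might be solved, not a tested solution for the actual PDF; the function name sections_between and the sample text are hypothetical. The pattern captures everything between the two headings with a lazy match, and the DOTALL flag lets the match span several lines, so it does not matter whether the data is one line or more. .NET regular expressions offer the same idea through a lazy quantifier and RegexOptions.Singleline.

```python
import re


def sections_between(text, start_heading, end_heading):
    """Return every block of text found between the two headings."""
    pattern = re.compile(
        # re.escape keeps special characters in the headings from being read as regex syntax
        re.escape(start_heading) + r"(.*?)" + re.escape(end_heading),
        re.DOTALL,  # let "." match newlines so multi-line data is captured
    )
    return [match.strip() for match in pattern.findall(text)]


# Hypothetical text for two pages; the data length differs between them.
pages = (
    "heading A\nsingle line\nheading B\n"
    "page footer\n"
    "heading A\nline one\nline two\nheading B\n"
)
print(sections_between(pages, "heading A", "heading B"))
```

The lazy quantifier (.*?) stops at the first heading B after each heading A, so each page's data comes back as a separate item regardless of where it sits on the page.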

Amith V S · Jul 12, 2018

Thanks, mate.
