Scrape unstructured data from PDF

Amith V S

New Member
Hi guys, I am stuck with something. I have a PDF with numerous pages, and the data on each page may vary from page to page. Each piece of data has a heading, and the headings are the only thing fixed in the PDF; their position keeps varying. Scraping by position may work for some pages, but not for all of them. So I am looking for a general solution to scrape data when the position keeps varying.

Let's say I need to scrape the data between two points, A and B. The size of the data may vary; it may be one line or more.

Methods used: Relative Screen Scraping
 

Amith V S

New Member
I converted the PDF to text and tried the substring method to get the data between two headings.
Substring code used: rdtext.IndexOf("heading A") - rdtext.IndexOf("heading B")

But this does not always work. Also, I am new to regex; is there any source I can use to get a basic idea?
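
One possible reason the expression above misbehaves: it subtracts one position from the other, so it produces a number (negative whenever heading A comes first) rather than the text between the headings. Below is a minimal sketch of the index-based approach in Python, not the original .NET code. The helper name `text_between` and the sample text are made up for illustration, and "heading A" / "heading B" stand in for the real headings.

```python
def text_between(text, start_heading, end_heading):
    """Return the text between two headings, or None if either is missing."""
    start = text.find(start_heading)
    if start == -1:
        return None  # heading A is not on this page
    start += len(start_heading)  # skip past heading A itself
    # Look for heading B only after heading A, so an earlier
    # occurrence of B elsewhere in the text is ignored.
    end = text.find(end_heading, start)
    if end == -1:
        return None  # heading B is not on this page
    return text[start:end].strip()


# Hypothetical sample standing in for the text extracted from one PDF page.
rdtext = "heading A\nfirst line\nsecond line\nheading B\nother data"
between = text_between(rdtext, "heading A", "heading B")
print(between)
```

In .NET the same idea would presumably be written with `IndexOf` for both positions and then `rdtext.Substring(start, end - start)`, where `start` is the index of heading A plus the heading's length. Because only the headings are used as anchors, the result does not depend on where the block sits on the page.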
 

bnastase

Member
This video is great for giving you an idea about regex
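
For a first taste of what regex can do for the "data between two headings" problem, here is a hedged sketch in Python. The function name `extract_sections` and the sample text are hypothetical, and the headings are placeholders. The pattern escapes both headings, captures everything between them lazily, and uses `re.DOTALL` so the capture can span several lines.

```python
import re


def extract_sections(text, start_heading, end_heading):
    """Return every block of text found between the two headings."""
    pattern = re.compile(
        # re.escape guards against headings containing regex metacharacters.
        # (.*?) is lazy, so each match stops at the nearest end heading.
        re.escape(start_heading) + r"(.*?)" + re.escape(end_heading),
        re.DOTALL,  # let "." match newlines, for multi-line data
    )
    return [block.strip() for block in pattern.findall(text)]


# Hypothetical text from two pages: a one-liner and a multi-line block.
pages = (
    "heading A\none liner\nheading B\n"
    "page 2\nheading A\nline 1\nline 2\nheading B"
)
sections = extract_sections(pages, "heading A", "heading B")
print(sections)
```

The same pattern should carry over to .NET's `Regex` class with `RegexOptions.Singleline`, which is the .NET counterpart of `re.DOTALL`.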
 