Copy text within bounding coordinates in PDF

mdhazlee

New Member
Hi, I want to extract the text inside a given set of bounding coordinates in a PDF; those regions correspond to certain field headers and values. I found a site with C# code that I would like to use in a code stage.

http://www.rasteredge.com/how-to/csharp-imaging/pdf-text-extract/

Since I'm new to programming, I don't know how to include those namespaces, or whether there are any DLL files that need to be downloaded before I can use the code.

My code is as follows:

[Attached image: 1515039415294.png]

However, it gives a lot of errors when I check the code. Could someone please help?

Thanks,
Hazlee
 

mdhazlee

New Member
Hi @anisjolly, thank you for the tip. I've downloaded the libraries. If I want to use them in a Blue Prism code stage, do they have to be placed in the Blue Prism folder, or can they be placed anywhere? So far I haven't found any resources that explain how to do this.

Hazlee
 

adebroise

RPA Ninja
Staff member
@mdhazlee You need to set your namespaces and libraries in the information box on the Initialise page of the object. You'll also need to switch the language there, as the default is VB.NET, not C#.

Also, you don't need to declare the function in the code stage; put in only the code that goes between the curly braces.
 

tgundhus

Member
Hello.

As @adebroise said, the coding language and namespaces are defined on the Initialise page of your object (double-click the page description box).
You will need to add the DLL and the namespaces under the Code Options tab.

Once this is done, you can put the body of your function directly into the code tab of the code stage, and BP will automatically create a function for it.
You can also call other pages as if they were functions. Please let me know if you have any questions regarding the use of code stages or best practices.

PS: As for the choice of library, iTextSharp is one option, as @anisjolly mentioned, but please remember that iTextSharp is licensed under the AGPL.
 

mdhazlee

New Member
Hi everyone,

I've finally managed to get some output using iTextSharp to read the text within a region of a PDF. The code, inserted in the global code, is as follows:

Code:
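// Uses the iTextSharp 5.x API; reference itextsharp.dll under Code Options and add these
// namespace imports: System.IO, System.Text, iTextSharp.text.pdf, iTextSharp.text.pdf.parser
public string ReadtextwithinPDF(string fileName)
{
    StringBuilder sb = new StringBuilder();
    if (!File.Exists(fileName))
        return sb.ToString(); // nothing to read, return an empty string

    PdfReader reader = new PdfReader(fileName);
    try
    {
        // Region as (lower-left x, lower-left y, upper-right x, upper-right y).
        // Fully qualified to avoid a clash with System.Drawing.Rectangle if that namespace is also imported.
        iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(0, 0, 720, 540);
        RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
        // Page numbers are 1-based; a fresh strategy is created for each page.
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            ITextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
            sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, i, strategy));
        }
    }
    finally
    {
        reader.Close(); // release the file even if extraction throws
    }
    return sb.ToString();
}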

This reads any text within the bounding rectangle (0, 0, 720, 540). I couldn't find a satisfactory explanation of how to determine the rectangle accurately, or how to change it so that it covers the region of the PDF I want to read, other than by trial and error. If anyone knows how to get more accurate values for the rectangle, please share.

Thanks!
Hazlee
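
On choosing the rectangle: PDF coordinates are measured in points (1/72 inch), normally from the bottom-left corner of the page, and the four numbers passed to iTextSharp's Rectangle are lower-left x, lower-left y, upper-right x, upper-right y. So (0, 0, 720, 540) covers a 10 x 7.5 inch area starting at the bottom-left corner. The sketch below is one possible way to turn a region measured from the top-left corner in millimetres (for example with a PDF viewer's measuring tool) into those four numbers. The names PdfRegion, MmToPoints and FromTopLeftMm are made up for illustration and are not part of iTextSharp. The page height of 842 points assumes an A4 portrait page; in practice it could probably be read with reader.GetPageSize(pageNumber).Height, and rotated pages may need extra care.

```csharp
using System;

// Hypothetical helper, not part of iTextSharp: converts a region measured
// from the TOP-left corner of the page (in millimetres) into PDF coordinates,
// which are in points and measured from the BOTTOM-left corner.
public static class PdfRegion
{
    private const float PointsPerInch = 72f;
    private const float MillimetresPerInch = 25.4f;

    public static float MmToPoints(float mm)
    {
        return mm / MillimetresPerInch * PointsPerInch;
    }

    // Returns { lower-left x, lower-left y, upper-right x, upper-right y } in points.
    public static float[] FromTopLeftMm(float pageHeightPoints,
                                        float leftMm, float topMm,
                                        float widthMm, float heightMm)
    {
        float llx = MmToPoints(leftMm);
        float urx = MmToPoints(leftMm + widthMm);
        // Flip the vertical axis: distance from the top becomes height above the bottom.
        float ury = pageHeightPoints - MmToPoints(topMm);
        float lly = pageHeightPoints - MmToPoints(topMm + heightMm);
        return new float[] { llx, lly, urx, ury };
    }

    public static void Main()
    {
        // Assumption: A4 portrait page, 842 points high.
        // Region: 20 mm from the left, 30 mm from the top, 80 mm wide, 10 mm high.
        float[] r = FromTopLeftMm(842f, 20f, 30f, 80f, 10f);
        Console.WriteLine(string.Join(", ", r));
    }
}
```

The four values would then go into the rectangle in the function above, e.g. new iTextSharp.text.Rectangle(r[0], r[1], r[2], r[3]).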
 

Uthaiah

Member
Dear Hazlee,
Did you use iTextSharp to get your code?
 

lijin619

New Member
@mdhazlee, how did you add iTextSharp to BP? I have downloaded the zip file (iText 7) from GitHub and extracted it. After that, how do we add it, and which file needs to be added to BP?
 

mdhazlee

New Member
Thanks Hazlee. I have a project where I need to edit a PDF. Will this code work for editing as well? If so, would you be able to help me with the steps?

Hi Uthaiah, unfortunately I just copied and pasted the code and adjusted it to make it work for my process. At this point in time I have no experience of editing PDFs using iTextSharp. I'm sure the library allows you to create PDFs, but I'm not sure about editing them.
 