
PDF parsing

Jina Lu
Greenhorn

Joined: Jul 09, 2010
Posts: 26

Hi,
I searched the forum and googled, but couldn't find the answer.
For my current project I need a library for PDF parsing. I need to extract text, images, bookmarks, annotations and security information. I tried PDFBox and iText, but both seem to have problems with custom font encoding. Non-English characters are corrupted.
There is no standard template or authoring tool for the PDFs my program receives, so this problem is quite essential.
Please recommend which library I should use to solve this.
Jesper de Jong
Java Cowboy
Saloon Keeper

Joined: Aug 16, 2005
Posts: 14266

Jina Lu wrote: I tried PDFBox and iText, but both seem to have problems with custom font encoding. Non-English characters are corrupted.

Are you sure that is because of bugs or lack of support in those libraries, or was it just a bug in how your program handled and displayed the data?

Jina Lu
Greenhorn

Joined: Jul 09, 2010
Posts: 26

Jesper, it surely is related to encoding, but I might also be missing something in my code that would fix it. I tried different PDF files. If the encoding is Identity-H, ANSI, or anything else that is not custom, I get correct output.

PDFBOX:


iText:
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42264
Offhand I would also assume that the problems with non-ASCII text are not intrinsic to the libraries you're using. You're not printing the characters to a console, a terminal, or a flat file that only supports ASCII, right?
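
To illustrate the kind of loss meant here, a minimal sketch, assuming nothing about the original poster's code: pushing a String through an ASCII-only encoding replaces every non-ASCII character with '?', while UTF-8 preserves it. The class and method names are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class AsciiLossDemo {
    // Encodes the text to bytes in the given charset and decodes it back,
    // which mimics writing to (and reading from) an output with that encoding.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String text = "Gr\u00fc\u00dfe"; // "Grüße", written with escapes

        // US-ASCII cannot represent ü or ß, so both become '?'.
        System.out.println(roundTrip(text, StandardCharsets.US_ASCII)); // prints "Gr??e"

        // UTF-8 can represent every Unicode character, so nothing is lost.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // prints "true"
    }
}
```

If the output looks like the first line, the String itself may be fine and only the output channel is lossy.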

But more importantly, I don't think there's an easy solution, and certainly not an easy free solution, to the underlying problem. If the two libraries you mention can extract all you need, great, but if not then it gets a lot harder. PDF-Renderer can display PDFs, so obviously it knows a lot about PDF internals; maybe that could be a starting point.


Jina Lu
Greenhorn

Joined: Jul 09, 2010
Posts: 26

Thanks for the reply, Ulf Dittmer. The parsing result is a String, which I write to a text file using commons-io: FileUtils.writeStringToFile(new File("resulr.txt"), text);
But I debugged and I see corrupted characters in the String.
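
A side note on the file-writing step: as far as I know, the two-argument FileUtils.writeStringToFile(File, String) uses the platform default encoding, which may not be able to represent every character (commons-io also offers overloads that take an encoding). A minimal sketch of writing with an explicit charset, using only the JDK; the class and method names are hypothetical:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class Utf8FileWriter {
    // Writes the text as UTF-8, independent of the platform default encoding.
    static void writeUtf8(Path target, String text) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    // Reads the file back, decoding it as UTF-8.
    static String readUtf8(Path source) throws IOException {
        return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("result", ".txt");
        String text = "na\u00efve \u0416\u0443\u043a"; // Latin and Cyrillic non-ASCII characters

        writeUtf8(target, text);
        // The text survives the round trip through the file unchanged.
        System.out.println(readUtf8(target).equals(text)); // prints "true"

        Files.delete(target);
    }
}
```

The file then has to be opened as UTF-8 in the viewer or editor as well, otherwise it can still look corrupted.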
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42264
Jina Lu wrote: But I debugged and I see corrupted characters in the String.

How, exactly, did you do that?
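
One way to check that does not depend on how a debugger, console, or editor renders the text is to dump the String's code points. A minimal sketch; the class and method names are made up for illustration:

```java
final class CodePointDump {
    // Renders each Unicode code point of the text as U+XXXX, separated by spaces.
    static String dump(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // 'A', 'é' and the Cyrillic 'Ж', written with escapes.
        System.out.println(dump("A\u00e9\u0416")); // prints "U+0041 U+00E9 U+0416"
    }
}
```

If the dump of the extracted text already shows U+FFFD (the replacement character) or unexpected values, the damage happened during extraction. If the code points are right, the problem is in how the text is written or displayed.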
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8008

Jina Lu wrote: Thanks for the reply, Ulf Dittmer. The parsing result is a String, which I write to a text file using commons-io: FileUtils.writeStringToFile(new File("resulr.txt"), text);
But I debugged and I see corrupted characters in the String.

I think Ulf is right: the problem you're trying to solve, or more specifically, the level of detail you're trying to solve it at, may be an issue. PDF was a proprietary format until 2008, and although it has since been made a published standard as ISO 32000 (you can find a copy here), that doesn't mean that Adobe have made any effort to make it "user-friendly" (it's 756 pages, just in case you're interested).

One thing that seems clear though (if you look at sections 9.1 and 9.7 in the link I provided) is that processing, especially of the 'Tj' operator, is very different if the font is not one of the 14 "standard" fonts, so it's quite possible that the libraries you mention only provide limited text extraction capability; you'd have to read their manuals to find out. You can certainly store non-ASCII characters in a Java String though.
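
To make the custom-encoding point concrete, here is a toy sketch. It is not the algorithm from the spec and not how PDFBox or iText work internally. It only assumes that the values shown by a text operator are font-specific character codes, and that an extractor needs a per-font table (in real PDFs, typically the font's encoding or a ToUnicode map) to turn them into Unicode. All names and the sample table are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class ToyGlyphDecoder {
    // Decodes character codes through a font-specific table.
    // Codes without an entry become U+FFFD, the Unicode replacement character.
    static String decode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            out.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return out.toString();
    }

    // What happens when the table is ignored: each code is taken
    // at face value as a Latin-1 character.
    static String decodeNaively(int[] codes) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            out.append((char) code);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical custom font: codes 0x41..0x43 are Cyrillic glyphs.
        Map<Integer, String> table = new HashMap<>();
        table.put(0x41, "\u0416"); // Ж
        table.put(0x42, "\u0443"); // у
        table.put(0x43, "\u043a"); // к

        int[] codes = {0x41, 0x42, 0x43};
        System.out.println(decodeNaively(codes));              // prints "ABC"
        System.out.println(decode(codes, table).length());     // prints "3"
    }
}
```

With the table, the codes decode to the intended Cyrillic word; without it, the same codes come out as "ABC". That is the kind of corruption a custom encoding can cause when the mapping is missing or not honoured.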

Another possibility might be to convert the document to something like Lucene or OpenOffice and then use that tool to extract the information you want. Both have extensive Java libraries for parsing their own doc formats and one would assume, since they're in the business of document processing, that they've made their conversion utility as comprehensive as they can. How far it will translate security information though, I have no idea - Adobe may still protect that sort of stuff.

Best of luck.

Winston

 