
Extracting text with formatting using PDFBox

Mattie James

Joined: Dec 19, 2008
Posts: 3
Hi guys,

I have been looking for a way to extract text from PDF documents using Java, and the best free solution I could find seems to be PDFBox. The tool seems pretty nice, but I am struggling to understand how it works beyond just using the included classes. It comes with a class called "PDFTextStripper" that does extract text from a PDF, but unfortunately all the formatting is lost. The work I need to do requires the formatting to be retained when the file is read into Java, because decisions need to be made based on whether the text was a title, a header, body text, etc.

I did of course check the SourceForge forums for the PDFBox project, but they do not appear to have been enabled.

I would greatly appreciate it if someone who is familiar with the tool, or just a Java guru, could explain how I can extract the text and retain the formatting, as I can't get my head around it. Unfortunately it's not as simple as:

Thanks for any help, guys.
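
One possible way to approach the title/header/body decision, assuming the font size of each run of text can be recovered during extraction. In PDFBox 1.8 and 2.x, a subclass of PDFTextStripper can override writeString(String, List<TextPosition>) and read TextPosition.getFontSizeInPt(); the release available in 2008 may expose this differently. The sketch below does not call PDFBox at all: the TextRun class, the Role enum, and the "most common size is body, largest size is title" rule are hypothetical, chosen only to illustrate the decision step using the standard library.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FormattingClassifier {

    /** Hypothetical roles a run of text can play in the document. */
    enum Role { TITLE, HEADER, BODY }

    /** Hypothetical holder: one run of text plus the font size it was drawn in. */
    static final class TextRun {
        final String text;
        final float fontSizePt;

        TextRun(String text, float fontSizePt) {
            this.text = text;
            this.fontSizePt = fontSizePt;
        }
    }

    /** Assumes body text is the font size that covers the most characters. */
    static float bodySize(List<TextRun> runs) {
        Map<Float, Integer> charsPerSize = new HashMap<>();
        for (TextRun run : runs) {
            charsPerSize.merge(run.fontSizePt, run.text.length(), Integer::sum);
        }
        float best = 0f;
        int bestCount = -1;
        for (Map.Entry<Float, Integer> e : charsPerSize.entrySet()) {
            int count = e.getValue();
            // On a tie, prefer the smaller size so the result is deterministic.
            if (count > bestCount || (count == bestCount && e.getKey() < best)) {
                best = e.getKey();
                bestCount = count;
            }
        }
        return best;
    }

    /** Largest size found in the runs, or 0 if there are none. */
    static float maxSize(List<TextRun> runs) {
        float max = 0f;
        for (TextRun run : runs) {
            max = Math.max(max, run.fontSizePt);
        }
        return max;
    }

    /** Body size or smaller is BODY, the largest size is TITLE, anything between is HEADER. */
    static Role classify(TextRun run, float bodySize, float maxSize) {
        if (run.fontSizePt <= bodySize) {
            return Role.BODY;
        }
        if (run.fontSizePt >= maxSize) {
            return Role.TITLE;
        }
        return Role.HEADER;
    }

    /** Classifies every run, in order. */
    static List<Role> classifyAll(List<TextRun> runs) {
        float body = bodySize(runs);
        float max = maxSize(runs);
        List<Role> roles = new ArrayList<>();
        for (TextRun run : runs) {
            roles.add(classify(run, body, max));
        }
        return roles;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for runs collected during extraction.
        List<TextRun> runs = new ArrayList<>();
        runs.add(new TextRun("Annual Report", 24f));
        runs.add(new TextRun("Overview", 16f));
        runs.add(new TextRun("This is a long paragraph of body text.", 11f));
        List<Role> roles = classifyAll(runs);
        for (int i = 0; i < runs.size(); i++) {
            System.out.println(roles.get(i) + ": " + runs.get(i).text);
        }
    }
}
```

Font size alone is a rough signal. Real documents often mark headers with a bold font at the body size, so the font name would probably need to be part of the rule as well.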