File APIs for Java Developers
Manipulate DOC, XLS, PPT, PDF and many others from your application.
The moose likes Java in General and the fly likes Extracting text with formatting using PDFBox Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login
JavaRanch » Java Forums » Java » Java in General
Bookmark "Extracting text with formatting using PDFBox" Watch "Extracting text with formatting using PDFBox" New topic

Extracting text with formatting using PDFBox

Mattie James

Joined: Dec 19, 2008
Posts: 3
Hi guys,

I have been looking at a way to extract text from PDF documents using Java, and the best (free) solution I could find seems to be PDFBox. The tool does seem pretty nice, but I am struggling to understand how it works properly beyond just using the included classes. The tool comes with a class called "TextStripper" that does indeed take text from a pdf, but unfortunately all the formatting is lost. The work I need to do requires formatting to be retained as the file is read into Java as decisions need to be made based on whether the text was a title, header, body text etc.

I did of course check the sourceforge forums for the PDFBox project, but they appear to not have been enabled.

I would greatly appreciate someone who is familar with the tool, or just a Java guru, explaining to me how I can take text and retain the formatting, as I can't get my head around it. Unfortunately its not as simple as:

Thanks for any help guys.
I agree. Here's the link:
subject: Extracting text with formatting using PDFBox
It's not a secret anymore!