
PDFBox: pdf's markup, how-to extract the pdf markup...

Jim Harrison

Joined: Mar 16, 2007
Posts: 29

I've read a lot on http://pdfbox.apache.org but can't find an example, or even confirmation that the tool can do this.

The PDF file that I'm reading has superscripts. I want to get both the text and the markup content of the PDF file. So a couple of questions:

1. Can PDFBox do this? Their website documents ExtractText (http://pdfbox.apache.org/commandlineutilities/ExtractText.html), but that only outputs the text of the PDF.

2. Does anyone have an example of doing this?

Ulf Dittmer

Joined: Mar 22, 2005
Posts: 39549
No, PDFBox has no notion of extracting layout information.

You could check out the source code of https://pdf-renderer.dev.java.net/, which can display PDFs, so it must have a way of accessing the layout data.
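
Once you do have layout data from some parser, spotting superscripts could be sketched roughly as below. This is only an illustration under assumptions: the Glyph class, the markSuperscripts method and the 0.2 / 0.9 thresholds are all made up for this post and are not part of PDFBox or pdf-renderer. It assumes the parser can report a baseline and a font size for each piece of text, and that y grows upward as in PDF user space.

```java
import java.util.ArrayList;
import java.util.List;

class SuperscriptSketch {

    /** Hypothetical record of what a layout-aware parser might report for a run of text. */
    static final class Glyph {
        final String text;
        final double baselineY; // assumed to grow upward, as in PDF user space
        final double fontSize;

        Glyph(String text, double baselineY, double fontSize) {
            this.text = text;
            this.baselineY = baselineY;
            this.fontSize = fontSize;
        }
    }

    /**
     * Wraps runs that sit above the body baseline AND use a smaller font in <sup> tags.
     * The thresholds are guesses and would need tuning against real documents.
     */
    static String markSuperscripts(List<Glyph> line, double bodyBaseline, double bodyFontSize) {
        StringBuilder out = new StringBuilder();
        boolean inSup = false;
        for (Glyph g : line) {
            boolean raised = g.baselineY > bodyBaseline + 0.2 * bodyFontSize;
            boolean smaller = g.fontSize < 0.9 * bodyFontSize;
            boolean sup = raised && smaller;
            if (sup && !inSup) {
                out.append("<sup>");
            }
            if (!sup && inSup) {
                out.append("</sup>");
            }
            out.append(g.text);
            inSup = sup;
        }
        if (inSup) {
            out.append("</sup>"); // close a superscript that runs to the end of the line
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = new ArrayList<>();
        line.add(new Glyph("E=mc", 100.0, 12.0));
        line.add(new Glyph("2", 104.0, 8.0));
        System.out.println(markSuperscripts(line, 100.0, 12.0)); // prints E=mc<sup>2</sup>
    }
}
```

Requiring both conditions (raised and smaller) is a deliberate choice here, so that text which is merely set in a small font on the normal baseline is not tagged. Filling the Glyph list from an actual PDF is the part that depends on whichever library you end up using.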
