
PDFBox: how to extract a PDF's markup

 
Jim Harrison
Ranch Hand
Posts: 30
Hello,

I've read a lot on http://pdfbox.apache.org but can't find an example, or even work out whether the tool actually does this.

The PDF file that I'm reading has superscripts. I want to get both the text and the markup content of the file. So a couple of questions:

1. Can PDFBox do this? I see ExtractText on their website (http://pdfbox.apache.org/commandlineutilities/ExtractText.html), but that only outputs the text of the PDF.

2. Does anyone have an example of doing this?

Thanks...Jim
 
Ulf Dittmer
Rancher
Posts: 42967
No, PDFBox has no notion of extracting layout information.

You could check out the source code of https://pdf-renderer.dev.java.net/, which can display PDFs, so it must have a way of accessing the layout data.
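
As a rough illustration of what could be done with layout data once some library exposes it, here is a minimal sketch that marks superscripts on a single line of text. It assumes each glyph comes with a baseline position and a font size. The `Glyph` class, the `toMarkup` method, the `<sup>` output format, and the 0.8 and 0.2 thresholds are all hypothetical choices made for this example; none of them comes from PDFBox or pdf-renderer.

```java
import java.util.ArrayList;
import java.util.List;

public class SuperscriptSketch {

    /** Hypothetical per-glyph record; a real PDF library would have to supply equivalent data. */
    static final class Glyph {
        final String text;
        final double baselineY; // assumed to grow upward, as in PDF user space
        final double fontSize;

        Glyph(String text, double baselineY, double fontSize) {
            this.text = text;
            this.baselineY = baselineY;
            this.fontSize = fontSize;
        }
    }

    /** Wraps glyphs that are clearly smaller and raised above the body text in <sup> tags. */
    static String toMarkup(List<Glyph> line) {
        if (line.isEmpty()) {
            return "";
        }
        // Treat the largest font size on the line as the body size, and the
        // baseline of the first glyph with that size as the body baseline.
        double bodySize = 0;
        double bodyBaseline = 0;
        for (Glyph g : line) {
            if (g.fontSize > bodySize) {
                bodySize = g.fontSize;
                bodyBaseline = g.baselineY;
            }
        }
        StringBuilder out = new StringBuilder();
        boolean inSup = false;
        for (Glyph g : line) {
            // Heuristic thresholds (assumptions): noticeably smaller and noticeably raised.
            boolean sup = g.fontSize < 0.8 * bodySize
                    && g.baselineY > bodyBaseline + 0.2 * bodySize;
            if (sup && !inSup) {
                out.append("<sup>");
            }
            if (!sup && inSup) {
                out.append("</sup>");
            }
            out.append(g.text);
            inSup = sup;
        }
        if (inSup) {
            out.append("</sup>");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = new ArrayList<>();
        line.add(new Glyph("x", 100.0, 12.0));
        line.add(new Glyph("2", 104.0, 7.0));
        line.add(new Glyph(" + y", 100.0, 12.0));
        System.out.println(toMarkup(line)); // prints: x<sup>2</sup> + y
    }
}
```

The hard part is still getting the per-glyph positions and font sizes out of the PDF in the first place; the heuristic above only covers what to do with them afterwards, and the thresholds would likely need tuning per document.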
 