
Mispositioned textboxes in Reading doc, pdf files using Apache POI and Apache PDFBox

 
Vipul Kumar
Greenhorn
Posts: 4
I am trying to read and process .doc, .docx and .pdf files in Java by converting each one into a single string, using the **Apache POI** (for .doc and .docx) and **Apache PDFBox** (for .pdf) libraries.
It works fine until it encounters textboxes.

If the format is like this:
paragraph 1
textbox 1
paragraph 2
textbox 2
paragraph 3

Then the output should be:
paragraph 1 textbox 1 paragraph 2 textbox 2 paragraph 3
But the output I am getting is:
paragraph 1 paragraph 2 paragraph 3 textbox 1 textbox 2
The textboxes seem to be appended at the end instead of where they belong, i.e. between the paragraphs. This happens with both doc and pdf files, so both libraries, POI and PDFBox, show the same problem.
The code for reading pdf file is:



And code for doc file is:


 
Ulf Dittmer
Rancher
Posts: 42966
Those text extractors work sequentially along the contents of the structured file format; if that order is different than the order in which the text is displayed, then that's the way the text will be extracted. For POI I'd assume that you can use its API to get at content in its proper ordering. Not sure if PDFBox has a similar API for PDFs.
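
To illustrate the point with a toy model: the sketch below is plain Java, not POI or PDFBox code, and the `Fragment` class with its coordinates is made up for the example. It assumes each piece of text comes with a position on the page, and shows how joining the pieces in storage order gives the output from the question, while sorting by position first gives the expected output. For PDFs, PDFBox's `PDFTextStripper` has a `setSortByPosition(true)` switch that sorts in a similar way; whether it helps depends on how the PDF was laid out.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class ReadingOrder {

    /** Hypothetical text fragment: its text plus the top-left position (y grows downward). */
    public static final class Fragment {
        public final String text;
        public final double x;
        public final double y;

        public Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins fragments exactly in the order they are stored. */
    public static String joinInStorageOrder(List<Fragment> fragments) {
        return fragments.stream()
                .map(f -> f.text)
                .collect(Collectors.joining(" "));
    }

    /** Joins fragments top to bottom, then left to right. */
    public static String joinInReadingOrder(List<Fragment> fragments) {
        return fragments.stream()
                .sorted(Comparator.comparingDouble((Fragment f) -> f.y)
                        .thenComparingDouble(f -> f.x))
                .map(f -> f.text)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Assumed storage layout: body paragraphs first, textboxes afterwards.
        List<Fragment> stored = List.of(
                new Fragment("paragraph 1", 0, 10),
                new Fragment("paragraph 2", 0, 30),
                new Fragment("paragraph 3", 0, 50),
                new Fragment("textbox 1", 0, 20),
                new Fragment("textbox 2", 0, 40));

        // prints "paragraph 1 paragraph 2 paragraph 3 textbox 1 textbox 2"
        System.out.println(joinInStorageOrder(stored));
        // prints "paragraph 1 textbox 1 paragraph 2 textbox 2 paragraph 3"
        System.out.println(joinInReadingOrder(stored));
    }
}
```

A real document is messier than this (multiple columns, textboxes floating beside a paragraph rather than between two), so a plain top-to-bottom sort is only a starting point.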
 
Vipul Kumar
Greenhorn
Posts: 4
Ulf Dittmer wrote:Those text extractors work sequentially along the contents of the structured file format; if that order is different than the order in which the text is displayed, then that's the way the text will be extracted. For POI I'd assume that you can use its API to get at content in its proper ordering. Not sure if PDFBox has a similar API for PDFs.

Thanks for your reply. Can you show an example of that with POI? I tried but couldn't find anything useful.
 
Ulf Dittmer
Rancher
Posts: 42966
That's not at all going to be easy. You'll need to do a fair amount of trial-and-error to see how to piece the various classes and methods of the API together. The HWPF/XWPF part of POI is not nearly as well documented or as widely used as the HSSF/XSSF part that deals with XLS/XLSX files.

Creating a document: http://www.coderanch.com/how-to/java/CreateWordDocument (so you'd need to do the reverse)

Extremely short API intro: http://poi.apache.org/hwpf/quick-guide.html

Test case source code: http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/testcases/org/apache/poi/hwpf/
 