Mispositioned textboxes in Reading doc, pdf files using Apache POI and Apache PDFBox

Vipul Kumar
Greenhorn

Joined: Jun 27, 2012
Posts: 4
I am trying to read and process .doc, .docx and .pdf files in Java by converting each one into a single string, using **Apache POI** (for .doc and .docx) and **Apache PDFBox** (for .pdf).
It works fine until it encounters textboxes.

If the format is like this:
paragraph 1
textbox 1
paragraph 2
textbox 2
paragraph 3

Then the output should be:
paragraph 1 textbox 1 paragraph 2 textbox 2 paragraph 3
But the output I am getting is:
paragraph 1 paragraph 2 paragraph 3 textbox 1 textbox 2
The textboxes seem to be appended at the end instead of where they belong, i.e. between the paragraphs. This happens with both .doc and .pdf files, so both libraries, POI and PDFBox, show the same problem.
The code for reading the PDF file is:

And the code for the .doc file is:


Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41073
Those text extractors work sequentially along the contents of the structured file format; if that order is different than the order in which the text is displayed, then that's the way the text will be extracted. For POI I'd assume that you can use its API to get at content in its proper ordering. Not sure if PDFBox has a similar API for PDFs.
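To illustrate the point about storage order versus display order, here is a minimal sketch in plain Java, with no POI or PDFBox involved. The `Fragment` class, its coordinates and the `inReadingOrder` method are all hypothetical names made up for this example, and the sketch assumes that y grows downward on the page. On the PDF side, PDFBox's `PDFTextStripper` has a `setSortByPosition(true)` switch that asks the stripper to order text by position rather than by content-stream order; whether that is enough for a particular file would need to be tried.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ReadingOrderSketch {

    /** Hypothetical model: a piece of text plus the position where it is drawn. */
    static final class Fragment {
        final String text;
        final double x;
        final double y; // assumption: y grows downward on the page

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins fragments top to bottom, then left to right, ignoring storage order. */
    static String inReadingOrder(List<Fragment> storedOrder) {
        List<Fragment> sorted = new ArrayList<>(storedOrder);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                              .thenComparingDouble(f -> f.x));
        StringBuilder out = new StringBuilder();
        for (Fragment f : sorted) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(f.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Storage order as in the question: paragraphs first, text boxes last.
        List<Fragment> stored = Arrays.asList(
                new Fragment("paragraph 1", 50, 100),
                new Fragment("paragraph 2", 50, 200),
                new Fragment("paragraph 3", 50, 300),
                new Fragment("textbox 1", 50, 150),
                new Fragment("textbox 2", 50, 250));
        // prints "paragraph 1 textbox 1 paragraph 2 textbox 2 paragraph 3"
        System.out.println(inReadingOrder(stored));
    }
}
```

A sequential extractor corresponds to reading `stored` as-is; sorting by position is what recovers the order a reader sees.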


Vipul Kumar
Greenhorn

Joined: Jun 27, 2012
Posts: 4
Ulf Dittmer wrote: Those text extractors work sequentially along the contents of the structured file format; if that order is different than the order in which the text is displayed, then that's the way the text will be extracted. For POI I'd assume that you can use its API to get at content in its proper ordering. Not sure if PDFBox has a similar API for PDFs.

Thanks for your reply. Can you show an example of that with POI? I tried but couldn't find anything useful.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41073
That's not at all going to be easy. You'll need to do a fair amount of trial-and-error to see how to piece the various classes and methods of the API together. The HWPF/XWPF part of POI is not nearly as well documented or as widely used as the HSSF/XSSF part that deals with XLS/XLSX files.

Creating a document: http://www.coderanch.com/how-to/java/CreateWordDocument (so you'd need to do the reverse)

Extremely short API intro: http://poi.apache.org/hwpf/quick-guide.html

Test case source code: http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/testcases/org/apache/poi/hwpf/
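
For .docx files specifically, one possible workaround that sidesteps the XWPF API is to walk the document's XML directly. This is a swapped-in technique, not POI's method: a .docx is a zip archive, the main text conventionally lives in the `word/document.xml` entry, and text box contents (`w:txbxContent`) are nested inside the paragraph they are anchored to, so a depth-first walk meets them near where they sit in the text. The sketch below uses only the JDK; the class and method names are hypothetical, and it assumes that skipping `Fallback` elements is enough to avoid the duplicate text that Word stores for older readers. It does not help with binary .doc files.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class DocxOrderSketch {

    // WordprocessingML main namespace, used by w:p, w:t and w:txbxContent.
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Opens a .docx and extracts text; assumes the main part is word/document.xml. */
    static String fromDocx(String path) throws Exception {
        try (ZipFile zip = new ZipFile(path);
             InputStream in = zip.getInputStream(zip.getEntry("word/document.xml"))) {
            return textInDocumentOrder(in);
        }
    }

    /** Collects all w:t text in XML document order, one space between paragraphs. */
    static String textInDocumentOrder(InputStream documentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(documentXml);
        StringBuilder out = new StringBuilder();
        collect(doc.getDocumentElement(), out);
        return out.toString().trim();
    }

    private static void collect(Node node, StringBuilder out) {
        String local = node.getLocalName();
        // Assumption: Fallback branches only repeat the text box for older readers.
        if ("Fallback".equals(local)) {
            return;
        }
        boolean inW = W.equals(node.getNamespaceURI());
        if (inW && "t".equals(local)) {
            out.append(node.getTextContent());
            return;
        }
        boolean paragraph = inW && "p".equals(local);
        if (paragraph) {
            separate(out);
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, out);
        }
        if (paragraph) {
            separate(out);
        }
    }

    /** Appends a single space unless the buffer is empty or already ends with one. */
    private static void separate(StringBuilder out) {
        if (out.length() > 0 && out.charAt(out.length() - 1) != ' ') {
            out.append(' ');
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<w:document xmlns:w=\"" + W + "\"><w:body>"
                + "<w:p><w:r><w:t>paragraph 1</w:t></w:r>"
                + "<w:r><w:txbxContent><w:p><w:r><w:t>textbox 1</w:t></w:r></w:p>"
                + "</w:txbxContent></w:r></w:p>"
                + "<w:p><w:r><w:t>paragraph 2</w:t></w:r></w:p>"
                + "</w:body></w:document>";
        // prints "paragraph 1 textbox 1 paragraph 2"
        System.out.println(textInDocumentOrder(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))));
    }
}
```

With a real file the call would be `DocxOrderSketch.fromDocx("some.docx")`. The position of a text box in the XML is where it is anchored, which usually but not always matches where it appears on the page.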