
PDFBox throws OutOfMemory error

Konstantinos Vasileiou
Greenhorn

Joined: Jul 20, 2009
Posts: 16
Hi guys,
I am using PDFBox to parse a large number of PDF files sequentially, extract their text, and write it to another document. I have written a method


that accepts as arguments the name of the file and the Writer for the new file, which holds the concatenated text of all the PDFs.

The error I get, when the parsing reaches one specific file, is:





Any ideas what is going wrong?
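Editor's sketch: when one file in a sequential batch fails, it can help to record the failure and carry on, so that the problem file is identified by name and the other files still get processed. The sketch below assumes a hypothetical `TextExtractor` interface standing in for the poster's PDFBox-based method, which is not shown in the thread; `BatchExtractor` and `extractAll` are illustrative names as well.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the poster's method: takes a file name and the
// shared Writer, and appends the extracted text to that Writer.
interface TextExtractor {
    void extract(String fileName, Writer out) throws IOException;
}

class BatchExtractor {
    // Runs the extractor over every file in order and returns the names of
    // the files that failed, instead of aborting on the first failure.
    static List<String> extractAll(List<String> fileNames,
                                   TextExtractor extractor,
                                   Writer out) {
        List<String> failed = new ArrayList<>();
        for (String name : fileNames) {
            try {
                extractor.extract(name, out);
            } catch (IOException | OutOfMemoryError e) {
                // Catching OutOfMemoryError is meant for diagnosis only:
                // it lets the batch report which file triggered the error.
                failed.add(name);
            }
        }
        return failed;
    }
}
```

Text written before a failure stays in the Writer, so a failing file may leave partial output behind. Catching OutOfMemoryError is a diagnostic aid here, not a fix for whatever makes that file expensive to parse.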
Konstantinos Vasileiou
Greenhorn

Joined: Jul 20, 2009
Posts: 16
To add some more details to my previous post:
Increasing the heap size does not help at all: I tried up to 512 MB and got the same error on this particular PDF.
I made a simple prototype application like this:



and the system throws exactly the same error at this damned file.
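Editor's sketch: before concluding that a larger heap does not help, it may be worth confirming that the setting actually reached the JVM running the code. An option such as -Xmx512m only affects the process it is passed to, which may not be the one an IDE or wrapper script launches. The class below uses only the standard `Runtime` API; the names `HeapCheck`, `maxHeapMegabytes` and `describe` are illustrative.

```java
class HeapCheck {
    private static final long MB = 1024L * 1024L;

    // Maximum heap the JVM will attempt to use, in megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / MB;
    }

    // Short summary of the heap limit and the heap currently in use.
    static String describe() {
        Runtime rt = Runtime.getRuntime();
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / MB;
        return "max=" + maxHeapMegabytes() + " MB, used=" + usedMb + " MB";
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Calling `HeapCheck.describe()` at the start of the prototype, and again just before the problem file is parsed, would show whether the limit is the expected one and how much of it is already in use at that point.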
Jarred Olson
Ranch Hand

Joined: Jul 31, 2009
Posts: 37
How big is the PDF?

I've never used PDFBox, but I have used iText (http://www.lowagie.com/iText/), which worked pretty well for me, although I was generating PDFs, not reading them.
Konstantinos Vasileiou
Greenhorn

Joined: Jul 20, 2009
Posts: 16
The PDF is not very big (1.8 MB), and PDFBox works fine with much larger files.
Jarred Olson
Ranch Hand

Joined: Jul 31, 2009
Posts: 37
Again, I've never used PDFBox, so I'm not sure whether you can do this (I know you can with java.io.*), but you might want to try reading the file in line by line to keep your heap usage down.
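Editor's sketch: for reference, the line-by-line approach with java.io looks roughly like this for plain text. It holds only one line in memory at a time. Whether anything similar applies to PDF input is not established in this thread, and `LineCopier` and `copyLines` are illustrative names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

class LineCopier {
    // Copies text from in to out one line at a time and returns the number
    // of lines copied. Each line is written with a trailing '\n'.
    static int copyLines(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        int count = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            out.write(line);
            out.write('\n');
            count++;
        }
        out.flush();
        return count;
    }
}
```

This works because a text stream can be consumed strictly front to back, which is the property the suggestion relies on.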
Konstantinos Vasileiou
Greenhorn

Joined: Jul 20, 2009
Posts: 16
Jarred Olson wrote: Again, I've never used PDFBox, so I'm not sure whether you can do this (I know you can with java.io.*), but you might want to try reading the file in line by line to keep your heap usage down.

I am not sure you can do this; at least, I do not know of such a method in PDFBox. The getText() method extracts all the text at once, and, as far as I can guess from the error message, PDFBox also relies on the structure of the PDF document, so I do not know whether line-by-line parsing, as with a "flat" I/O stream, is even possible.
 