
PDFBox throws OutOfMemory error

Konstantinos Vasileiou
Greenhorn

Joined: Jul 20, 2009
Posts: 16
Hi guys,
I am using PDFBox to sequentially parse a large number of PDF files, extract their text, and write it to another document. I have written a method


that accepts as arguments the name of the file and the Writer for the new file, which holds the concatenated text of all the PDFs.

The error I get, when the parsing reaches one specific file, is:





Any ideas what is going wrong?
Konstantinos Vasileiou
Greenhorn

Joined: Jul 20, 2009
Posts: 16
To add some more details to my previous post:
Increasing the heap size does not help at all: I tried up to 512 MB and got the same error on this particular PDF.
I made a simple prototype application like this:



and the system throws exactly the same error at this damned file.
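
On the heap-size point, one thing that may be worth ruling out is whether the larger heap setting actually reached the JVM that runs the parser. The snippet below is only a minimal sketch, not the prototype from this post: `HeapCheck` and `maxHeapMb` are hypothetical names, and it simply reports the value of the standard `Runtime.maxMemory()` call.

```java
class HeapCheck {
    /** Returns the maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Run with the same -Xmx option as the real application and compare.
        System.out.println("Max heap: " + maxHeapMb() + " MB");
    }
}
```

If the printed value is far below what was passed with -Xmx, the option is probably not being applied to the process that does the parsing (for example, when the program is launched from an IDE or a wrapper script).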
Jarred Olson
Ranch Hand

Joined: Jul 31, 2009
Posts: 37
How big is the PDF?

I've never used PDFBox, but I have used iText (http://www.lowagie.com/iText/) and it worked pretty well for me, though I was generating PDFs, not reading them.
Konstantinos Vasileiou
Greenhorn

Joined: Jul 20, 2009
Posts: 16
The PDF is not very big (1.8 MB), and PDFBox works fine with much larger files.
Jarred Olson
Ranch Hand

Joined: Jul 31, 2009
Posts: 37
Again, I've never used PDFBox so I'm not sure if you can do this or not (I know you can do it with java.io.*) but you might want to try reading it in line by line to try and keep your heap size down.
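
For plain text, the java.io approach mentioned above might look roughly like the sketch below. It is an illustration only: it assumes a flat character stream, so it says nothing about whether PDFBox can work this way, and `LineCopy` and `copyLines` are made-up names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

class LineCopy {
    /**
     * Copies text from in to out one line at a time, so only the current
     * line is held in memory. Returns the number of lines copied.
     */
    static int copyLines(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        int count = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            out.write(line);
            out.write('\n'); // readLine() strips the line terminator
            count++;
        }
        out.flush();
        return count;
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        int n = copyLines(new StringReader("first\nsecond\n"), out);
        System.out.println(n + " lines copied");
    }
}
```

In a real program the Reader would typically be a FileReader and the Writer the shared output file; the in-memory strings here just keep the example self-contained.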
Konstantinos Vasileiou
Greenhorn

Joined: Jul 20, 2009
Posts: 16
Jarred Olson wrote:Again, I've never used PDFBox so I'm not sure if you can do this or not (I know you can do it with java.io.*) but you might want to try reading it in line by line to try and keep your heap size down.

I am not sure you can do this; at least, I do not know of such a method in PDFBox. The getText() method extracts all the text at once, and, judging from the error message, PDFBox also relies on the structure of the PDF document, so I do not know whether line-by-line parsing is possible the way it is with a "flat" I/O stream.
 