PDFBox throws OutOfMemory error

 
Konstantinos Vasileiou
Greenhorn
Posts: 16
Hi guys,
I am using PDFBox to parse a large number of PDF files sequentially, extract their text, and write it to another document. What I have done is create a method


that accepts as arguments the name of the file and the Writer for the new file, which holds the concatenated text of all the PDFs.

The error I get, when the parsing reaches one specific file, is:





Any ideas what is going wrong?
 
Konstantinos Vasileiou
Greenhorn
Posts: 16
To add some more details to my previous post:
Increasing the heap size does not help at all: I tried up to 512 MB and got the same error on this particular PDF.
I made a simple prototype application like this:



and it throws exactly the same error on this damned file.
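
When raising the heap limit changes nothing, it can be worth confirming that the setting actually reached the JVM that runs the parser. The following is a minimal sketch using only the standard `Runtime` API; the class name `HeapCheck` is made up for illustration and is not part of the original prototype.

```java
public class HeapCheck {
    /** Returns the maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Run with e.g. `java -Xmx512m HeapCheck` and compare the printed
        // value with the limit that was requested on the command line.
        System.out.println("Max heap: " + maxHeapMb() + " MB");
    }
}
```

If the printed value is far below the requested limit, the option is probably being passed to a different process (for example an IDE launcher) rather than to the program itself.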
 
Jarred Olson
Ranch Hand
Posts: 37
How big is the PDF?

I've never used PDFBox, but I have used iText (http://www.lowagie.com/iText/), which worked pretty well for me, although I was generating PDFs, not reading them.
 
Konstantinos Vasileiou
Greenhorn
Posts: 16
The PDF is not very big (1.8 MB), and PDFBox works fine with much larger files.
 
Jarred Olson
Ranch Hand
Posts: 37
Again, I've never used PDFBox so I'm not sure if you can do this or not (I know you can do it with java.io.*) but you might want to try reading it in line by line to try and keep your heap size down.
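
For reference, the line-by-line approach with plain java.io could look roughly like the sketch below. It assumes plain-text input; a PDF has to be parsed by a library before any text lines exist, so this illustrates the idea rather than a drop-in fix. The class and method names (`LineCopy`, `copyLines`) are made up for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

public class LineCopy {
    /**
     * Copies text from in to out one line at a time, so only the current
     * line is held as a String. Returns the number of lines copied.
     */
    static int copyLines(Reader in, Writer out) {
        int count = 0;
        try {
            BufferedReader reader = new BufferedReader(in);
            String line;
            while ((line = reader.readLine()) != null) {
                out.write(line);
                out.write('\n'); // readLine() strips the line terminator
                count++;
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        StringWriter out = new StringWriter();
        int n = copyLines(new StringReader("first\nsecond\n"), out);
        System.out.println(n + " lines copied");
    }
}
```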
 
Konstantinos Vasileiou
Greenhorn
Posts: 16
Jarred Olson wrote:Again, I've never used PDFBox so I'm not sure if you can do this or not (I know you can do it with java.io.*) but you might want to try reading it in line by line to try and keep your heap size down.

I am not sure you can do this; at least, I do not know of such a method in PDFBox. The getText() method extracts all the text at once. Judging from the error message, PDFBox also relies on the structure of the PDF document, so I do not know whether line-by-line parsing is even possible the way it is with a "flat" I/O stream.
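
The closest analogue to "line by line" for a PDF is probably extracting a few pages at a time instead of the whole text in one String. The sketch below shows only the chunking loop, written against a hypothetical `PageTextSource` interface so that it runs without any library. With PDFBox, an implementation could wrap `PDFTextStripper`, whose `setStartPage` and `setEndPage` methods restrict `getText` to a page range. This limits the size of the extracted String, but the library still has to load the document, so it may not help if the error occurs during parsing itself.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

public class ChunkedExtract {
    /** Hypothetical abstraction over a parsed document; not a PDFBox type. */
    interface PageTextSource {
        int pageCount();
        /** Text of pages first..last, both 1-based and inclusive. */
        String textOfPages(int first, int last);
    }

    /** Writes the text pagesPerChunk pages at a time; returns the chunk count. */
    static int writeInChunks(PageTextSource source, Writer out, int pagesPerChunk) {
        if (pagesPerChunk < 1) {
            throw new IllegalArgumentException("pagesPerChunk must be >= 1");
        }
        int chunks = 0;
        try {
            int total = source.pageCount();
            for (int first = 1; first <= total; first += pagesPerChunk) {
                int last = Math.min(first + pagesPerChunk - 1, total);
                out.write(source.textOfPages(first, last));
                chunks++;
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Stand-in source with 5 fake pages, used only to exercise the loop.
        PageTextSource fake = new PageTextSource() {
            public int pageCount() { return 5; }
            public String textOfPages(int first, int last) {
                StringBuilder sb = new StringBuilder();
                for (int p = first; p <= last; p++) {
                    sb.append("page ").append(p).append('\n');
                }
                return sb.toString();
            }
        };
        StringWriter out = new StringWriter();
        System.out.println(writeInChunks(fake, out, 2) + " chunks");
        System.out.print(out);
    }
}
```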
 