
Problem with processing data files of size larger than 350 MB

amit bose
Greenhorn

Joined: Apr 01, 2005
Posts: 25
Hi All,

Please find below the details of my query.

Problem: I need to process a huge (350 MB) data file in Java. The data file is basically a concatenation of multiple XML documents.

What I need to do is:
(a) check whether there are any unwanted characters in between the XMLs
(b) if yes, remove those unwanted characters
After the validation stage above, I need to write the file back to disk.

E.g. Input Data file sample (D1)
<?xml version="1.0" encoding="UTF-8"?><books><!-- Books1.xml - some more tags go here --></books>some junk here
<?xml version="1.0" encoding="UTF-8"?><books><!-- Books2.xml - some more tags go here --></books>
<?xml version="1.0" encoding="UTF-8"?><books><!-- Books3.xml - some more tags go here --></books>more junk
<?xml version="1.0" encoding="UTF-8"?><books><!-- Books4.xml - some more tags go here --></books>

(Please note that in the actual input data file all of the above appears on a single line; I have added line breaks here for readability.)


E.g. Output Data file sample (D2)
<?xml version="1.0" encoding="UTF-8"?><books><!-- Books1.xml - some more tags go here --></books>
<?xml version="1.0" encoding="UTF-8"?><books><!-- Books2.xml - some more tags go here --></books>
<?xml version="1.0" encoding="UTF-8"?><books><!-- Books3.xml - some more tags go here --></books>
<?xml version="1.0" encoding="UTF-8"?><books><!-- Books4.xml - some more tags go here --></books>

(The text 'some junk here' and 'more junk' have been removed in D2)


Earlier Solution:

I have shared my code below:
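In essence it read the whole file line by line into a single StringBuffer before validating it. A minimal sketch of that approach (the name sbfContent matches the later replies; the file name and the rest are illustrative, not my exact code):

import java.io.BufferedReader;
import java.io.FileReader;

public class ReadWholeFile {
    public static void main(String[] args) throws Exception {
        // Read the entire file line by line and accumulate it in memory.
        BufferedReader reader = new BufferedReader(new FileReader("input.dat")); // file name is illustrative
        StringBuffer sbfContent = new StringBuffer();
        String line;
        while ((line = reader.readLine()) != null) {
            sbfContent.append(line); // the whole 350 MB ends up in this one buffer
        }
        reader.close();
        // ... validation then builds result / sbfValidatedContent from sbfContent ...
    }
}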


The above code throws an OutOfMemoryError for files larger than about 100 MB, and it happens while I am trying to read the file.

The next thing that I tried was using a buffer to read the file rather than reading it line by line:
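A minimal sketch of that second attempt (the buffer size and file name are illustrative, not the exact original):

import java.io.FileReader;

public class ReadWithBuffer {
    public static void main(String[] args) throws Exception {
        FileReader reader = new FileReader("input.dat"); // file name is illustrative
        char[] buffer = new char[8192];
        StringBuffer sbfContent = new StringBuffer();
        int read;
        while ((read = reader.read(buffer)) != -1) {
            // Reads in fixed-size chunks, but still accumulates the whole file in memory.
            sbfContent.append(buffer, 0, read);
        }
        reader.close();
    }
}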



The above code worked fine as long as the data file was 200 MB or less. However, now I have a 350 MB data file and it keeps throwing the same OutOfMemoryError.
Increasing the buffer size does not sound like a good option.

Please let me know if you have any pointers for this problem.


Thanks,
Amit
Somnath Mallick
Ranch Hand

Joined: Mar 04, 2009
Posts: 477
I think, since you are getting an OutOfMemoryError, it would help to increase your JVM heap size.

William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12760
    
It looks to me like there is only one pass through the file.

Why don't you write chunks of valid data as they are accumulated?
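A sketch of that one-pass idea, assuming (from the samples above) that each document runs from <?xml to </books> and that anything between </books> and the next <?xml is junk; the file names, buffer size, and class name are illustrative:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;

public class OnePassCleaner {
    private static final String END_TAG = "</books>"; // end of one document (assumed from the samples)
    private static final String START_TAG = "<?xml";  // start of the next document

    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader("input.dat"));   // names are illustrative
        BufferedWriter out = new BufferedWriter(new FileWriter("output.dat"));
        StringBuilder pending = new StringBuilder();
        char[] buf = new char[64 * 1024];
        boolean copying = true; // true inside a document, false while skipping junk
        int n;
        while ((n = in.read(buf)) != -1) {
            pending.append(buf, 0, n);
            boolean progress = true;
            while (progress) {
                progress = false;
                if (copying) {
                    int end = pending.indexOf(END_TAG);
                    if (end >= 0) {
                        int cut = end + END_TAG.length();
                        out.write(pending.substring(0, cut)); // write one complete document
                        out.newLine();                        // optional separator, mirrors sample D2
                        pending.delete(0, cut);
                        copying = false;
                        progress = true;
                    } else if (pending.length() > END_TAG.length()) {
                        // Flush all but a short tail that might contain a split "</books>".
                        int flushTo = pending.length() - (END_TAG.length() - 1);
                        out.write(pending.substring(0, flushTo));
                        pending.delete(0, flushTo);
                    }
                } else {
                    int start = pending.indexOf(START_TAG);
                    if (start >= 0) {
                        pending.delete(0, start); // drop the junk before the next document
                        copying = true;
                        progress = true;
                    } else if (pending.length() > START_TAG.length()) {
                        // Drop junk, keeping a short tail that might contain a split "<?xml".
                        pending.delete(0, pending.length() - (START_TAG.length() - 1));
                    }
                }
            }
        }
        if (copying) {
            out.write(pending.toString()); // anything left if the file ends mid-document
        }
        in.close();
        out.close();
    }
}

Nothing bigger than one read buffer plus a short carry-over is ever held in memory, so the size of the file no longer matters.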

Bill
Carey Brown
Ranch Hand

Joined: Nov 19, 2001
Posts: 173

Your best bet is to process the XML in a serial fashion; that way you have no memory problems. You could use either the StAX or SAX libraries for this.
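For reference, a minimal StAX read loop over a single well-formed document looks roughly like this (javax.xml.stream has shipped with the JDK since Java 6; the file name is illustrative):

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxReadSketch {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // The reader pulls one event at a time, so memory use stays small.
        XMLStreamReader reader =
                factory.createXMLStreamReader(new FileInputStream("books1.xml"));
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                System.out.println("start element: " + reader.getLocalName());
            }
        }
        reader.close();
    }
}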


Secondly, you are keeping four copies of the data in memory: sbfContent (twice), result, and sbfValidatedContent.


sbfContent should be emptied before trying to append to it again.


result and sbfValidatedContent should be released before trying to re-read the file.

amit bose
Greenhorn

Joined: Apr 01, 2005
Posts: 25
Somnath Mallick wrote: I think, since you are getting an OutOfMemoryError, it would help to increase your JVM heap size.



Thanks for the pointer Somnath.

However, I am already using a large heap size as below:
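For example, something along these lines (the value and class name here are illustrative, not my exact command):

java -Xmx1024m MyFileProcessor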

amit bose
Greenhorn

Joined: Apr 01, 2005
Posts: 25
William Brogden wrote: It looks to me like there is only one pass through the file.

Why don't you write chunks of valid data as they are accumulated?

Bill


Thanks for the pointer Bill.

Actually, I wanted to write chunks of valid data as they are accumulated, but first I need to read the input data file, which is where the code fails. Also, the input is not a single XML file that could be processed easily, but rather multiple XMLs concatenated together.
amit bose
Greenhorn

Joined: Apr 01, 2005
Posts: 25
Carey Brown wrote: Your best bet is to process the XML in a serial fashion; that way you have no memory problems. You could use either the StAX or SAX libraries for this.


Secondly, you are keeping four copies of the data in memory: sbfContent (twice), result, and sbfValidatedContent.


sbfContent should be emptied before trying to append to it again.


result and sbfValidatedContent should be released before trying to re-read the file.




Thanks for the pointer Carey.

I was going through the webpage, but it seems the StAX API lets you stream XML data. As my input is not a single XML file but rather multiple XMLs concatenated together, I am not sure whether I can use it. Please correct me if I am wrong.

Also, regarding the duplication of data in memory: I will remove the duplication, but the code fails before it ever reaches the duplicated content (i.e. sbfValidatedContent etc.).
Somnath Mallick
Ranch Hand

Joined: Mar 04, 2009
Posts: 477
Since you say that the code is failing at the reading part, I think sbfContent is becoming too big for the JVM to handle! Could you debug the code and tell us exactly where (which line) it is failing?
 