XML - CSV ... Performance related question.

Johnny Augustus
Greenhorn

Joined: Oct 07, 2006
Posts: 18
Hi,

I have a requirement to read a huge XML file (100+ MB), parse it with StAX record by record (the record boundaries are defined by the business logic) into an intermediate data structure (a HashMap), and then write the contents of that structure to a file.

What would be the optimum solution to this?

Should I use an array of HashMaps as the intermediate structure, with one thread parsing a record from the XML and putting it into the structure, and another thread reading from the structure and writing to the file? The problem is that the method that performs this conversion can return only after all of the data in the XML has been written to the file. This method is invoked from a web application, and I cannot run it in the background for the time being.

Further, should I use memory mapped files (java.nio) when reading the XML file and writing to the output file?

Is there a way I can monitor the memory usage before invoking this method, midway through the method and at the end of the method?

Thanks
.J.
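On the memory-monitoring question, a minimal sketch using only java.lang.Runtime might look like the following. The class name MemoryProbe and the byte array standing in for the conversion are hypothetical, and the numbers are only rough, because the garbage collector can run at any point between readings.

```java
class MemoryProbe {

    /** Heap currently in use, in bytes: allocated heap minus the free space inside it. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long before = usedHeapBytes();

        // Hypothetical stand-in for the XML-to-CSV conversion.
        byte[] work = new byte[8 * 1024 * 1024];
        long midway = usedHeapBytes();

        work = null; // make the buffer eligible for collection
        long after = usedHeapBytes();

        System.out.printf("before=%d KB, midway=%d KB, after=%d KB, max=%d KB%n",
                before / 1024, midway / 1024, after / 1024,
                Runtime.getRuntime().maxMemory() / 1024);
    }
}
```

The "after" reading may not drop until a collection actually happens, so readings like these are better treated as a trend than as an exact benchmark.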
Johnny Augustus
Greenhorn

Joined: Oct 07, 2006
Posts: 18
Anyone? I am waiting for some of your views on this before I can proceed.
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 16483
    

I would say, if you have a situation where you can write data to some kind of structure and simultaneously read from that structure and write data to a file, you should dispense with the intermediate structure and write directly to the file. The intermediate structure just uses more memory and the processes accessing it just use more processor time. Not to mention more programming complexity.

And I don't see the point of monitoring memory usage. Yeah, you can do that, but what good would it do?

The whole question smells of premature optimization. Write the simplest possible code to start with. Then improve the things that need improving, whatever they are.
Johnny Augustus
Greenhorn

Joined: Oct 07, 2006
Posts: 18
Hi Paul,

The intermediate structure is necessary because there is no clear way that I can sequentially extract the necessary information from the XML and write it to the file. Our business requirement warrants the use of such a data holder.

Memory is an important constraint here because we are not working with high-end servers. Besides, I need the benchmark to keep some higher-ups satisfied.

For now, I am going ahead with the producer-consumer approach. I will keep track of this thread to see if anyone has any better suggestions.

Many thanks
.J.
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12267
    
How much XML input do you have to parse before you can write a line of CSV?

So far, the Producer Thread / Consumer Thread pattern looks fine to me. Is there anything to be gained by making the intermediate structure a custom Java object?

Bill


Java Resources at www.wbrogden.com
Johnny Augustus
Greenhorn

Joined: Oct 07, 2006
Posts: 18
Here's an example of the conversion that needs to happen:

XML


needs to be converted into...

CSV
countries_country_id;countries_country_name;countries_country_states_state_id;countries_country_states_state_name
1;India;1;Maharashtra
1;India;2;Karnataka
2;USA;3;Nevada

Notice the repetition of country id and name in the second line of the CSV.
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12267
    
So basically, because your hierarchy is shallow, all you have to do is hold on to the current country id, country name, state id and state name, and write a CSV line every time you hit </state>, that is, the end-element event for "state". There is no need for any complicated intermediate object here.
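A rough sketch of that idea with the standard StAX cursor API (javax.xml.stream) follows. The XML sample did not survive in this thread, so the element layout used here (countries > country > id, name, states > state > id, name, with the country's id and name appearing before its states) is an assumption inferred from the CSV header, and the class name CountryCsvConverter is hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class CountryCsvConverter {

    static final String HEADER = "countries_country_id;countries_country_name;"
            + "countries_country_states_state_id;countries_country_states_state_name";

    /** Streams the XML once and writes one CSV line per closing state element. */
    static void convert(Reader xml, Writer csv) throws XMLStreamException, IOException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        String countryId = "", countryName = "", stateId = "", stateName = "";
        boolean inState = false;                 // are we inside a <state> element?
        StringBuilder text = new StringBuilder(); // text of the element being read
        csv.write(HEADER + "\n");
        try {
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        text.setLength(0);
                        if ("state".equals(reader.getLocalName())) inState = true;
                        break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        text.append(reader.getText()); // text may arrive in several chunks
                        break;
                    case XMLStreamConstants.END_ELEMENT: {
                        String name = reader.getLocalName();
                        String value = text.toString().trim();
                        if ("id".equals(name)) {
                            if (inState) stateId = value; else countryId = value;
                        } else if ("name".equals(name)) {
                            if (inState) stateName = value; else countryName = value;
                        } else if ("state".equals(name)) {
                            // No escaping here: assumes values never contain ';' or newlines.
                            csv.write(countryId + ";" + countryName + ";"
                                    + stateId + ";" + stateName + "\n");
                            inState = false;
                        }
                        text.setLength(0);
                        break;
                    }
                    default:
                        break;
                }
            }
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumed input layout, made up to match the CSV in the earlier post.
        String xml = "<countries>"
                + "<country><id>1</id><name>India</name><states>"
                + "<state><id>1</id><name>Maharashtra</name></state>"
                + "<state><id>2</id><name>Karnataka</name></state>"
                + "</states></country>"
                + "<country><id>2</id><name>USA</name><states>"
                + "<state><id>3</id><name>Nevada</name></state>"
                + "</states></country>"
                + "</countries>";
        StringWriter out = new StringWriter();
        convert(new StringReader(xml), out);
        System.out.print(out);
    }
}
```

Only four strings and a small text buffer are held at any time, so memory use should stay flat regardless of the file size. For the real 100+ MB file, the Reader and Writer would presumably be buffered file streams rather than in-memory strings.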

You might hand off each CSV line to a queue for a file writing thread so your XML parser can continue full speed.

Bill
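The queue hand-off could be sketched with java.util.concurrent.BlockingQueue as below. The names QueuedCsvWriter, startWriter, finish and the sentinel value are hypothetical, and the queue capacity of 1000 is an arbitrary choice; a bounded queue is used so that a slow disk cannot let pending lines pile up in memory.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class QueuedCsvWriter {

    // Sentinel that tells the writer thread to stop; compared by identity, not by content.
    private static final String END_OF_INPUT = new String("EOF");

    /** Starts a thread that drains the queue into the writer until the sentinel arrives. */
    static Thread startWriter(BlockingQueue<String> queue, Writer out) {
        Thread writer = new Thread(() -> {
            try {
                while (true) {
                    String line = queue.take();      // blocks while the queue is empty
                    if (line == END_OF_INPUT) break;
                    out.write(line);
                    out.write('\n');
                }
                out.flush();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "csv-writer");
        writer.start();
        return writer;
    }

    /** Signals the end of input and waits until every queued line has been written. */
    static void finish(BlockingQueue<String> queue, Thread writer) throws InterruptedException {
        queue.put(END_OF_INPUT);
        writer.join();
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);
        StringWriter out = new StringWriter(); // stand-in for a buffered file writer
        Thread writer = startWriter(queue, out);

        // The parser thread would call put() once per completed CSV line.
        queue.put("1;India;1;Maharashtra");
        queue.put("1;India;2;Karnataka");
        queue.put("2;USA;3;Nevada");

        finish(queue, writer);
        System.out.print(out);
    }
}
```

Because finish() joins the writer thread, the calling method still returns only after everything has been written, which matches the constraint mentioned in the first post. One caveat of this sketch: if the writer thread dies on an I/O error, a producer blocked in put() on a full queue would wait forever, so real code would need some way to report that failure back to the parser.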
 
 