
need to break a huge xml into smaller groups one by one without loading the whole xml

Tanveer Rameez
Ranch Hand

Joined: Dec 11, 2000
Posts: 158
Hi,
I have an XML file with the following lines:

Now this XML data is very large, i.e. there are hundreds (maybe thousands) of hotel elements, i.e. <hotel>....</hotel>. I want to process the hotel elements, but loading the whole XML data (for all the hotels) would require a huge amount of memory. So what I want to do is: extract only one <hotel> element, i.e. all data between <hotel>..</hotel>, and send it to the output stream. If the output stream is full, I will wait until it becomes empty before sending more data to it. After that I obtain the next <hotel> element, and so on. This avoids loading all the <hotel> elements into memory. The output stream will be piped to the input of another class, which will process the data for a hotel. So the other class gets the data for one hotel on its input stream.
I could find 3 ways of doing this:
1. DOM: but I cannot use the Java DOM API because it loads the entire XML data into memory.
2. SAX: if I use the Java SAX API, I have to recreate the entire <hotel>..</hotel> myself. Plus I want to control when I receive the events. The SAX API fires the events on its own schedule as it parses, not mine. Note that I have to split the XML data and pass it to a stream, so it would be good if I could get the <hotel>..</hotel> data without much processing.
3. Use XML pull parsing like MXP (http://www.extreme.indiana.edu/xgws/xsoap/xpp/mxp1/index.html)
This API allows me to control when I want to get the next event:

So whenever I encounter a start tag <hotel>, I put all the data following that tag into the output stream until I encounter the end tag </hotel>. Then I send the output stream to another class and wait until the output stream is empty before doing the process again for the next hotel. But the problem is that I have to recreate the entire XML data between <hotel> and </hotel> before sending it to the output stream.
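The pull-style loop described above could be sketched roughly as follows. This is only a minimal illustration, not the poster's code: it swaps in the JDK's built-in StAX API (javax.xml.stream) instead of MXP, and the class name HotelSplitter and the method nextHotel are made up for the example. It assumes the hotel elements use no namespace prefixes declared on an ancestor element. An XMLEventWriter re-serializes the events, so the fragment does not have to be rebuilt by hand, and only one hotel is held in memory at a time.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: hands out one <hotel>...</hotel> fragment per call.
class HotelSplitter {
    private final XMLEventReader reader;
    private final XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();

    HotelSplitter(Reader in) throws XMLStreamException {
        reader = XMLInputFactory.newInstance().createXMLEventReader(in);
    }

    /** Returns the next hotel fragment as text, or null when the input is exhausted. */
    String nextHotel() throws XMLStreamException {
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()
                    && "hotel".equals(event.asStartElement().getName().getLocalPart())) {
                StringWriter buffer = new StringWriter();
                XMLEventWriter writer = outputFactory.createXMLEventWriter(buffer);
                writer.add(event);
                int depth = 1; // nesting depth relative to the opening <hotel>
                while (depth > 0) {
                    XMLEvent inner = reader.nextEvent();
                    if (inner.isStartElement()) {
                        depth++;
                    } else if (inner.isEndElement()) {
                        depth--;
                    }
                    writer.add(inner); // copy the event through unchanged
                }
                writer.close();
                return buffer.toString();
            }
        }
        return null;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<hotels><hotel><name>A</name></hotel>"
                + "<hotel><name>B</name></hotel></hotels>";
        HotelSplitter splitter = new HotelSplitter(new StringReader(xml));
        String fragment;
        while ((fragment = splitter.nextHotel()) != null) {
            System.out.println(fragment); // the caller decides when to pull the next one
        }
    }
}
```

Because the caller invokes nextHotel() only when it is ready, the parser never runs ahead of the consumer, which is the control that SAX callbacks do not give.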
I know I may sound confusing, but I am not an expert in XML and the Java XML APIs. In short, I have an input stream of huge XML data, and I want to break it up into smaller pieces (based on a tag) and pass them to the output stream.
I want to do this step by step: obtain the first piece and send it, wait until the output stream is empty, and then obtain the next piece and continue. Extracting the pieces one by one avoids loading the whole data into memory.
Please help! If you know any way other than the 3 ways I wrote above, please suggest it.
Thanks in advance
Tanveer


Author of JPhotoBrush Pro (www.jphotobrushpro.com)
Lasse Koskela
author
Sheriff

Joined: Jan 23, 2002
Posts: 11962
Would it be acceptable to have the output stream block the parsing thread?


Author of Test Driven (2007) and Effective Unit Testing (2013) [Blog] [HowToAskQuestionsOnJavaRanch]
Tanveer Rameez
Ranch Hand

Joined: Dec 11, 2000
Posts: 158
Originally posted by Lasse Koskela:
Would it be acceptable to have the output stream block the parsing thread?


Well, if that can be done, YES.
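One way the blocking hand-off discussed here might look is with java.io.PipedOutputStream and PipedInputStream: a write blocks the producing (parsing) thread whenever the pipe's buffer is full and resumes once the consumer has read some data. The sketch below is an assumption-laden illustration, not something proposed in the thread. The class name PipedHandoff, the method transfer, the 64-byte buffer size, and the use of a newline as the fragment delimiter (which assumes fragments contain no line breaks) are all invented for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo: a producer thread writes fragments, the caller's thread consumes them.
class PipedHandoff {
    static List<String> transfer(List<String> fragments) throws Exception {
        PipedOutputStream out = new PipedOutputStream();
        // Deliberately small buffer so the producer has to wait for the consumer.
        PipedInputStream in = new PipedInputStream(out, 64);

        Thread producer = new Thread(() -> {
            try (PipedOutputStream o = out) {
                for (String fragment : fragments) {
                    // write() blocks here while the pipe buffer is full
                    o.write((fragment + "\n").getBytes(StandardCharsets.UTF_8));
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        producer.start();

        List<String> received = new ArrayList<>();
        try (BufferedReader consumer = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = consumer.readLine()) != null) { // null once the producer closes the pipe
                received.add(line);
            }
        }
        producer.join();
        return received;
    }

    public static void main(String[] args) throws Exception {
        List<String> hotels = Arrays.asList(
                "<hotel><name>A</name></hotel>",
                "<hotel><name>B</name></hotel>");
        for (String hotel : transfer(hotels)) {
            System.out.println(hotel);
        }
    }
}
```

In a real splitter the producer loop would be fed by the parser instead of a prepared list, so at most one buffer's worth of data sits between the parsing thread and the processing class.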
 