Chunking an overly large xml file?

Shane Burgel
Ranch Hand

Joined: Sep 09, 2003
Posts: 47
I have a very large XML file (around 300,000 KB) that I need to parse. The size can vary and could be larger.

I need to look for a certain start tag and then its corresponding close tag, and then process that chunk before moving on to the next one.

My first thought was that I could use Smooks' splitting and routing, but I have been unable to find a way to make it work. You can split, but the only options for routing seem to be file, JMS, or database. I really just want to route the chunk to a class/method so that I can check whether the chunk qualifies and then decide what to do with it.

I have also experimented with Readers and FileChannels, but there doesn't seem to be an easy and fast way to accomplish this task.

Any ideas? I'm not overly familiar with IO as I don't often code it, so I'm hoping that I'm just overlooking an obvious solution.

Paul Clapham

Joined: Oct 14, 2005
Posts: 19973

I don't know anything about Smooks but when I looked at their home page it actually said "JMS, File, Database etc". So perhaps the "etc" part would cover your requirement?

I don't see how Readers and Channels would help at all. You need to parse the XML so you need an XML parser, which is at a higher level than the low-level choice of file access methods. So you can't do anything until you choose your parser.
Greg Charles

Joined: Oct 01, 2001
Posts: 2969

I don't know anything about Smooks either, but it seems to me that what you want to use is a SAX parser. Unlike DOM parsers, which force you to load an entire XML file into memory before working with it, a SAX parser lets you parse as you go, if you get my meaning. You don't have to do anything clever to split the file into chunks. Just use a normal buffered reader to read a block, stream it through the SAX parser, and then go on to the next block.
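To illustrate the "parse as you go" idea, here is a minimal sketch using the JDK's built-in SAX support. The class name SaxCountDemo and the element name "record" are made up for the example. Note that in this sketch the parser pulls data from the stream on its own, so the calling code does not read blocks by hand.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of any library.
class SaxCountDemo {

    /** Counts start tags with the given name while the parser streams through the input. */
    static int countElements(InputStream in, String elementName) throws Exception {
        int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // The default factory is not namespace-aware, so qName holds the tag name.
                if (elementName.equals(qName)) {
                    count[0]++;
                }
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // The parser reads from the stream itself and calls the handler as it goes.
        parser.parse(in, handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // "record" is an assumed element name used only for this example.
        String xml = "<data><record>a</record><other/><record>b</record></data>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(countElements(in, "record")); // prints 2
    }
}
```

For a real file you would pass something like a BufferedInputStream wrapped around a FileInputStream instead of the in-memory stream used here.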
Jimmy Clark
Ranch Hand

Joined: Apr 16, 2008
Posts: 2187
Both Simple API for XML (SAX) and Document Object Model (DOM) are programming APIs that are used to write XML-data processing applications. Neither one of them is an actual XML parser.

SAX enables you, the programmer, to write code that receives data, i.e. method calls, from a parser. SAX is a low-level API as it communicates directly with a SAX-compliant XML parser. You write the application based on the SAX API.

DOM is a higher-level API that builds an object model based on an XML instance and uses the SAX API internally. You, the programmer, then write your application based on the DOM API, not the SAX API.

Apache Xerces is one of the most widely used XML parsers, and a version of it has been bundled with Java SE as the default parser for some time.

With regard to chunking and writing a Java-based application to do this, you would certainly need to write to the SAX API. If you are planning to pass this XML fragment to a method, you need to make sure that each chunk is small enough that you don't exceed the heap available to your JVM.
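As a rough sketch of that idea, the handler below buffers the text found between a chunk's start and end tags, hands it to a callback, and aborts if a single chunk grows past a limit. Everything here is an assumption made for illustration: the class name SaxChunker, the "item" tag, the character limit, and the choice to keep only text content rather than the full markup. It also assumes chunk elements are not nested inside each other.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler; names and limits are illustrative only.
class SaxChunker extends DefaultHandler {
    private final String chunkTag;       // element that delimits one chunk
    private final int maxChars;          // assumed safety limit per chunk
    private final Consumer<String> sink; // receives the text of each finished chunk
    private StringBuilder buffer;        // non-null only while inside a chunk

    SaxChunker(String chunkTag, int maxChars, Consumer<String> sink) {
        this.chunkTag = chunkTag;
        this.maxChars = maxChars;
        this.sink = sink;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (chunkTag.equals(qName)) {
            buffer = new StringBuilder(); // start a fresh chunk
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (buffer == null) {
            return; // text outside any chunk is ignored
        }
        if (buffer.length() + length > maxChars) {
            throw new SAXException("Chunk exceeds " + maxChars + " characters");
        }
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (buffer != null && chunkTag.equals(qName)) {
            sink.accept(buffer.toString()); // hand the chunk to the caller's method
            buffer = null;                  // release it before the next chunk
        }
    }

    /** Streams the input and passes each chunk's text to the sink. */
    static void chunk(InputStream in, String tag, int maxChars, Consumer<String> sink)
            throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(in, new SaxChunker(tag, maxChars, sink));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<data><item><name>a</name><qty>1</qty></item>"
                   + "<item><name>b</name><qty>2</qty></item></data>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        chunk(in, "item", 10_000, text -> System.out.println("chunk: " + text));
    }
}
```

Only one chunk is held in memory at a time, so the memory needed depends on the largest chunk rather than on the size of the whole file. The callback is where you would check whether the chunk qualifies and decide what to do with it.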

A good alternative for this would be writing the chunking code in Perl and then reading the chunks with a SAX or DOM application.
Rob Spoor

Joined: Oct 27, 2005
Posts: 20279

There are (at least) two more alternatives to SAXParser and DocumentBuilder: XMLEventReader and XMLStreamReader.
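Both of those belong to the StAX API (javax.xml.stream), where your code pulls events from the parser instead of receiving callbacks. Below is a minimal sketch with XMLStreamReader. The class name StaxChunkDemo and the "item" tag are made up, and the sketch assumes each chunk contains only flat child elements with text in them.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class; assumes flat, text-only children inside each chunk.
class StaxChunkDemo {

    /** Reads one chunk as child-name -> text; the reader must be on the chunk's start tag. */
    static Map<String, String> readChunk(XMLStreamReader r) throws Exception {
        Map<String, String> fields = new LinkedHashMap<>();
        // nextTag() moves to the next child start tag, or to the chunk's own end tag.
        while (r.nextTag() == XMLStreamConstants.START_ELEMENT) {
            String name = r.getLocalName();
            // getElementText() reads the text and leaves the reader on the child's end tag.
            fields.put(name, r.getElementText());
        }
        return fields;
    }

    /** Pulls events from the input and passes each chunk to the sink, one at a time. */
    static void forEachChunk(Reader in, String chunkTag, Consumer<Map<String, String>> sink)
            throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && chunkTag.equals(r.getLocalName())) {
                    sink.accept(readChunk(r));
                }
            }
        } finally {
            r.close();
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<data><item><name>a</name><qty>1</qty></item>"
                   + "<item><name>b</name><qty>2</qty></item></data>";
        forEachChunk(new StringReader(xml), "item", chunk -> System.out.println(chunk));
    }
}
```

Because the code drives the loop, it is easy to stop early or skip ahead once a chunk has been dealt with. XMLEventReader works in a similar way but returns event objects rather than exposing a cursor.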
