I have a very large XML file (around 300,000 KB) that I need to parse. The size can vary and could be larger.
I need to look for a certain start tag, and then its corresponding close tag, and then process that chunk before moving on to the next one.
My first thought was that I could use Smooks' splitting and routing, but I have been unable to find a way to make it work. You can split, but the only options for routing seem to be file, JMS, or database. I really just want to route the chunk to a class/method so that I can check whether the chunk qualifies and then decide what to do with it.
I have also experimented with Readers and FileChannels, but there doesn't seem to be an easy and fast way to accomplish this task.
Any ideas? I'm not overly familiar with IO as I don't often code it, so I'm hoping that I'm just overlooking an obvious solution.
I don't know anything about Smooks but when I looked at their home page it actually said "JMS, File, Database etc". So perhaps the "etc" part would cover your requirement?
I don't see how Readers and Channels would help at all. You need to parse the XML so you need an XML parser, which is at a higher level than the low-level choice of file access methods. So you can't do anything until you choose your parser.
I don't know anything about Smooks either, but it seems to me that what you want to use is a SAX parser. Unlike DOM parsers, which force you to load an entire XML file into memory before working with it, a SAX parser lets you parse as you go, if you get my meaning. You don't have to do anything clever to split the file into chunks. Just open the file with a normal buffered stream or reader and hand it to the SAX parser; it reads the input a block at a time and calls your handler as it goes.
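To make "parse as you go" concrete, here is a minimal sketch using the JAXP SAX parser that ships with Java SE. The class name SaxCountDemo, the method countElements, and the sample element names are made up for illustration; for a real file you would pass a FileInputStream instead of the in-memory stream.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxCountDemo {
    // Counts elements with the given name while streaming.
    // The document is never held in memory as a tree.
    static int countElements(InputStream in, String elementName) throws Exception {
        int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                // The default factory is not namespace-aware, so compare qName.
                if (qName.equals(elementName)) {
                    count[0]++;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document, small enough to inline.
        String xml = "<orders><order id=\"1\"/><order id=\"2\"/><note/></orders>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(countElements(in, "order"));
    }
}
```

The parser pulls bytes from the stream itself, so memory use stays roughly constant no matter how big the file is.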
Both Simple API for XML (SAX) and Document Object Model (DOM) are programming APIs that are used to write XML-data processing applications. Neither one of them is an actual XML parser.
SAX enables you, the programmer, to write code that receives data, i.e. method calls, from a parser. SAX is a low-level API as it communicates directly with a SAX-compliant XML parser. You write the application based on the SAX API.
DOM is a higher-level API that builds an in-memory object model of an XML instance; many implementations build that model using the SAX API internally. You, the programmer, then write your application based on the DOM API, not the SAX API.
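For comparison, here is a small sketch of what writing against the DOM API looks like once you have a chunk that is small enough to hold in memory. The class DomChunkDemo and the "qualifies" rule (root element has status="open") are invented for the example; your real check would be whatever your data needs.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DomChunkDemo {
    // Hypothetical rule: a chunk qualifies if its root element has status="open".
    static boolean qualifies(String chunkXml) throws Exception {
        // The whole chunk is parsed into a tree before we can inspect it.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(chunkXml)));
        // getAttribute returns an empty string when the attribute is absent.
        return "open".equals(doc.getDocumentElement().getAttribute("status"));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(qualifies("<record status=\"open\"><id>7</id></record>"));
        System.out.println(qualifies("<record status=\"closed\"/>"));
    }
}
```

This is convenient for a single chunk, but doing the same thing to the entire 300,000 KB file would build the whole tree in memory at once.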
Apache Xerces is one of the most widely used XML parsers, and a Xerces-derived implementation has shipped with Java SE for some time.
With regard to chunking and writing a Java-based application to do this, you would certainly need to write to the SAX API. If you are planning to pass each XML fragment to a method, you need to make sure that each chunk is small enough that you don't exceed the memory of your JRE instance.
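One way this chunking might look, as a rough sketch rather than a finished solution: a SAX handler that starts buffering when it sees the target start tag, rebuilds the fragment as text, and hands it to a callback at the matching end tag. The names ChunkingHandler, process, and the "record" element are assumptions for the example. Comments and processing instructions are dropped and namespaces are not handled.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class ChunkingHandler extends DefaultHandler {
    private final String target;          // element name that delimits a chunk
    private final Consumer<String> sink;  // receives each completed chunk
    private final StringBuilder buf = new StringBuilder();
    private int depth = 0;                // > 0 while inside a target element

    ChunkingHandler(String target, Consumer<String> sink) {
        this.target = target;
        this.sink = sink;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth == 0 && !qName.equals(target)) {
            return; // outside any chunk, ignore
        }
        depth++;
        buf.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            buf.append(' ').append(atts.getQName(i))
               .append("=\"").append(escape(atts.getValue(i))).append('"');
        }
        buf.append('>');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth == 0) {
            return;
        }
        buf.append("</").append(qName).append('>');
        if (--depth == 0) {
            // Matching close tag reached: hand over the chunk and free the buffer.
            sink.accept(buf.toString());
            buf.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            buf.append(escape(new String(ch, start, length)));
        }
    }

    // Re-escape the characters the parser has already decoded.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    static void process(InputStream in, String target, Consumer<String> sink) throws Exception {
        SAXParserFactory.newInstance().newSAXParser().parse(in, new ChunkingHandler(target, sink));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample; use a FileInputStream for the real file.
        String xml = "<root><skip/><record id=\"1\"><name>A &amp; B</name></record>"
                   + "<record id=\"2\"/></root>";
        process(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), "record",
                chunk -> System.out.println("chunk: " + chunk));
    }
}
```

Only one chunk is buffered at a time, so memory is bounded by the largest single chunk rather than by the file size. The callback is where you would decide whether the chunk qualifies and what to do with it.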
A good alternative would be to write the chunking code in Perl and then read the chunks with a SAX or DOM application.