GeeCON Prague 2014*

Which XML API and Parser for Streaming?

Alasdair Jones
Greenhorn

Joined: Mar 04, 2008
Posts: 6
Could anyone recommend an API and Parser for reading streaming XML?

I have to process a series of fairly small XML messages which I receive via a socket over a receive-only network connection.

My app needs to be able to cope with messages coming pretty fast over the connection and I CAN'T afford to miss any of them. It will also have to cope with incomplete message segments, and segments that span multiple messages. I MUST also be able to process a complete message as soon as it arrives.

I have used the nice and easy JDOM with Xerces in the past, but I am not convinced that it will work here as I can't see anything that says if JDOM can cope with incomplete documents from the input stream, and I have read that Xerces will try to close the socket if data is delayed, and may also prematurely close the socket which I need to keep open. StAX and SAX sound like interesting options... Any recommendations or experience would be very welcome.

Thanks
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18570
    

I would advise you to abstract out those "segments" and don't get the XML parsing mixed up with that. Provide the parser with an InputStream which contains a single XML document.

As for specific parsers, choose SAX or StAX depending on whether you want the parser to crawl through the document, telling you where it is (SAX), or whether you want to crawl through the document, telling it where you think you are (StAX).
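A minimal sketch of that difference, using only the parsers bundled with the JDK. The class name ParserStyles and the sample document are made up for illustration. Both methods collect the element names of the same document: with SAX the parser calls the handler, with StAX the caller asks for each event.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo class: lists element names via SAX (push) and StAX (pull). */
final class ParserStyles {

    /** SAX: the parser drives and calls back into our handler. */
    static List<String> withSax(String xml) throws Exception {
        final List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                names.add(qName);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return names;
    }

    /** StAX: we drive and pull the next event when we are ready for it. */
    static List<String> withStax(String xml) throws Exception {
        List<String> names = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<order><item/><item/></order>";
        System.out.println(withSax(xml));
        System.out.println(withStax(xml));
    }
}
```

Either way the parser still expects exactly one well-formed document per input.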
Alasdair Jones
Greenhorn

Joined: Mar 04, 2008
Posts: 6
Thanks, I would if I could. Unfortunately I have no control over the data I will receive and as I said, it will come in segments. Therefore I will have to parse the data before I can determine the message boundaries.

It does sound like I need to use SAX or StAX, although I still don't know if they can cope with incomplete documents, or segments spanning documents. BTW these are APIs not parsers, so my question of which parser to use stands.

Ideas?
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12792
    
Which parser? I know of no reason to look any further than the standard Java library, version 1.6. You are going to find good tutorials and help and if you have to distribute the app, less worry about special JAR files.

You are certainly not in JDOM territory here.

Like Paul said, abstract away the various sources of XML fragments and hand the parser an InputStream. For one thing this will be SOOOOO much easier to test - you can create a set of test case documents and test the parsing side without ever getting tangled up in message connections.

On the other side you can test grabbing messages and creating a stream without ever getting hung up in XML parsing issues.

Bill
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18570
    

Originally posted by Alasdair Jones:
Thanks, I would if I could. Unfortunately I have no control over the data I will receive and as I said, it will come in segments. Therefore I will have to parse the data before I can determine the message boundaries.
Well, you're going to have to work on that. All XML parsers are designed to parse exactly one XML document, no more, no less. So you have a stream of XML documents coming in, chopped up into random chunks? If I understand your word "segment" then that's what you have. I can't say I'm impressed by that design but it is what it is.

I'd say that ByteArrayInputStream and SequenceInputStream could be useful tools. If you have the possibility that a segment could contain part of two different documents then maybe PushbackInputStream too. But I would say this part of the problem is what you have to work on. Choosing an XML parser is just trivia.
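As a rough sketch of the first two tools: the segments of one document, held as byte arrays in arrival order, can be presented to the parser as a single InputStream. The class name SegmentJoiner and the sample segments are hypothetical, and the sketch assumes the segments together form exactly one document. It does not use PushbackInputStream or deal with a segment that carries the tail of one document and the head of the next.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: joins the segments of ONE document and parses the result. */
final class SegmentJoiner {

    /** Presents the segments, in arrival order, as one continuous stream. */
    static InputStream join(List<byte[]> segments) {
        List<InputStream> parts = new ArrayList<>();
        for (byte[] segment : segments) {
            parts.add(new ByteArrayInputStream(segment));
        }
        return new SequenceInputStream(Collections.enumeration(parts));
    }

    /** Parses the joined stream and returns all character data it contains. */
    static String textOf(List<byte[]> segments) throws Exception {
        StringBuilder text = new StringBuilder();
        try (InputStream in = join(segments)) {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                }
            }
            reader.close();
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Segment boundaries deliberately fall in the middle of tags and text.
        List<byte[]> segments = Arrays.asList(
                "<msg><bo".getBytes(StandardCharsets.UTF_8),
                "dy>hel".getBytes(StandardCharsets.UTF_8),
                "lo</body></msg>".getBytes(StandardCharsets.UTF_8));
        System.out.println(textOf(segments));
    }
}
```

The parser never sees the segment boundaries, so the XML side can be tested with plain byte arrays and no socket at all.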
Alasdair Jones
Greenhorn

Joined: Mar 04, 2008
Posts: 6
Right, so if I understand correctly, it seems as though I'll have to do some low-level 'parsing' of my own to repackage the XML segments into whole documents, and only then can I send to an XML parser. In which case it really doesn't matter which API I use (streaming or DOM) because I'll have to wait until I've got the whole message before parsing anyway! Thanks for all the replies.
Nitesh Kant
Bartender

Joined: Feb 25, 2007
Posts: 1638

Alasdair: It also will have to cope with incomplete message segments, and segments that span multiple messages.


I am not sure what you mean by the above. Are you referring to a chunked HTTP request kind of protocol?
Is it possible that different messages will arrive on the same connection, in separate parts and out of order? If so, aggregating the message parts will be a nightmare. If only one message comes over one connection, but in several chunks (parts), then it makes more sense. In that case you can look at the PipedInputStream and PipedOutputStream pair. You can use them for a producer/consumer exchange between threads: give the PipedInputStream to the XML parser and fill the PipedOutputStream from the thread that reads data from the socket. I am not sure how parsers handle a long delay in read() on the InputStream, but you can manage that from your side: if you close the PipedOutputStream, the connected PipedInputStream reports end of stream (read() returns -1) once its buffer is drained, so the parser stops waiting and fails on the incomplete document. So you can decide on a timeout between two message chunks and close the PipedOutputStream after that time.
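A minimal sketch of that producer/consumer arrangement, assuming a single document arrives in chunks. The class name PipedParsing is hypothetical, and an array of strings stands in for the data read from the socket. The producer thread writes chunks into the pipe while the calling thread parses from the other end.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical demo: one thread feeds chunks into a pipe, another parses them. */
final class PipedParsing {

    static List<String> parseChunks(final String[] chunks) throws Exception {
        final PipedOutputStream out = new PipedOutputStream();
        PipedInputStream in = new PipedInputStream(out);

        // Producer: stands in for the thread that reads from the socket.
        Thread producer = new Thread(() -> {
            try (PipedOutputStream o = out) {
                for (String chunk : chunks) {
                    o.write(chunk.getBytes(StandardCharsets.UTF_8));
                    o.flush();
                }
            } catch (IOException e) {
                // The reading side was closed early; nothing more to deliver.
            }
            // Closing the output (via try-with-resources) signals end of stream.
        });
        producer.start();

        // Consumer: the parser blocks in read() until more bytes arrive.
        List<String> names = new ArrayList<>();
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
        } finally {
            reader.close();
            in.close();
            producer.join();
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String[] chunks = {"<msg><he", "ad/><body>x</bo", "dy></msg>"};
        System.out.println(parseChunks(chunks));
    }
}
```

A timeout could be layered on top by closing the PipedOutputStream from a timer when no chunk has arrived for too long. That part is not shown here.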


William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12792
    
I'll have to do some low-level 'parsing' of my own to repackage the XML segments into whole documents, and only then can I send to an XML parser.


NO - not whole documents if you need to combine the data. You just want compatible XML that can be spliced together to make the parser THINK it is looking at a single document - the parsing you have to do could be as simple as removing the XML declaration and providing your own root element. Parsing in the StAX or SAX style can proceed indefinitely as long as you keep shoving valid fragments of XML text into the pipeline.
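One way this splicing could look, as a sketch under several assumptions: each message is already available as a separate string, the text is UTF-8, no message carries a DOCTYPE, and the wrapper element name "stream", the "id" attribute and the class name FragmentSplicer are invented for the example. A finite list is used here, although the same idea applies to a stream that keeps growing.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: wraps several messages in one synthetic root element. */
final class FragmentSplicer {

    // Matches a leading XML declaration such as <?xml version="1.0"?>.
    private static final Pattern XML_DECL =
            Pattern.compile("^\\s*<\\?xml\\s[^>]*\\?>");

    static String stripDeclaration(String document) {
        return XML_DECL.matcher(document).replaceFirst("");
    }

    private static InputStream bytes(String s) {
        return new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8));
    }

    /** Builds <stream> + message bodies + </stream> as a single stream. */
    static InputStream splice(List<String> documents) {
        List<InputStream> parts = new ArrayList<>();
        parts.add(bytes("<stream>"));
        for (String document : documents) {
            parts.add(bytes(stripDeclaration(document)));
        }
        parts.add(bytes("</stream>"));
        return new SequenceInputStream(Collections.enumeration(parts));
    }

    /** Returns the id attribute of every message, i.e. every child of the root. */
    static List<String> messageIds(List<String> documents) throws Exception {
        List<String> ids = new ArrayList<>();
        int depth = 0;
        try (InputStream in = splice(documents)) {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(in);
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    if (depth == 2) { // depth 1 is the synthetic root
                        ids.add(reader.getAttributeValue(null, "id"));
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    depth--;
                }
            }
            reader.close();
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(messageIds(Arrays.asList(
                "<?xml version=\"1.0\"?><msg id=\"1\"/>",
                "<?xml version=\"1.0\"?>\n<msg id=\"2\"><a/></msg>")));
    }
}
```

Dropping the declaration also drops any encoding it named, so this only holds if every message uses the encoding the parser will assume.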

I once took this pass at the problem of combining XML fragments while providing for tracking the real location of errors back to the responsible fragment.

Bill
Alasdair Jones
Greenhorn

Joined: Mar 04, 2008
Posts: 6
OK, 1st just to clarify what my project is:

Single socket connection between 2 applications. This will be opened at initialisation and will have to remain open for the duration of the data exchange.

The destination app which I am writing is the socket server.

The source app is the socket client. I have no control over how this sends messages, and can only receive data. Each message will be sent separately but the source app can't guarantee that these will be sent in a continuous stream and may be split into several segments, although these will be in the correct order. Also, the messages will be sent with no separator/header so the stream I receive could contain multiple messages and/or message segments.

Just to get going I've built a proto with a test socket client which sends a series of messages as a continuous stream. Using JDOM it unsurprisingly throws a parse exception when it reaches the start of the new message "<?xml..." not expecting another in what it believes is the same document. It did however, cope with parsing the data in segments. And of course, this way, I will not be able to get at the data that has been passed...

I'm going to try with SAX/StAX now and then the SequenceInputStream/ByteArrayInputStream...
Nitesh Kant
Bartender

Joined: Feb 25, 2007
Posts: 1638

Originally posted by Alasdair Jones:
Using JDOM it unsurprisingly throws a parse exception when it reaches the start of the new message "<?xml..." not expecting another in what it believes is the same document.

Yeah, that will be the case. So, mostly you have to sniff the data coming in on the socket to establish the message boundaries. (I am not sure how you will do that, though!)
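One possible way to sniff the boundaries, sketched under strong assumptions: every message starts with an XML declaration written as "<?xml " (as the exception above suggests), that text never occurs inside a message body, and the bytes have already been decoded to characters. The class MessageSplitter and its methods are hypothetical. It buffers incoming segments and cuts the buffer at each new declaration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: cuts a character stream into documents at each "<?xml ". */
final class MessageSplitter {

    private static final String DECL = "<?xml ";
    private final StringBuilder buffer = new StringBuilder();

    /**
     * Appends one received segment and returns every document that is now
     * known to be complete, i.e. followed by the start of another declaration.
     */
    List<String> feed(String segment) {
        buffer.append(segment);
        List<String> complete = new ArrayList<>();
        int next;
        // Search from index 1 so the declaration at the head of the buffer is skipped.
        while ((next = buffer.indexOf(DECL, 1)) > 0) {
            complete.add(buffer.substring(0, next));
            buffer.delete(0, next);
        }
        return complete;
    }

    /** Returns whatever is still buffered, for example when the socket closes. */
    String drain() {
        String rest = buffer.toString();
        buffer.setLength(0);
        return rest;
    }

    public static void main(String[] args) {
        MessageSplitter splitter = new MessageSplitter();
        // The second declaration is split across two segments.
        System.out.println(splitter.feed("<?xml version=\"1.0\"?><a>1</a><?xml ver"));
        System.out.println(splitter.feed("sion=\"1.0\"?><b/>"));
        System.out.println(splitter.drain());
    }
}
```

The limitation is that a message is only released when the next one begins, so the last message waits in the buffer. Releasing a message the moment its root element closes would need the splitter to track element nesting as well, which this sketch does not attempt.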
Alasdair Jones
Greenhorn

Joined: Mar 04, 2008
Posts: 6
Do you think I will get the same parsing error if I send multiple documents to a pure SAX or StAX parser as I did with JDOM?
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18570
    

Yes. (JDOM isn't a parser as such, it uses SAX to do its parsing.)
Alasdair Jones
Greenhorn

Joined: Mar 04, 2008
Posts: 6
If anyone is interested here is my code to retrieve whole XML docs from the input stream. I realise it's probably not the most efficient method, but I didn't have much luck with StreamTokenizer, and regular expressions:

 