StAX: cursor-based parsing

surlac surlacovich
Ranch Hand

Joined: Mar 12, 2013
Posts: 296

Hello Folks!
I'm trying to use StAX and have a couple of questions.
What are the benefits of such stream-based parsing, and what are the use cases? All I know so far is that if the XML is large I should prefer SAX, and if it is small, DOM.
Is this a technology that allows me, for example, to read XML from a Socket's InputStream and start parsing as it arrives (without waiting for it to download 100%)?
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12761
    
    5
Advantage of stream based - low memory requirement, can start with incomplete stream from - say - a socket.

Disadvantages: tricky to program anything past the simplest data grabber - if you need to manipulate a complex hierarchy use DOM.

DOM - advantages - very useful API can manipulate complex hierarchy.

DOM disadvantages - needs more memory - have to parse the entire document before you can work with it.

Bill
Mark Beardsley
Ranch Hand

Joined: Jun 07, 2013
Posts: 32
    
    1
It's also worth noting that StAX and SAX work slightly differently; StAX is often referred to as a 'pull' parser whilst SAX is a 'push' parser. What this means is quite simple really: with StAX you request the next element, while SAX tells you that it already has an element in hand. To put it differently, StAX tells you what it is about to read from the file whilst SAX tells you what it has read. This leads on to one other advantage of StAX, in that it is possible both to read and to write the xml markup using streams (read elements from one stream and write them to a second one) and thus to reduce the memory footprint. I have used this technique to modify OOXML spreadsheet files, which are often so large they cause performance issues with DOM-based parsers or APIs like POI.
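To make the 'pull' idea concrete, here is a minimal sketch of cursor-based reading with XMLStreamReader. The class name, the method name and the "title" element are made up for illustration; the InputStream could just as well come from a socket.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class StaxCursorSketch {

    // Collects the text of every <title> element ("title" is a hypothetical name).
    // Assumes <title> holds text only, which getElementText() requires.
    static List<String> readTitles(InputStream in) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        List<String> titles = new ArrayList<>();
        try {
            // The application drives the loop: it asks for the next event when it is ready.
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "title".equals(reader.getLocalName())) {
                    titles.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<books><book><title>A</title></book><book><title>B</title></book></books>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(readTitles(in));
    }
}
```

With SAX the same job would be done inside handler callbacks that the parser invokes; here the calling code decides when to move the cursor.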
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541
    
    8

And following on from that, it's not difficult to generate XML using StAX. That's serialization and not parsing, but it's still worth mentioning. It is possible to do serialization by generating SAX events, but it's rather cumbersome and not nearly as straightforward as the way StAX does it.
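As a rough illustration of generating XML with StAX, the sketch below uses XMLStreamWriter. The class, method, element and attribute names are invented for the example.

```java
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;
import java.io.StringWriter;

class StaxWriterSketch {

    // Writes a tiny <book id="..."><title>...</title></book> document.
    // The element and attribute names are hypothetical.
    static String writeBook(String id, String title) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        writer.writeStartDocument();
        writer.writeStartElement("book");
        writer.writeAttribute("id", id);
        writer.writeStartElement("title");
        writer.writeCharacters(title);   // special characters such as & are escaped
        writer.writeEndElement();        // closes <title>
        writer.writeEndElement();        // closes <book>
        writer.writeEndDocument();
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(writeBook("7", "Tom & Jerry"));
    }
}
```

Each call maps directly onto one piece of markup, which is what makes this more straightforward than firing SAX events at a handler.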
g tsuji
Ranch Hand

Joined: Jan 18, 2011
Posts: 499
    
    3
An article may help more here, since it has room for more careful wording and for developing the various points:
http://docs.oracle.com/cd/E17802_01/webservices/webservices/docs/1.6/tutorial/doc/SJSXP2.html
surlac surlacovich
Ranch Hand

Joined: Mar 12, 2013
Posts: 296

Thanks guys!
Mark, that was a very clear answer. But if you were using SAX for your task (two streams: read from the first, write to the second), you would need to download the whole thing and save it to the HDD before you could parse it, right?
The difference from DOM is that, after you have saved it to the HDD, you don't need to load the whole thing into RAM to write to the output stream.

Paul Clapham wrote:It is possible to do serialization by generating SAX events, but it's rather cumbersome and not nearly as straightforward as the way StAX does it.

Yes, I've found the StAX way of writing XML very convenient (no need to create Transformer handlers and factories to map to an OutputStream; instead you use javax.xml.stream.XMLStreamWriter), but I've also found many similarities between the StAX and SAX ways of writing XML.
For now my view is that StAX is a slightly more complex tool but has less overhead (memory, CPU time), so it makes sense to learn how to use it and stick to StAX for almost every task.
Mark Beardsley
Ranch Hand

Joined: Jun 07, 2013
Posts: 32
    
    1
No, you should be able to parse the xml markup from the stream as you are reading it. The only example I have experience with would be working with Excel workbooks stored on a server. Connecting a stream to one of them allowed me to parse the xml markup using StAX without any need to effectively copy the file onto my local hard drive. If I wanted to update the file, as was often the case, the process looked like this;

Open a stream onto the source file so that I could parse the xml.
Open another stream onto a local copy so that I can save the modified result.
Read from the source file and save elements to the local copy until I get to the point where the modification needs to be made.
Add/change the necessary elements and write to the local copy.
Read any and all remaining elements from the source file and add to the local copy.
Replace the source file with the modified/updated local copy.
Delete the local copy.

Of course, if all you wish to do is to read the contents of the file of xml markup then the process is much simpler and you should not need to make a local copy before parsing it.
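The read-modify-write steps above can be sketched with the StAX event API. This is only an illustration under stated assumptions, not the code used for the Excel workbooks: the class and method names are made up, the target element is assumed to contain text only, and plain streams stand in for the source file and the local copy.

```java
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;
import java.io.InputStream;
import java.io.OutputStream;

class StaxCopySketch {

    // Copies src to dst event by event, replacing the text of every element
    // called elementName with newText. Only one event is held at a time.
    static void copyReplacingText(InputStream src, OutputStream dst,
                                  String elementName, String newText)
            throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(src);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(dst, "UTF-8");
        XMLEventFactory events = XMLEventFactory.newInstance();
        boolean insideTarget = false;
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()
                    && event.asStartElement().getName().getLocalPart().equals(elementName)) {
                insideTarget = true;
                writer.add(event);                              // keep the start tag
                writer.add(events.createCharacters(newText));   // write the new content
            } else if (event.isEndElement()
                    && event.asEndElement().getName().getLocalPart().equals(elementName)) {
                insideTarget = false;
                writer.add(event);                              // keep the end tag
            } else if (!insideTarget) {
                writer.add(event);                              // everything else is copied as-is
            }
            // Events inside the target element (the old text) are skipped.
        }
        writer.flush();
        writer.close();
        reader.close();
    }
}
```

Replacing the source file with the finished copy and deleting the copy are ordinary file operations and are left out of the sketch.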
surlac surlacovich
Ranch Hand

Joined: Mar 12, 2013
Posts: 296

Thanks Mark, very nice use case of the StAX.
So do you agree that StAX is self-sufficient, and that if one knows how to use it well, he/she doesn't even need to know DOM and SAX? Sometimes it's just easier to use, for example, DOM, but if you know StAX you can easily swap DOM for StAX and use StAX for the task.
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12761
    
    5
surlac surlacovich wrote:
So do you agree that StAX is self-sufficient, and that if one knows how to use it well, he/she doesn't even need to know DOM and SAX? Sometimes it's just easier to use, for example, DOM, but if you know StAX you can easily swap DOM for StAX and use StAX for the task.


Not really true. Any manipulation that involves more than one Element, such as changing the hierarchy or manipulating data from early in the document depending on later elements of the document, would be outrageously difficult with only StAX. Those DOM manipulation methods are so powerful.
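A small sketch of the kind of job meant here: an element near the top of the document is updated from elements that come after it. With DOM the whole tree is in memory, so document order does not matter. The element names (summary, item), the attribute name and the class name are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import java.io.StringReader;

class DomLookBackSketch {

    // Sets summary/@total from the <item> values that appear later in the document.
    // Assumes one <summary> element and numeric <item> content.
    static Document fillTotal(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element summary = (Element) doc.getElementsByTagName("summary").item(0);
        NodeList items = doc.getElementsByTagName("item");
        int total = 0;
        for (int i = 0; i < items.getLength(); i++) {
            total += Integer.parseInt(items.item(i).getTextContent().trim());
        }
        summary.setAttribute("total", String.valueOf(total));
        return doc;
    }
}
```

A streaming parser has already passed the summary element by the time it sees the items, so the same change needs either buffering or a second pass.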

Bill
surlac surlacovich
Ranch Hand

Joined: Mar 12, 2013
Posts: 296

Thanks a lot, William!
So the algorithm should be like:

William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12761
    
    5
Nope: The algorithm is:

1. Do I just need to pull some data items or change individual data items without messing with the XML hierarchy - StAX is fine.
2. Do I need to do serious hierarchy related data manipulation - XML is not huge - DOM rules.
3. I need to do serious hierarchy related data manipulation - XML is huge - time for some serious thinking about how to simplify the job with multiple passes or getting really really deep in custom StAX programming.

Bill
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541
    
    8

I frequently find myself with an XML document from which I need to extract a small amount of data. When this happens, XPath is often a useful tool to describe and locate the nodes I need. And since DOM supports XPath naturally and StAX doesn't, I use DOM. It's all about the time to create a working program, rather than any need to conserve memory.
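For illustration, a minimal sketch of the DOM plus XPath approach using the standard javax.xml.xpath API. The class name, the method name and the sample expression in the note below are assumptions made for the example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class XPathOverDomSketch {

    // Parses the XML into a DOM and returns the text of every node
    // matched by the XPath expression.
    static List<String> select(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }
}
```

Something like select(xml, "/books/book[@year > 2010]/title") then replaces what would be a hand-written state machine in StAX or SAX.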
surlac surlacovich
Ranch Hand

Joined: Mar 12, 2013
Posts: 296

William Brogden wrote:
3. I need to do serious hierarchy related data manipulation - XML is huge - time for some serious thinking about how to simplify the job with multiple passes or getting really really deep in custom StAX programming.

Thanks Bill. I'm just not sure what you mean about multiple passes, could you please tell a little bit more about it (link to an article will work too)?

Paul, thanks for your input. So I've found out that XPath is part of the XSL family of specifications, and that it navigates the nodes back and forth, so pull/push algorithms don't make sense for XPath. I believe the same functions provided by XPath are available with StAX/SAX but will take far more lines of code.
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12761
    
    5
surlac surlacovich wrote:
Thanks Bill. I'm just not sure what you mean about multiple passes, could you please tell a little bit more about it (link to an article will work too)?


I suspect that so far in your exploration of XML you have only seen simple documents. XML documents can get really really weird, especially if they have evolved over considerable time.

I had to work with a client whose mock exam input XML had multiple types of data all related to authoring and presentation of a single certification exam simulator. This document even included chunks of CDATA which were in fact valid XML documents, sigh....

The fact that so many kinds of data could be kept in a single document is one of the strengths of XML - but may require a bit of programming.

In this case, in order to create .PDF formatted sets of questions I had to take the DOM of the big document and extract selected bits to make a temporary XML document suitable for turning into PDF.

That's what I mean about multiple passes.

Bill
surlac surlacovich
Ranch Hand

Joined: Mar 12, 2013
Posts: 296

Sounds like everywhere in programming - no silver bullet, right tool for the right task.
surlac surlacovich
Ranch Hand

Joined: Mar 12, 2013
Posts: 296

Can a memory-mapped file be employed to make sure every edit of XML via DOM is guaranteed to be saved to the HDD?
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12761
    
    5
surlac surlacovich wrote:Can a memory-mapped file be employed to make sure every edit of XML via DOM is guaranteed to be saved to the HDD?


No - just think about it for a minute.

The text length of XML elements gets changed by almost every operation. The entire DOM must be serialized to either rewrite over the file or write a new one.
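A minimal sketch of what that means in practice: after a DOM edit, the whole tree is serialized again, here with an identity Transformer. The class and method names and the "title" element are made up for the example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import java.io.StringReader;
import java.io.StringWriter;

class DomSerializeSketch {

    // Changes the text of the first <title> element and writes the whole
    // document out again. Everything after the edit shifts position in the
    // output, which is why a single edit cannot simply be patched in place.
    static String editAndSerialize(String xml, String newTitle) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        doc.getElementsByTagName("title").item(0).setTextContent(newTitle);

        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

Writing to a file instead of a string only means passing a File or OutputStream to StreamResult.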

Bill
surlac surlacovich
Ranch Hand

Joined: Mar 12, 2013
Posts: 296

William Brogden wrote:
The text length of XML elements gets changed by almost every operation.

Does even searching for an element involve DOM modification?
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12761
    
    5
surlac surlacovich wrote:
William Brogden wrote:
The text length of XML elements gets changed by almost every operation.

Does even searching for an element involve DOM modification?


No, but your question used the words "make sure every edit"
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541
    
    8

Text editors (e.g. Notepad, MS Word and so on) don't save every edit to disk immediately anyway, and nobody seems to mind that. So requiring an XML editor to save every edit to disk wouldn't really be reasonable.
surlac surlacovich
Ranch Hand

Joined: Mar 12, 2013
Posts: 296

Thank you very much, Folks!
 