I'm looking for a recommendation for the best (i.e., fastest-performing) way to do some simple XML parsing and updating. I imagine the choices include XPath, JAXP, SAX, etc. I will have a Java String object which simply contains XML similar to the following:
I will need to access the values of the title, author and isbn tags, process these values, and then write any changed values back to the XML string. I may also need to add additional tags (e.g., <publicationdate>) to the XML string.
Instead of me just grabbing a methodology such as XPath and running with it, I'm looking for a recommendation based on what would perform the fastest.
Thanks for the response. There will be many XML strings contained within a collection, all of which will need to be processed. This process will be part of a Web Service request, so performance is important.
Of course if performance is only slightly faster with one methodology, but level of implementation is more difficult, I'm willing to keep my options open.
Jordan Borowski wrote: Of course if performance is only slightly faster with one methodology, but level of implementation is more difficult, I'm willing to keep my options open.
This is a much more practical point of view.
I would suggest starting by implementing something which isn't too difficult to do. Then consider its performance. If what you have done falls outside the performance requirements you have set, then it's time to do some profiling.
Joined: Apr 18, 2005
Thanks Paul. As far as level of difficulty for my relatively simple use case, would you recommend XPath?
Jordan Borowski wrote: Is XPath still the best library to utilize?
Compared to what?
It seems to me that newcomers to XML spend a lot of time trying to find the "best thing" before they know anything about the subject. And there's no answer to that question. Go off and spend a week or two messing about with XML and try out various things.
Okay, will do, Paul. It's been a number of years, maybe 8, since I've had to parse XML. So I was just looking for the XML library which would allow for the easiest implementation based on my requirements.
XPath is just a tool. Asking whether you should use XPath is like asking whether you should use a screwdriver to build a house.
In particular, XPath isn't an XML parser. It's something you would use on the result of some kinds of XML parser. Instead of fishing in the fog, you might be better off spending some time reading a book about the topic. Here's a pretty good one: Processing XML with Java. It's slightly outdated but it isn't missing anything you need to know as an entry-level programmer in the field.
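To make the parser-versus-XPath distinction concrete, here is a minimal sketch using only the JAXP classes that ship with the JDK. The class name BookXmlUpdater, the /book/title layout and the sample values are assumptions for illustration, since the XML from the original question is not shown. The parser builds a DOM tree from the string, XPath then queries that tree, and a Transformer writes the updated tree back to a string.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper; assumes a <book> root with a <title> child.
class BookXmlUpdater {

    // Trims the title text, appends <publicationdate>, and returns the new XML string.
    static String update(String xml, String publicationDate) {
        try {
            // Step 1: the parser (not XPath) turns the string into a DOM tree.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Step 2: XPath runs against the tree the parser produced.
            XPath xpath = XPathFactory.newInstance().newXPath();
            Element title = (Element) xpath.evaluate("/book/title", doc, XPathConstants.NODE);
            if (title != null) {
                title.setTextContent(title.getTextContent().trim());
            }

            // Step 3: add a new element under the root.
            Element date = doc.createElement("publicationdate");
            date.setTextContent(publicationDate);
            doc.getDocumentElement().appendChild(date);

            // Step 4: serialize the tree back to a string.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample data.
        String xml = "<book><title> Dune </title><author>Frank Herbert</author>"
                + "<isbn>0441013597</isbn></book>";
        System.out.println(update(xml, "1965"));
    }
}
```

This is the easy-to-write route rather than the fastest one: it builds the whole tree in memory, which is usually fine for small documents and worth profiling before replacing.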
Thanks Paul for your responsiveness. I was really just trying to get a jump on best practice before I start my deep dive investigation, which I'll get working on now. I'll take a look at what you've provided.
A few years ago I was working on implementing a robust feed parser. Anyone who has done real-world feed parsing knows that following the specs is one thing, but being able to parse the mess of illegal characters and misused tags out there, even on some of the biggest sites, is another.
I went from using SAX to XPath and eventually to pull parsing, which was the fastest way to parse (faster than SAX in the exhaustive testing that the Sun team did for the Pull Parser RI).
Working on the parser month after month, I kept adding more and more abstractions to the parsing as I recognized the recurring pieces of logic that could be pulled out. It eventually resulted in me creating a brand new parser: SJXP (Apache 2 license)
SJXP combines the ease of XPath (minus dynamic expressions) with the raw speed and low overhead of XML Pull Parsing. To speak to the raw performance, you can check out the benchmarks (they're included directly in the bundle if you want to try them yourself).
This is not a re-implementation of a core parsing library, but rather a VERY thin abstraction layer on top of one of the fastest XML parsers out there: XPP. For those that don't know, XPP is the backing implementation for XML parsing on the Android platform, so SJXP works on Android out of the box with no extra dependencies, and on any other platform you just need to include the one XPP JAR.
Usage of SJXP is all based around defining "paths" pointing at the elements or attributes that you want parsed, then giving those rules to the parser along with callbacks. The parser will call your code every time a rule is matched, giving you an opportunity to do something with the information.
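For readers who haven't seen pull parsing, here is a minimal sketch. It uses StAX (javax.xml.stream), the pull API bundled with the JDK, rather than XPP itself, so it shows the style and is not a benchmark of anything. The class name PullTitles and the choice of "title" as the target element are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class: collects the text of every <title> element.
class PullTitles {

    static List<String> titles(String xml) {
        List<String> found = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            // The caller drives the loop and "pulls" the next event,
            // instead of the parser pushing callbacks as SAX does.
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "title".equals(reader.getLocalName())) {
                    // getElementText reads up to the matching end tag.
                    found.add(reader.getElementText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML", e);
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up sample data.
        System.out.println(titles("<books><book><title>Dune</title></book></books>"));
    }
}
```

No tree is built, so memory use stays roughly constant regardless of document size. The trade-off is that the caller has to track where it is in the document.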
For example, if I wanted the title out of the snippet example you gave, I would define a rule like:
then give that rule to an XMLParser instance:
Now when you use the parser instance to parse content like that from any input source, your rule will get called for every title element. The actual character data (title of the book) will be contained in "text", and userObject is an optional reference to a user object passed through from the parser if you want to use it. For example, this might be a database DAO or some other storage class to hold the parsed value (even a List<String> would be fine if that is all you wanted).
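To illustrate the rule-plus-callback idea without relying on the SJXP classes themselves, here is a rough sketch built on the JDK's StAX pull parser. PathRuleParser, on and parse are hypothetical names and this is not the SJXP API. It only shows how a path can be tracked during a pull parse and matched against registered rules, with a user object passed through to the callback.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration of path rules over a pull parser (not SJXP itself).
class PathRuleParser<T> {
    private final Map<String, BiConsumer<String, T>> rules = new HashMap<>();

    // Registers a callback for the character data of the element at the given path.
    PathRuleParser<T> on(String path, BiConsumer<String, T> callback) {
        rules.put(path, callback);
        return this;
    }

    void parse(String xml, T userObject) {
        StringBuilder path = new StringBuilder(); // current location, e.g. /library/book/title
        StringBuilder text = new StringBuilder(); // text seen since the last tag
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        path.append('/').append(reader.getLocalName());
                        text.setLength(0);
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        text.append(reader.getText());
                        break;
                    case XMLStreamConstants.END_ELEMENT: {
                        BiConsumer<String, T> callback = rules.get(path.toString());
                        if (callback != null) {
                            callback.accept(text.toString(), userObject);
                        }
                        // Step back up to the parent element's path.
                        path.setLength(path.lastIndexOf("/"));
                        text.setLength(0);
                        break;
                    }
                    default:
                        break;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML", e);
        }
    }

    public static void main(String[] args) {
        List<String> titles = new ArrayList<>(); // plays the role of the user object
        new PathRuleParser<List<String>>()
                .on("/library/book/title", (text, list) -> list.add(text))
                .parse("<library><book><title>Dune</title></book></library>", titles);
        System.out.println(titles);
    }
}
```

The sketch only handles leaf elements containing plain text and ignores attributes, which a real rule type for attributes would need to cover.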
All this is similar if you want an attribute value, just change the type of the rule and the overridden default method (handleParsedAttribute).
One of the biggest boons I think SJXP adds is how easy it is to support parsing elements and attributes that are qualified by a namespace. SJXP does this with [] notation. More specifically: assume you were parsing an RSS feed like TechCrunch (http://techcrunch.com/feed/) and you wanted the author name out of EACH post.
If you open up the feed, you see that the author name is stored in the <item> elements in a <dc:creator> subelement. If you are familiar with prefixed elements, you know that "dc" must be a prefix defined somewhere up in the root element of the feed.
So you scroll up and look at the root element and see that "dc" is the prefix for the Dublin Core specification of tags defined by the URI "http://purl.org/dc/elements/1.1/".
So, given that, the rule you would define in SJXP to parse out the author information would look like this:
You'll notice that you just include the full URI in []-notation before the element name to fully qualify it. The same even goes for prefixed attributes!
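As a rough illustration of why qualifying by URI rather than by prefix is convenient, here is a sketch on top of the JDK's StAX parser. It assumes a notation in which the namespace URI is written in square brackets before the local name. QualifiedPathDemo and collect are hypothetical names, not SJXP classes, and the feed content is made-up sample data.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: matches paths whose segments may carry a [namespace-URI] qualifier.
class QualifiedPathDemo {

    static List<String> collect(String xml, String wantedPath) {
        List<String> hits = new ArrayList<>();
        StringBuilder path = new StringBuilder();
        StringBuilder text = new StringBuilder();
        // Path length before each open element, so we can step back up on END_ELEMENT
        // (the URI itself contains '/', so searching for the last '/' would not work).
        Deque<Integer> parentLengths = new ArrayDeque<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    parentLengths.push(path.length());
                    path.append('/');
                    String uri = reader.getNamespaceURI();
                    if (uri != null && !uri.isEmpty()) {
                        path.append('[').append(uri).append(']');
                    }
                    path.append(reader.getLocalName());
                    text.setLength(0);
                } else if (event == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (path.toString().equals(wantedPath)) {
                        hits.add(text.toString());
                    }
                    path.setLength(parentLengths.pop());
                    text.setLength(0);
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML", e);
        }
        return hits;
    }

    public static void main(String[] args) {
        String feed = "<rss xmlns:dc='http://purl.org/dc/elements/1.1/'><channel>"
                + "<item><dc:creator>Alice</dc:creator></item></channel></rss>";
        System.out.println(collect(feed,
                "/rss/channel/item/[http://purl.org/dc/elements/1.1/]creator"));
    }
}
```

Because the match is on the URI, the rule keeps working if a feed binds the same namespace to a prefix other than "dc".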
I've really tried to make the API as simple as possible and think it might offer you what you want in a much easier-to-use package with some of the fastest runtime and lowest-memory overhead performance out there.
If anyone is interested, you can read through the closed bugs on GitHub to get an idea of the optimizations that keep the overhead of SJXP on top of XPP to something like 1-2k of memory. It is really tiny because of things like using the hashcodes of the paths to match them instead of String comparisons, and caching the locations in the doc as it's parsed so the hashcodes aren't recalculated all the time. Since XML is a structured language, this was a huge win.
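One way such hash-based matching could work is sketched below. This is not taken from the SJXP source; PathHasher, enter and leave are hypothetical names. The idea is that the hash of the current location is cached per depth and extended incrementally when an element opens, so a rule lookup becomes an int comparison against a precomputed path hash and no path String is rebuilt.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of incremental path hashing (not SJXP's actual code).
class PathHasher {
    private final Deque<Integer> parents = new ArrayDeque<>(); // cached hashes of ancestor paths
    private int current = 0; // hash of the empty path, same as "".hashCode()

    // Called on a start tag: extends the cached parent hash with "/" + name.
    // Uses the same h = 31*h + c recurrence as String.hashCode(), so the result
    // equals (parentPath + "/" + name).hashCode() without building that String.
    int enter(String name) {
        parents.push(current);
        int h = 31 * current + '/';
        for (int i = 0; i < name.length(); i++) {
            h = 31 * h + name.charAt(i);
        }
        current = h;
        return h;
    }

    // Called on an end tag: restores the parent's cached hash.
    int leave() {
        current = parents.pop();
        return current;
    }

    public static void main(String[] args) {
        int ruleHash = "/book/title".hashCode(); // computed once, when the rule is defined
        PathHasher location = new PathHasher();
        location.enter("book");
        boolean matched = location.enter("title") == ruleHash;
        System.out.println(matched);
    }
}
```

Two different paths can in principle share a hash, so a production version would need to decide whether to accept that risk or confirm a hit with a full comparison.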
I would encourage anyone skeptical to run HPROF on the Benchmark class to see SJXP in action and look at the memory and CPU allocations directly from the VM to see how tight the library is.
The project has been getting more pickup in the Android community and gotten a lot of good feedback. If any of you give it a try and have comments, please let me know!