Parsing Large XML (2GB) in less than 5 seconds

 
Greenhorn
Posts: 25
Hi,

I am working on a project that requires me to read an XML file as large as 2GB and parse it in less than a few seconds.

My requirements are:
1> Read large XML (GB-sized files)
2> Use XPath to locate a node and a value
3> Parse the XML within a timeout period of 5 seconds.
4> Use JDK 1.4
5> Limited memory usage (cannot use DOM)

Currently I am using a SAX and XPath combination, which gives me good results if the particular node exists early in the file. The problem occurs when either the node does not exist, so I have to parse the entire XML to find that it is not there, or it exists at the very end of the XML. Parsing the complete XML takes well over 5 seconds.

Does anyone have any idea what combination of techniques can give me the ability to read a GB-sized XML file in less than a few seconds, with limited memory usage, on JDK 1.4?
 
Author and all-around good cowpoke
Posts: 13078
Yow, that's a challenge!

XPath is out, since the standard implementations require a DOM. If somebody is handing you an XPath, you will have to write code to interpret it.

There have been LOTS of attempts to write parsers faster than the ones in the standard library - you might find a faster one.

Does your monster XML use namespaces? (Hope not, it is so much simpler without.)

How fixed is the XML format? Can we just treat this as a text search problem and ignore XML?

Bill
 
Saurabh Gokhale
Greenhorn
Posts: 25
Hi William, thanks for replying.

Yes, I am using a non-standard interpreter that lets me read the XML using SAX and evaluate XPath against it.

One good thing about this XML is that I at least don't have to worry about namespaces.

You are right, I may be able to treat it as a text search instead of as XML. It will at least tell me whether the tag I am looking for exists in the XML or not. But I may still end up matching a tag with the same name under a different parent.

Will it improve performance if I treat this as a text search instead of as XML? How should I go about this route?

 
William Brogden
Author and all-around good cowpoke
Posts: 13078
1. Tell me more about this SAX parser that lets you work with XPath without requiring a DOM!

Saurabh Gokhale wrote:
Will it improve performance if I treat this as a text search instead of as XML? How should I go about this route?

Since any XML parser has to read the file as text just to start, and then has to do the parsing on top of that, plain text reading offers good possibilities.

Even better, IF you are dealing with NOTHING BUT characters in the ASCII 0-127 range, you can read byte[] and totally avoid the time-consuming byte-to-char decoding that Java has to do whenever it builds Strings.
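To make the byte[] idea concrete, here is a minimal sketch, assuming the file really is pure ASCII and the tag is written exactly as `<price>` with no attributes. It scans raw blocks for a byte pattern using a Knuth-Morris-Pratt matcher, so a match that straddles two blocks is still found. The class name `ByteTagFinder`, the tag and the block size are made up for illustration, and the search knows nothing about parents, comments or CDATA.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: finds a byte pattern in a stream without ever building a String.
final class ByteTagFinder {

    /** Returns the byte offset of the first occurrence of pattern, or -1 if absent. */
    static long indexOf(InputStream in, byte[] pattern, int blockSize) throws IOException {
        if (pattern.length == 0) {
            return 0;
        }
        int[] fail = failureTable(pattern);
        byte[] block = new byte[blockSize];   // assumes blockSize > 0
        int matched = 0;                      // pattern bytes matched so far, kept across blocks
        long pos = 0;                         // stream offset of block[0]
        int n;
        while ((n = in.read(block)) >= 0) {
            for (int i = 0; i < n; i++) {
                byte b = block[i];
                while (matched > 0 && b != pattern[matched]) {
                    matched = fail[matched - 1];
                }
                if (b == pattern[matched]) {
                    matched++;
                }
                if (matched == pattern.length) {
                    return pos + i - pattern.length + 1;
                }
            }
            pos += n;
        }
        return -1;
    }

    /** Standard KMP failure table: longest proper prefix that is also a suffix. */
    static int[] failureTable(byte[] p) {
        int[] fail = new int[p.length];
        int k = 0;
        for (int i = 1; i < p.length; i++) {
            while (k > 0 && p[i] != p[k]) {
                k = fail[k - 1];
            }
            if (p[i] == p[k]) {
                k++;
            }
            fail[i] = k;
        }
        return fail;
    }
}
```

Typical use would be something like `ByteTagFinder.indexOf(new FileInputStream(file), "<price>".getBytes("US-ASCII"), 64 * 1024)`. A hit only says the tag text occurs somewhere, so the "same tag under a different parent" problem still has to be handled separately.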

What to do next? Try a time trial of reading the whole file as String lines versus byte[] blocks. Java's NIO package may offer us some goodies for converting characters, but I have not tried it. Look at java.nio.CharBuffer, for example.
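A rough sketch of such a time trial, using only classes available in JDK 1.4. The class name `ReadTimeTrial` and the 64 KB block size are arbitrary choices, not anything from this thread. Whichever read runs second will benefit from the operating system's file cache, so run it several times and swap the order before trusting the numbers.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical harness: compares String-line reading with raw byte[] block reading.
final class ReadTimeTrial {

    /** Reads the file as String lines; returns the number of chars, line terminators excluded. */
    static long readAsLines(File f) throws IOException {
        BufferedReader r = new BufferedReader(new FileReader(f));
        try {
            long chars = 0;
            String line;
            while ((line = r.readLine()) != null) {
                chars += line.length();
            }
            return chars;
        } finally {
            r.close();
        }
    }

    /** Reads the file as raw byte[] blocks; returns the number of bytes read. */
    static long readAsBlocks(File f, int blockSize) throws IOException {
        InputStream in = new FileInputStream(f);
        try {
            byte[] block = new byte[blockSize];
            long bytes = 0;
            int n;
            while ((n = in.read(block)) >= 0) {
                bytes += n;
            }
            return bytes;
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java ReadTimeTrial <file>");
            return;
        }
        File f = new File(args[0]);
        long t0 = System.currentTimeMillis();
        long chars = readAsLines(f);
        long t1 = System.currentTimeMillis();
        long bytes = readAsBlocks(f, 64 * 1024);
        long t2 = System.currentTimeMillis();
        System.out.println("lines:  " + chars + " chars in " + (t1 - t0) + " ms");
        System.out.println("blocks: " + bytes + " bytes in " + (t2 - t1) + " ms");
    }
}
```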

Other speedups include devoting one Thread to reading blocks of the file and another to the scanning. Lots of cool possibilities, but let's not get too complicated.
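For the two-thread idea, one possible shape is a single-slot hand-off built on wait/notify, since java.util.concurrent is not available on JDK 1.4. This is only a sketch: the class name `TwoThreadScan` is invented, counting one byte value stands in for the real scanning work, and a fresh byte[] is allocated per block so the reader can never overwrite a block the scanner is still looking at.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: one thread reads blocks, the calling thread scans them.
final class TwoThreadScan {
    private byte[] slot;          // block waiting to be scanned, or null when empty
    private int slotLen;          // number of valid bytes in slot
    private boolean done;         // reader has finished (normally or with an error)
    private IOException failure;  // read error to rethrow on the scanning thread

    private synchronized void put(byte[] block, int len) throws InterruptedException {
        while (slot != null) {
            wait();               // scanner has not taken the previous block yet
        }
        slot = block;
        slotLen = len;
        notifyAll();
    }

    private synchronized void finish(IOException e) {
        failure = e;
        done = true;
        notifyAll();
    }

    /** Stores the next block in out[0] and returns its length, or -1 at end of input. */
    private synchronized int take(byte[][] out) throws InterruptedException {
        while (slot == null && !done) {
            wait();
        }
        if (slot == null) {
            return -1;
        }
        out[0] = slot;
        slot = null;
        notifyAll();
        return slotLen;
    }

    /** Counts occurrences of target; blockSize is assumed to be greater than zero. */
    static long countByte(final InputStream in, final int blockSize, byte target)
            throws IOException, InterruptedException {
        final TwoThreadScan q = new TwoThreadScan();
        Thread reader = new Thread(new Runnable() {
            public void run() {
                IOException err = null;
                try {
                    while (true) {
                        byte[] block = new byte[blockSize];
                        int n = in.read(block);
                        if (n < 0) {
                            break;
                        }
                        if (n > 0) {
                            q.put(block, n);
                        }
                    }
                } catch (IOException e) {
                    err = e;
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    q.finish(err);  // always wake the scanner, even on failure
                }
            }
        });
        reader.setDaemon(true);
        reader.start();

        long count = 0;
        byte[][] holder = new byte[1][];
        int n;
        while ((n = q.take(holder)) >= 0) {
            byte[] b = holder[0];
            for (int i = 0; i < n; i++) {
                if (b[i] == target) {
                    count++;
                }
            }
        }
        reader.join();
        if (q.failure != null) {
            throw q.failure;
        }
        return count;
    }
}
```

Whether this beats a single-threaded loop depends on how much work the scanner does per block, so it is worth timing before committing to the extra complexity.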

Bill
(Way back at the beginning of Java I did some mind-bogglingly fast text parsing to support a legal services client, so this is an area of interest to me.)

 
Saurabh Gokhale
Greenhorn
Posts: 25
Hi .. sorry .. I was travelling so could not reply...

I am using my own setup: a stack structure stores the tags, and the begin and end events keep track of which tag was read and whether it matches the XPath expression.

I will definitely try reading it as a byte stream and see if it makes any time difference. Threading is another good option that you rightly suggested, but it is going to be a little tricky to use XPath afterwards.
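For readers who want to see what such a setup might look like, here is a minimal sketch. It is not the poster's actual code. It assumes a simple absolute path such as /root/item/price with no predicates, wildcards or namespaces; the class name `PathFinder` is invented, and two depth counters stand in for the explicit tag stack. Parsing is aborted by throwing a marker exception as soon as the target element closes.

```java
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: match a simple absolute path during SAX parsing and stop early.
final class PathFinder extends DefaultHandler {

    /** Marker exception thrown to stop the parser once the value is known. */
    static final class Found extends SAXException {
        Found() {
            super("target element found");
        }
    }

    private final String[] target;  // path steps, e.g. {"root", "item", "price"}
    private int depth;              // number of currently open elements
    private int matched;            // how many leading open elements match target
    private StringBuffer text;      // non-null only while inside the target element
    private String value;

    private PathFinder(String path) {
        String p = path.startsWith("/") ? path.substring(1) : path;
        target = p.split("/");
    }

    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (matched == depth && depth < target.length && target[depth].equals(qName)) {
            matched++;
        }
        depth++;
        if (matched == target.length && depth == target.length) {
            text = new StringBuffer();  // entering the target: start collecting text
        }
    }

    public void characters(char[] ch, int start, int length) {
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (text != null && depth == target.length) {
            value = text.toString();
            throw new Found();          // abort the rest of the parse
        }
        depth--;
        if (matched > depth) {
            matched = depth;
        }
    }

    /** Returns the text of the first element at the given path, or null if there is none. */
    static String find(InputStream in, String path)
            throws IOException, SAXException, ParserConfigurationException {
        PathFinder handler = new PathFinder(path);
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        } catch (Found stop) {
            // expected: the handler stopped the parse on purpose
        }
        return handler.value;
    }
}
```

The early exit only helps when the node appears early. If the node is missing or sits at the very end, the parser still has to walk the whole file, which is the slow case described at the top of the thread.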
 