
How to process large XML files (bigger than 1GB) in Java?

 
shuyi zhou
Greenhorn
Posts: 2
Dear All,

I have to process some large XML files (bigger than 1GB per file) in Java code. Which approach would you suggest?

SAX?
StAX?

Or any other better way?
 
Walter Bernstein
Ranch Hand
Posts: 57
StAX is a bit more developer-friendly.

But before using SAX/StAX, try dom4j with the XPP parser. Maybe it can handle your data, but that depends on what you do with the file...
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13056
6
I doubt very much that any DOM-oriented parser will be able to handle a file bigger than 1GB, and there may not be any reason to hold the entire thing in memory at one time. You left out the essential information: what has to be done to the data in this XML file?

If it is just record-by-record processing, then event-oriented (SAX or StAX) parsing will be the way to go. For record-by-record processing, an existing "pipeline" toolkit may be applicable.
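For readers who want to see what record-by-record SAX processing looks like, here is a minimal sketch using only the JDK's built-in parser. The element name `name`, the sample document, and the class `SaxRecordReader` are assumptions made up for illustration; nothing in this thread says what the original poster's files contain.

```java
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Sketch only: streams a document with SAX and hands each <name> value to a callback.
// The element name "name" is hypothetical; adapt it to the real schema.
class SaxRecordReader {

    /** Passes the text of every <name> element to the sink as soon as it is complete. */
    static class RecordHandler extends DefaultHandler {
        private final Consumer<String> sink;
        private final StringBuilder text = new StringBuilder();
        private boolean inName;

        RecordHandler(Consumer<String> sink) {
            this.sink = sink;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("name".equals(qName)) {
                inName = true;
                text.setLength(0); // start a fresh buffer for this element
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser may deliver one text node in several chunks, so append.
            if (inName) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("name".equals(qName)) {
                inName = false;
                sink.accept(text.toString());
            }
        }
    }

    static void parse(InputSource source, Consumer<String> sink) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(source, new RecordHandler(sink));
    }

    public static void main(String[] args) throws Exception {
        // Small in-memory sample; a real run would wrap a FileInputStream instead.
        String xml = "<records><record><name>a</name></record>"
                   + "<record><name>b</name></record></records>";
        parse(new InputSource(new StringReader(xml)), System.out::println);
    }
}
```

Because the handler keeps only the current element's text, memory use should stay roughly constant regardless of file size, provided the callback does not accumulate everything it receives.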

So - more detail on what has to be done to the data please.

Bill
 
Raghavan Muthu
Ranch Hand
Posts: 3381
Mac MySQL Database Tomcat Server
That's very true! It surely depends on what you intend to do with the XML data once it has been parsed.
 
Walter Bernstein
Ranch Hand
Posts: 57
Originally posted by William Brogden:
I doubt very much that any DOM oriented parser will be able to handle a file bigger than 1GB and there may not be any reason to handle the entire thing in memory at one time.

It worked for me with a 1.2GB file; just try it. dom4j is DOM-oriented, but not a real DOM parser.
 
Ilja Preuss
author
Sheriff
Posts: 14112
Originally posted by William Brogden:
I doubt very much that any DOM oriented parser will be able to handle a file bigger than 1GB


That would very much depend on the parser and the file structure, wouldn't it?

If, for example, the file contained one and the same tag again and again, a DOM oriented parser that interned the tag names would likely have no memory problem at all.
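As a rough illustration of the interning point, the sketch below simulates a parser producing the same tag name many times, with and without `String.intern()`. The class `TagNameInterning` and the tag name `record` are hypothetical, and this only shows sharing of the name strings; the node objects of a real tree would still cost memory.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Sketch only: compares how many distinct String objects back n identical tag names.
class TagNameInterning {

    /** Builds n tag-name strings the way a parser might, optionally interning each one. */
    static List<String> tagNames(int n, boolean intern) {
        List<String> names = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            // new String(...) simulates a name freshly decoded from the input buffer.
            String name = new String("record");
            names.add(intern ? name.intern() : name);
        }
        return names;
    }

    /** Counts distinct String instances by object identity, not by equals(). */
    static int distinctInstances(List<String> names) {
        Set<String> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        seen.addAll(names);
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("without interning: " + distinctInstances(tagNames(1000, false)));
        System.out.println("with interning:    " + distinctInstances(tagNames(1000, true)));
    }
}
```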

Let alone the trivial case of an XML file containing 99% white space...
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13056
6
It worked for me with a 1.2GB file; just try it. dom4j is DOM-oriented, but not a real DOM parser.


That really surprised me - how much memory did you have to give the JVM?

Bill
 
Walter Bernstein
Ranch Hand
Posts: 57
Originally posted by William Brogden:


That really surprised me - how much memory did you have to give the JVM?

Bill


750MB
 
Raees Uzhunnan
Ranch Hand
Posts: 126
William Brogden is right: DOM takes a lot of memory. Allocating that much memory as small objects such as nodes and elements also has an impact on performance and garbage collection.

StAX works for us since it parses on demand, and I only need to worry about exactly the data I want to see. Check it out.
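To make the on-demand idea concrete, here is a minimal sketch with the JDK's StAX cursor API that pulls out only one field and skips everything else. The element names `order` and `amount` and the class `StaxFieldReader` are assumptions for illustration, not anything from our actual project.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Sketch only: pulls the text of one named element out of a stream and ignores the rest.
class StaxFieldReader {

    /**
     * Hands the text of every element called `field` to the sink.
     * Assumes those elements contain text only; getElementText() throws
     * an XMLStreamException if it meets a nested child element.
     */
    static void readField(Reader in, String field, Consumer<String> sink) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false); // no DTD processing needed here
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        try {
            while (reader.hasNext()) {
                // The application pulls the next event only when it is ready for it.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && field.equals(reader.getLocalName())) {
                    sink.accept(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // Small in-memory sample; a real run would pass a buffered file reader instead.
        String xml = "<orders>"
                   + "<order><id>1</id><amount>9.50</amount></order>"
                   + "<order><id>2</id><amount>3.25</amount></order>"
                   + "</orders>";
        readField(new StringReader(xml), "amount", System.out::println);
    }
}
```

The difference from SAX is mostly ergonomic: the loop asks for events instead of receiving callbacks, so the code can stay in one method and stop early if it has seen enough.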

Thanks
Raees
 
Lolke Dijkstra
Greenhorn
Posts: 2
Hi,

Have a look here: http://java.dzone.com/articles/conveniently-processing-large

You may also want to have a look at LDX+ framework for processing Big Data XML in Java. It also utilizes SAX, but uses code generation to generate the JavaBeans access to the schema complexTypes. It deals with large datasets by allowing the application programmer to configure what parts to process at runtime. It also deals with memory issues like containers.

We've got an evaluation version available for anyone who is interested in checking it out: http://xml2java.net/downloads.html. General information can be found at: http://xml2java.net

Cheers,
Lolke
 
Winston Gutkowski
Bartender
Pie
Posts: 10111
56
Eclipse IDE Hibernate Ubuntu
Lolke Dijkstra wrote:Have a look here: http://java.dzone.com/articles/conveniently-processing-large

You do realise that you answered a thread that is 5 years old? I suspect shuyi has left the building...

However, for my two-penn'orth, I think the first question I'd be asking myself is: how did my app get into such a state that I'm having to deal with 1Gb XML files in the first place?

Winston

PS: Nice surname BTW. I bet that gets you a few interviews.
 
Lolke Dijkstra
Greenhorn
Posts: 2
Winston Gutkowski wrote:
Lolke Dijkstra wrote:Have a look here: http://java.dzone.com/articles/conveniently-processing-large

You do realise that you answered a thread that is 5 years old? I suspect shuyi has left the building...

However, for my two-penn'orth, I think the first question I'd be asking myself is: how did my app get into such a state that I'm having to deal with 1Gb XML files in the first place?

Winston

PS: Nice surname BTW. I bet that gets you a few interviews.


Haha! Well, nice first name ;-)
You're probably right. That does not change the fact that I have been involved in a (banking) project where end-of-day reporting did involve parsing multi-GB messages. I find the approach that I outline here the most convenient method: http://xml2java.net/xml-java-data-mapping-big-data-article.html
Cheers,
Lolke
 