Suggestions on Parsing Huge XML File

Sterling Crapser
Ranch Hand

Joined: Jun 05, 2006
Posts: 55
Not sure what happened with this post. I first posted it and the save wasn't working, so I clicked it again and ended up with two posts. I edited this one, and the other one with the full text disappeared for some reason.

So I will try again.

I've been tasked with writing code that will parse an XML file that's 2 to 4 gigs in size. I will be parsing around 170 data elements (out of 4000) and writing them to at least 5 tables, two of which have a master/detail relationship. The existing code is parsing about 12 elements so there are blocks of code (parse, set, get) for each element. I think continuing this approach is primitive. If I added all the additional elements the method would be a mile long.

I'm thinking of a data driven approach where the element names (nodes), table names, column names and all the variable names would be stored in a table. I would read the table and load all this information into arrays and use that to reset a loop that parses a data element with each cycle. But I've only been working with Java and Eclipse for a couple months. I have 16 years of experience working with PowerBuilder (which is a lot easier to work with in my opinion). I REALLY miss the datawindow object (PowerBuilder's bread and butter).

Does anyone have any suggestions on classes I can make use of? The XML file is being streamed using StAX so it's a single pass through the data file when parsing.

Thanks
E Armitage
Rancher

Joined: Mar 17, 2012
Posts: 892
    
Most databases can consume data from XML using some tool. I would investigate that option first before doing it using code.
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8419
    

Sterling Crapser wrote:I've been tasked with writing code that will parse an XML file that's 2 to 4 gigs in size.

Well, personally, I'd question the wisdom of actually having XML files that are that big, but that's not your fault.

I will be parsing around 170 data elements (out of 4000) and writing them to at least 5 tables, two of which have a master/detail relationship. The existing code is parsing about 12 elements so there are blocks of code (parse, set, get) for each element. I think continuing this approach is primitive. If I added all the additional elements the method would be a mile long.

I'm thinking of a data driven approach where the element names (nodes), table names, column names and all the variable names would be stored in a table. I would read the table and load all this information into arrays and use that to reset a loop that parses a data element with each cycle.

So, you'd be using a table, stored in a flat file, to filter tabular data from another flat file. Do you see the absurdity?

This is just a guess, but I suspect that your original XML file should actually be a database; especially considering how enormous it is. Then, you could simply add filtration options as another table. For starters I suspect it would be a LOT smaller, since many db's offer data compression natively; secondly, it's likely to be orders of magnitude quicker.

A lot of people seem to regard XML as a panacea, and it's not. It IS very useful for passing around contextual data, but I'm amazed that nobody thought of another approach before you got to 4 Gigs worth of it.

My 2¢

Winston

Isn't it funny how there's always time and money enough to do it WRONG?
Sterling Crapser
Ranch Hand

Joined: Jun 05, 2006
Posts: 55
This task is from my company which is a financial institution. The XML file is coming from Morningstar and is a standard method for transmitting data to banks all over the world. Criticizing the form the data is delivered in is not what I'm looking for.

Maybe I'm not describing this well. The current code uses a StAX parser to read the XML file one data element at a time. It has a loop containing a series of "IF" statements checking the value of the current element parsed. When one of the "IF" statements matches, it advances the cursor in the XML file to the next element (which is the data) and stores the data in a unique variable. Later in the code it validates the data in the variable and, if needed, converts it to a number, string, date, etc. A second method uses getters to build the SQL statement that will insert a complete record into a table. Right now there is a set of code blocks for each data element being sought. If I continue this approach the amount of code will be enormous.

Instead of "hard-coding" the names of every data element I'm looking for in the code (over 160), I thought I could have a table containing all the element names, variable names, table names, etc. and load that information into an array (or maybe several) and write a generic loop that gets loaded with the information from the array(s) before it cycles. It could then see if the current values it is loaded with matches the current element being read in the XML file and if it doesn't, load the next set of values from the array(s) and cycle again until there's a match. To do this there would be two loops (one nested inside the other). The first loop parses the data elements one at a time, the nested loop would be for checking and processing the data if a match is found.

I don't see how this is absurd...just a bit sophisticated if anything. Not knowing Java classes all that well I thought someone here might know of what I could use and make a suggestion. I'm not doing anything with a flat file except reading one (the XML data file).
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8419
    

Sterling Crapser wrote:Instead of "hard-coding" the names of every data element I'm looking for in the code (over 160), I thought I could have a table...

You can do that; and I certainly didn't mean to imply that you couldn't. So if your question is simply: 'can I put my filtration data in a file?', then of course the answer is: Yes. And you can also load that data into any form of Java collection that makes sense to facilitate the operation.

But it sounds to me as if you're then building an app to access/filter data that's not in a particularly accessible form to begin with. If this is actually load data, then couldn't the same operation be done once it's been added to the database?

If not, then your approach sounds like a reasonable alternative; but if it could be done from the database rather than an XML stream, then I'd say that the best place for your selection information is in that database. Then your filter simply becomes a set of SELECT statements with table joins.

And even if it's going to power an XML filter, it still might be best to put your selection criteria in a database rather than a flat file. From what you said, it sounds as if it may be fairly involved, with possibly some structure to it, so you may also find it easier to document and/or validate if you do it that way.
And don't get too hung up on the word "database" - there are plenty of 'mini' databases around (JavaDB being just one) which you can even load and incorporate into a jar.

I suspect it's likely to boil down to questions like:
  • How complex is your selection data?
  • How often does it change?
  • How important is it that it's correct?

If the answers are: 'very', 'often' and 'very', then I'd say a database is your best bet. If the answers are: 'simple', 'almost never' and 'not very', then a flat file will probably be fine.

    However, the basic operation (as opposed to your question) doesn't sound great. It suggests to me that you're using an XML stream as a database, which just feels wrong.

    Winston
    Sterling Crapser
    Ranch Hand

    Joined: Jun 05, 2006
    Posts: 55
    Thanks for your feedback. I appreciate this discussion for more reasons than just looking for a coding solution. The code I'm working with was written by someone who for whatever reason, was unable to parse data out of an XML file and insert it into a table any other way. Perhaps she simply didn't know how.

    My company has me on a deadline and I'm brand new to Java, Eclipse, and XML. They threw me into the deep end of the pool and that's it. I know there are better ways to do things but I have very little time to explore and learn. So instead I have to work with what's in front of me. It's very frustrating, especially when the decision makers are not programmers themselves.

    They want data from the XML file...not all of it...just select pieces and have that data loaded into several tables where financial analysts retrieve the data using materialized views. They don't know or care about how everything works. They don't understand that what a person can say in a paragraph of layman's terms can translate into a very sophisticated coding solution requiring a lot of knowledge and expertise. Currently they want me to write the new process in such a way that the financial analysts can ask for new data elements to be added (or removed) from the parsing process without having to go through a lot of change control bureaucracy.

    We only need about 2% of the data contained in the XML file and unfortunately it's the only data source we have so we have to work with it. The approach I'm proposing may not be the best or the simplest but because I'm so new to all this I cannot see or tell the difference. I know that hard-coding everything is fundamentally wrong but that's all. I've encountered discussions of importing XML into a database, using third-party software, discussions about XML Schemas and XSD files...all sorts of things. It's all Greek to me at this point. So I'm going with the approach of writing a Java method that will parse the data out of the file, build a SQL statement, and insert the record into a table (rinse and repeat).

    I had a meeting this morning and someone suggested I consider using something called, "Collections". I never heard of this until this morning. I have no idea what it is. It's very difficult to do a job where you have very little understanding of how everything works...just a vague notion and some common sense.
    fred rosenberger
    lowercase baba
    Bartender

    Joined: Oct 02, 2003
    Posts: 11498
        

    In Java (and other languages), a "collection" is a generic term for something that holds a bunch of things. The idea is that you try to abstract away from the details. It doesn't matter if it is an array, a linked list, a stack, or some other data structure. The idea is you have a bunch of things, you can generally add something to it, take something out of it, (possibly) search to see if something is in there, etc.

    If someone's advice was "use a collection", that is about as helpful as someone saying "use a vehicle". Sure, if I need to get from 'a' to 'b', using a vehicle is a good idea. But what vehicle to use is going to depend on whether I'm going next door, 10 miles to work, 3000 miles to China, or 200,000 miles to the moon.


    There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors
    Campbell Ritchie
    Sheriff

    Joined: Oct 13, 2005
    Posts: 40052
        
    Too difficult a question for “beginning”. Moving: let's try the XML forum.
    Winston Gutkowski
    Bartender

    Joined: Mar 17, 2011
    Posts: 8419
        

    Sterling Crapser wrote:Thanks for your feedback. I appreciate this discussion for more reasons than just looking for a coding solution. The code I'm working with was written by someone who for whatever reason, was unable to parse data out of an XML file and insert it into a table any other way. Perhaps she simply didn't know how.

    Or perhaps, like you, they didn't have enough time. That's what my signature quote is all about.

    My company has me on a deadline and I'm brand new to Java, Eclipse, and XML. They threw me into the deep end of the pool and that's it.

    Ooof. My condolences.

    They want data from the XML file...not all of it...just select pieces and have that data loaded into several tables where financial analysts retrieve the data using materialized views. They don't know or care about how everything works.

    Double ooof, since it would appear that they're not only telling what they want, but how to do it, which is a recipe for disaster in my book.

    Currently they want me to write the new process in such a way that the financial analysts can ask for new data elements to be added (or removed) from the parsing process without having to go through a lot of change control bureaucracy.

    Aha! I thought so. More on that later.

    We only need about 2% of the data contained in the XML file and unfortunately it's the only data source we have so we have to work with it.

    And this is my basic question: WHY is it the only data source? Does everyone (or do all your apps) use it?

    The approach I'm proposing may not be the best or the simplest but because I'm so new to all this I cannot see or tell the difference. I know that hard-coding everything is fundamentally wrong but that's all.

    From the sound of it, your instincts are just fine; but you're getting your instructions from a bunch of micro-managing headless chickens.

    I've encountered discussions of importing XML into a database, using third-party software, discussions about XML Schemas and XSD files...all sorts of things. It's all Greek to me at this point. So I'm going with the approach of writing a Java method that will parse the data out of the file, build a SQL statement, and insert the record into a table (rinse and repeat).

    So it sounds as if this is actually an import procedure to plough this data into some sort of database anyway - just not your main one - so they're forcing you to duplicate pretty much the entire import process just for them, with the addition of a complex selection procedure at the front end to boot.

    Personally, I call that a "rogue" application.

    I had a meeting this morning and someone suggested I consider using something called, "Collections".

    Well, the tutorials for the Java Collections framework are here. Basically there are three main types of structure for holding "things" (ie, Java objects): Lists, Sets and Maps, and you should definitely read up on them.
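
For what it's worth, here is a minimal sketch of those three structure types in use. The element and column names are made up for illustration and are not taken from the Morningstar feed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CollectionsSketch {
    // A List keeps items in insertion order and allows duplicates.
    static List<String> seenInOrder() {
        List<String> seen = new ArrayList<>();
        seen.add("Name");
        seen.add("Id");
        seen.add("Name");
        return seen;
    }

    // A Set holds each value at most once and answers "is it in there?".
    static Set<String> wantedElements() {
        Set<String> wanted = new HashSet<>();
        wanted.add("Name");
        wanted.add("Id");
        wanted.add("Name"); // ignored, already present
        return wanted;
    }

    // A Map looks up a value by key, e.g. XML element name -> column name.
    static Map<String, String> elementToColumn() {
        Map<String, String> columns = new HashMap<>();
        columns.put("Name", "CATEGORY_NAME"); // hypothetical column names
        columns.put("Id", "CATEGORY_ID");
        return columns;
    }

    public static void main(String[] args) {
        System.out.println(seenInOrder().size());
        System.out.println(wantedElements().contains("Id"));
        System.out.println(elementToColumn().get("Name"));
    }
}
```

Which of the three fits depends on the question being asked of the data: "what came in, in order?" (List), "is this one I care about?" (Set), or "what goes with this name?" (Map).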

    However, before you go too far down that road: Is there someone - a mentor, perhaps - in your department you could talk to? I'm presuming you work in IT; not directly for the people that are asking for this.

    It sounds to me - and your statement above would appear to confirm it - as if someone has dreamt up this "operation" to bypass the normal chain of command. Is this really something your IT department (not to mention your administration) wants to encourage? What happens when the next bunch of renegades decide they want their own "world", free from the annoyance of change control? Another "little" select/import to shut them up?

    Feel free to point them to this thread (or just my questions) if you like, because I fear you're heading down the road of a lot of "wheel-reinvention" - which is precisely what change control was designed to prevent.

    HIH

    Winston

    PS: I will be around on Monday until about 10AM EST if you have any questions; otherwise I'll be back on Friday. Good luck. I suspect you're going to need it.
    William Brogden
    Author and all-around good cowpoke
    Rancher

    Joined: Mar 22, 2000
    Posts: 12835
        
    Just speculating because we know nothing of the internal hierarchy of your input XML.

    To provide an open ended way of mapping XML Element names to processing instructions, how about a Map where the key is the Element name and the value provides handling instructions.

    For example the value could be a java.lang.reflect.Method, or the value could give an int used in a switch statement to select the processing of that Element.

    For a given run, the Map would contain only the entries for the Elements needed, thus avoiding that long series of IF statements.
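
A rough sketch of that Map idea, assuming the wanted elements contain text only. The element names (Fund, Name, Rating) and the NAME/RATING keys are invented for the example, and the handlers are written as lambdas rather than java.lang.reflect.Method objects or switch codes.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ElementDispatch {
    // Reads the XML once. For each start element whose name is a key in
    // 'handlers', the element text and the output row go to that handler.
    static Map<String, Object> extract(
            String xml,
            Map<String, BiConsumer<String, Map<String, Object>>> handlers)
            throws XMLStreamException {
        Map<String, Object> row = new HashMap<>();
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                BiConsumer<String, Map<String, Object>> handler =
                        handlers.get(reader.getLocalName());
                if (handler != null) {
                    // getElementText() is only valid for text-only elements.
                    handler.accept(reader.getElementText(), row);
                }
            }
        }
        reader.close();
        return row;
    }

    public static void main(String[] args) throws XMLStreamException {
        Map<String, BiConsumer<String, Map<String, Object>>> handlers = new HashMap<>();
        // One entry per wanted element; unlisted elements are simply ignored.
        handlers.put("Name", (text, row) -> row.put("NAME", text.trim()));
        handlers.put("Rating", (text, row) -> row.put("RATING", Integer.valueOf(text.trim())));

        String xml = "<Fund><Name> Example Fund </Name><Rating>4</Rating><Other>x</Other></Fund>";
        System.out.println(extract(xml, handlers));
    }
}
```

Adding or removing an element then means adding or removing a Map entry, not another IF block.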

    Bill
    Sterling Crapser
    Ranch Hand

    Joined: Jun 05, 2006
    Posts: 55
    Sorry to be away from this discussion. But I have read all the replies. I will try to answer some of the questions asking what the background of the situation is and then ask my own.

    The data file comes from Morningstar. I don't know much about who Morningstar is, just that they supply real time financial data to hundreds if not thousands of financial institutions all over the world. They also rank investments. If you've dealt with investing you have probably heard of a "Morningstar Rating" (usually displayed as one to five stars).

    What is happening is this...Morningstar currently sends my company roughly 50,000 (yes, that's 50K) data files over the course of each month. They are consolidating the data into massive XML data files and expect us to deal with it. So we have this huge file with data records for maybe 20,000 investments of which we only need a small fraction of the data for our needs. The first massive XML file has been "live" now for about 9 months and the person I mentioned previously who wrote some parsing code (a contractor) built a method that has been piggy-backed into the existing Java application that used to deal with all those 50K files.

    Now they want me to expand on what she wrote. Over the past few months I have learned they originally wanted her to come up with something like what I have been describing but she told them she couldn't. She implemented a StAX parser and put it into a loop that moves the cursor through the file one event at a time. She has the XML node names hard-coded (as class variables) and a series of IF statements (one for each node being sought) that are triggered when the cursor hits a start element. When the cursor hits an end element, the data is validated. In some cases it needs to be cast as a date or number so the Oracle tables will accept it. There is also some data that gets interpreted so there is also that sort of logic to deal with.

    If I followed her example, the method would end up a mile long. When the company managers realized what she created they said they definitely want something more generic. So I came up with the idea of using arrays and nested loops. The main loop moves the cursor through the file and the nested loops load a new set of parameters into variables from the arrays and compare each set of variables against where the cursor is and if a match is encountered the data gets processed. All the variable data will be stored in Oracle tables and loaded into the arrays prior to the loop processing start. This is a very primitive description of the process but I think it describes the idea.

    The hierarchy of the XML file is about 5 layers deep. There are some instances where the same node name is used at the bottom of a hierarchy so I must check for parent nodes to look for the specific data I need. For example, I might encounter something like this:

    <MorningstarCategory>
        <Name>data</Name>
        <Id>data</Id>
    </MorningstarCategory>

    The Name and Id data are what I need to parse but there are other parent nodes that also contain Name and Id data that I don't want. I think this should be easy enough to handle. But I'm getting away from the main topic here.
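
One possible way to handle the parent check with a StAX cursor is to keep a stack of the currently open elements and look at the top of it. This is only a sketch: the Fund and Manager elements are invented, and it assumes the children of the wanted parent contain text only.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ParentAwareParse {
    // Returns element text keyed by "Parent/Child", but only for children of
    // 'wantedParent'. Same-named children under other parents are ignored.
    static Map<String, String> childrenOf(String xml, String wantedParent)
            throws XMLStreamException {
        Map<String, String> found = new HashMap<>();
        Deque<String> open = new ArrayDeque<>(); // open elements, innermost on top
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = reader.getLocalName();
                String parent = open.peek(); // null while at the root
                if (wantedParent.equals(parent)) {
                    // getElementText() also consumes this element's end tag,
                    // so the element is never pushed on the stack.
                    found.put(parent + "/" + name, reader.getElementText());
                } else {
                    open.push(name);
                }
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                open.pop();
            }
        }
        reader.close();
        return found;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<Fund>"
                + "<MorningstarCategory><Name>a</Name><Id>1</Id></MorningstarCategory>"
                + "<Manager><Name>b</Name><Id>2</Id></Manager>"
                + "</Fund>";
        System.out.println(childrenOf(xml, "MorningstarCategory"));
    }
}
```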

    At a recent meeting I stated I was looking at using Arrays and one of the managers told me to look into using "Collections". At the time I didn't know what they were but now I know an Array is part of the Collections which tells me the manager doesn't know Java.

    I think if I can pull this off I will end up with some fairly sophisticated code in terms of how it works but should be easy to maintain down the road. If there is a simpler solution I would be glad to hear it but I do not think the company is interested in importing the entire data file into a table and then querying the table for the data they need. I don't know why...the place has a lot of quirks (I've been there for over 12 years supporting PowerBuilder apps).

    I'm thinking I may be able to make use of HashMaps (maybe) and will likely need at least one two-dimensional array (an array of arrays).

    There are two other huge XML files Morningstar will start sending us and the company is hoping this solution will be able to process all of them so the definition tables will need to include the data file names as well.
    Winston Gutkowski
    Bartender

    Joined: Mar 17, 2011
    Posts: 8419
        

    Sterling Crapser wrote:Sorry to be away from this discussion. But I have read all the replies. I will try to answer some of the questions asking what the background of the situation is

    Which you've done very well; hence the +1.

    So, it would appear that what you have is a parameterized extract procedure; you just have quite a lot of parameters.

    I guess the next question is: is there any context to this extract process? For example, are you likely to need 'Name' and 'ID' for a "Customer", and 'Current Balance' for an "Account"? Or is it just 'Name' and 'ID' for everything, but you just need to eliminate the "parents" that you're not interested in?

    Either way, looking through your post, I'm wondering if there might be a way to do this in two stages:
    1. Eliminate "parents" you definitely don't need.
    2. Parse the remainder to extract the data you want.

    Assuming it's feasible, I could imagine the output of Step 1 being another XML stream with the same hierarchy as your original, but with all superfluous "parents" removed, leaving your actual extract with a lot less data to parse - and also possibly less conditional logic, since it can assume that what remains will have data to extract.

    If there is some context to this extract process, then I suspect I'd be looking at a class (or an interface) that contains some smarts about what to look for (ie, the parent "name") along with the 'data' to be extracted.

    Then your Step 2 process might be something like this:
    // a bunch of "parent" objects
    get the next "parent" tag from your XML
    find the "parent" object associated with that tag (eg, '<customer>')
    pass it the XML stream and have it:
      • pull the data it needs from the stream
      • advance to its terminating tag (eg, '</customer>')

    I'm afraid I'm not too up on the mechanics of StAX, but I suspect it may well be possible to simply extract the text for each parent, and treat that as a self-contained piece of XML - ie, pass IT off to another StAX (or - possibly even better - DOM) parser to get the data.
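
The Step 2 outline above could be sketched roughly as follows. ParentHandler, collecting and skipElement are hypothetical names, the Customer/Noise/Junk elements are invented, and the sketch assumes every "parent" is a direct child of the root and contains only elements (no loose text).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ParentHandlers {
    // A "parent" object: receives the reader on its START_ELEMENT and must
    // return with the reader on the matching END_ELEMENT.
    interface ParentHandler {
        Map<String, String> handle(XMLStreamReader reader) throws XMLStreamException;
    }

    // Advances from a START_ELEMENT to its matching END_ELEMENT, ignoring content.
    static void skipElement(XMLStreamReader reader) throws XMLStreamException {
        int depth = 1;
        while (depth > 0) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
            }
        }
    }

    // Handler that keeps the text of the named direct children and skips the rest.
    static ParentHandler collecting(Set<String> wantedChildren) {
        return reader -> {
            Map<String, String> data = new LinkedHashMap<>();
            while (reader.nextTag() == XMLStreamConstants.START_ELEMENT) {
                String child = reader.getLocalName();
                if (wantedChildren.contains(child)) {
                    data.put(child, reader.getElementText());
                } else {
                    skipElement(reader);
                }
            }
            return data; // the reader is now on the parent's END_ELEMENT
        };
    }

    static List<Map<String, String>> run(String xml, Map<String, ParentHandler> handlers)
            throws XMLStreamException {
        List<Map<String, String>> results = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        reader.nextTag(); // move onto the root element
        while (reader.nextTag() == XMLStreamConstants.START_ELEMENT) {
            ParentHandler handler = handlers.get(reader.getLocalName());
            if (handler == null) {
                skipElement(reader); // a parent we definitely don't need
            } else {
                results.add(handler.handle(reader));
            }
        }
        reader.close();
        return results;
    }

    public static void main(String[] args) throws XMLStreamException {
        Map<String, ParentHandler> handlers = new HashMap<>();
        handlers.put("Customer", collecting(new HashSet<>(Arrays.asList("Name", "Id"))));
        String xml = "<Funds>"
                + "<Customer><Id>7</Id><Junk><Deep>x</Deep></Junk><Name>Ann</Name></Customer>"
                + "<Noise><Name>zzz</Name></Noise>"
                + "</Funds>";
        System.out.println(run(xml, handlers));
    }
}
```

Because unwanted parents are skipped wholesale, the per-parent handlers never see the other 95% or so of the file.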

    Now, how you store that data is another matter; but I suspect you could have a fairly generic "bean-like" interface that all your data objects share. Alternatively, you may be able to just convert the data directly into INSERT statements for your database.

    However, you could certainly store the information about what parent tags to look for, and what data to extract from them in an external file and use that to create your "parent" objects.

    Hope it makes sense. I suspect that something like that might save you a lot of those enormous 'if' stacks.

    If you have any questions, feel free to ask.

    Winston

    PS: The reason I suggest DOM for parsing each parent is that then you don't have to worry about the order of your minor tags:
    <Name>data</Name>
    <Id>data</Id>

    and
    <Id>data</Id>
    <Name>data</Name>

    will both work. And hopefully, the size is small enough that you don't worry about a bit of inefficiency.
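
A sketch of how that hand-off might look: StAX walks the big file, and each matching parent is copied into its own small DOM document using the JDK's identity Transformer with a StAXSource. Only one parent is held in memory at a time. The class and method names are invented, and exactly where the reader is left after transform() may vary between implementations, so the loop re-checks the current event instead of assuming.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class FragmentToDom {
    // Returns "name:id" for every <parentName> element, whatever the order
    // of its Name and Id children.
    static List<String> namesAndIds(String xml, String parentName)
            throws XMLStreamException, TransformerException {
        List<String> out = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        // A Transformer created without a stylesheet copies input to output.
        Transformer copy = TransformerFactory.newInstance().newTransformer();
        while (reader.hasNext()) {
            if (reader.getEventType() == XMLStreamConstants.START_ELEMENT
                    && parentName.equals(reader.getLocalName())) {
                DOMResult dom = new DOMResult();
                copy.transform(new StAXSource(reader), dom); // copies just this element
                Element parent = ((Document) dom.getNode()).getDocumentElement();
                out.add(text(parent, "Name") + ":" + text(parent, "Id"));
                // Re-check the current event before advancing, in case the
                // reader already sits on the next sibling's start tag.
                continue;
            }
            reader.next();
        }
        reader.close();
        return out;
    }

    // Text of the first descendant with the given tag (assumed to exist).
    static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Funds>"
                + "<MorningstarCategory><Id>1</Id><Name>a</Name></MorningstarCategory>"
                + "<MorningstarCategory><Name>b</Name><Id>2</Id></MorningstarCategory>"
                + "</Funds>";
        System.out.println(namesAndIds(xml, "MorningstarCategory"));
    }
}
```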
    kri shan
    Ranch Hand

    Joined: Apr 08, 2004
    Posts: 1382
    PS: The reason I suggest DOM for parsing each parent is that then you don't have to worry about the order of your minor tags:


    DOM takes a whole lot of memory and is not recommended for big XML. Try a StAX parser.
    Sterling Crapser
    Ranch Hand

    Joined: Jun 05, 2006
    Posts: 55
    The process is and will be continuing to use a StAX parser. The main loop uses the XMLEventReader to step through the elements within the XML data. When a start element is encountered, a nested loop cycles through an array of node names looking to see if one of the stored names matches the start element's name. If so, the code would get the data and store it in a HashMap (I think). Once the code has passed through all elements available in a single record, it then validates the data, builds an SQL statement and inserts that record into the database table. Then it all repeats. This is a simple explanation of what I'm trying to build. There will be some complexity dealing with data that needs to be interpreted or has more complex validation routines. Some of the data needs to be written to more than one table so I need to come up with flags to build more than one SQL during a single cycle of gathering the data for one record.
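
A minimal sketch of that cycle with XMLEventReader, using a Map lookup in place of the nested loop over an array. The record element (Fund), the element-to-column mapping and the FUNDS table are all made up, and the validation and actual database insert are left out. The generated SQL uses ? placeholders on the assumption that values would be bound through a PreparedStatement.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class RecordToInsert {
    // Builds e.g. "INSERT INTO t (A, B) VALUES (?, ?)" from the collected columns.
    // The table and column names must come from a trusted source.
    static String insertSql(String table, Map<String, String> row) {
        StringJoiner columns = new StringJoiner(", ");
        StringJoiner marks = new StringJoiner(", ");
        for (String column : row.keySet()) {
            columns.add(column);
            marks.add("?");
        }
        return "INSERT INTO " + table + " (" + columns + ") VALUES (" + marks + ")";
    }

    // Collects one Map (column name -> text) per <recordElement>.
    static List<Map<String, String>> readRecords(
            String xml, String recordElement, Map<String, String> elementToColumn)
            throws XMLStreamException {
        List<Map<String, String>> rows = new ArrayList<>();
        Map<String, String> row = null; // non-null only while inside a record
        XMLEventReader events = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        while (events.hasNext()) {
            XMLEvent event = events.nextEvent();
            if (event.isStartElement()) {
                String name = event.asStartElement().getName().getLocalPart();
                if (name.equals(recordElement)) {
                    row = new LinkedHashMap<>();
                } else if (row != null && elementToColumn.containsKey(name)) {
                    // Assumes wanted elements are text-only.
                    row.put(elementToColumn.get(name), events.getElementText());
                }
            } else if (event.isEndElement()
                    && event.asEndElement().getName().getLocalPart().equals(recordElement)) {
                rows.add(row); // real code would validate and insert here
                row = null;
            }
        }
        events.close();
        return rows;
    }

    public static void main(String[] args) throws XMLStreamException {
        Map<String, String> mapping = new HashMap<>();
        mapping.put("Name", "FUND_NAME");
        mapping.put("Id", "FUND_ID");
        String xml = "<Funds>"
                + "<Fund><Name>A</Name><Skip>x</Skip><Id>1</Id></Fund>"
                + "<Fund><Name>B</Name><Id>2</Id></Fund>"
                + "</Funds>";
        for (Map<String, String> row : readRecords(xml, "Fund", mapping)) {
            System.out.println(insertSql("FUNDS", row) + " <- " + row.values());
        }
    }
}
```

Writing the same record to more than one table could then be a matter of keeping one such Map per target table.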
    William Brogden
    Author and all-around good cowpoke
    Rancher

    Joined: Mar 22, 2000
    Posts: 12835
        
    nested loop cycles through an array of node names looking to see if one of the stored names matches the start element's name.


    OR you could see if the name is in a HashSet or HashMap like I suggested before. With a HashMap you could retrieve instructions as to what to do next.


    Bill
    Winston Gutkowski
    Bartender

    Joined: Mar 17, 2011
    Posts: 8419
        

    Sterling Crapser wrote:The process is and will be continuing to use a StAX parser.

    I don't think I've suggested any different. What I've suggested is that once you get down to a manageable piece of XML (hopefully, your "parent"), you use a non-sequential parser to get the data elements.

    I also think that
    (a) eliminating the stuff you know you don't want first (which, from what I can gather, is more than 95% of your original input).
    and
    (b) creating objects to do your "typed data" extraction.
    is probably a more "Object-Oriented" way of looking at the problem. You can still use any number of Collection structures to actually hold those objects.

    However, it also sounds like you may be a victim of unreasonable time constraints, so you're trying to come up with a "Band-Aid" solution (see my signature quote). My main worry is that the next one will be a "Band²-Aid".

    Winston
    Sterling Crapser
    Ranch Hand

    Joined: Jun 05, 2006
    Posts: 55
    Winston Gutkowski wrote:
    Sterling Crapser wrote:The process is and will be continuing to use a StAX parser.

    I don't think I've suggested any different. What I've suggested is that once you get down to a manageable piece of XML (hopefully, your "parent"), you use a non-sequential parser to get the data elements.

    I also think that
    (a) eliminating the stuff you know you don't want first (which, from what I can gather, is more than 95% of your original input).
    and
    (b) creating objects to do your "typed data" extraction.
    is probably a more "Object-Oriented" way of looking at the problem. You can still use any number of Collection structures to actually hold those objects.

    However, it also sounds like you may be a victim of unreasonable time constraints, so you're trying to come up with a "Band-Aid" solution (see my signature quote). My main worry is that the next one will be a "Band²-Aid".

    Winston


    You are correct in that I have little time to work with. And what little time I have is being consumed trying to understand (a) how to implement using Java and (b) if what I am pursuing is the simplest or best approach given my constraints and knowledge. My experience is with PowerBuilder which is a RAD tool for the most part. It has a lot of "canned" objects that simplify development but also limit the possibilities. Java seems to be more broken out into many small pieces which makes it much more flexible (9 ways to skin a cat) but also more complicated to learn and implement. Everything is done via coding.

    If I was working in PowerBuilder, I would create a "structure" object which is essentially an array of arrays. The structure is created via a GUI...I don't have to write code to create it. The program displays the structure as a grid of columns and rows. I enter the name of the first element, then tab to the next column which has a drop down list of available datatypes and pick one. If it needs a size and/or precision, the next column is used for designating that. Once done I use canned events that enable me to insert data via index numbers.

    I guess I'm trying to understand if something similar is available within Java (in concept). I thought I could create an array of arrays but this doesn't seem to be very common so there is little discussion of it online. Using a HashMap seems like a good idea but it's only good for one set of data (by type). Perhaps I could use several HashMaps (one for each datatype) but I'm not that familiar with Java to know if this is more work than it's worth.
    Sterling Crapser
    Ranch Hand

    Joined: Jun 05, 2006
    Posts: 55
    I'm replying again to keep this separate from my other replies.

    I have learned today that an array of arrays is not going to work. I need to create a custom class (object) and assign it to an arrayList of type "object". The object would represent a single row of data from the Oracle table I want to read and load into the array so I can loop through the array in my code.

    I will try to write this out so it makes sense....

    Note: Assume I have created a table containing parsing parameters and other processing flags/values needed for overall parsing job. Also assume I have created a custom class with private variables that represent the columns of a single row of data within the table.

    1. Initialize an arrayList of type "object" and assign the custom object.
    2. Query the table to produce a resultSet.
    3. Read the resultSet into the arrayList of my custom object.
    4. Create a nested loop within the main loop of code that is parsing the XML data file.
    5. When the main loop encounters an XML startElement, it gets the name of the startElement (aka node) and runs the nested loop over the arrayList looking for a match between the startElement name and the values within the arrayList/custom object. When a match is found, use the rest of the data within the same array index to process the ensuing XML data.

    I hope what I'm describing is understandable in terms of what I'm trying to do. I would appreciate feedback.

    Here's what I have yet to learn:

    1. How to read the resultSet and get the data
    2. How to insert each piece of data from the resultSet into the arrayList/object (and make sure it is all indexed correctly so it reflects how the data is found in the table to begin with).
    3. How to loop through the arrayList/object.
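
Those three items might look something like the sketch below. The table name parse_rules, its column names and the Rule fields are all assumptions made for the example. load(...) needs a live JDBC Connection, so main only demonstrates the looping part with an in-memory list.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ParseRules {
    // One object per row of the (hypothetical) parse_rules table.
    static final class Rule {
        final String elementName; // XML node to look for
        final String tableName;   // target table
        final String columnName;  // target column
        final String dataType;    // e.g. "STRING", "NUMBER", "DATE"

        Rule(String elementName, String tableName, String columnName, String dataType) {
            this.elementName = elementName;
            this.tableName = tableName;
            this.columnName = columnName;
            this.dataType = dataType;
        }
    }

    // Items 1 and 2: read the ResultSet row by row into a List of Rule objects.
    static List<Rule> load(Connection connection) throws SQLException {
        List<Rule> rules = new ArrayList<>();
        String sql = "SELECT element_name, table_name, column_name, data_type FROM parse_rules";
        try (Statement statement = connection.createStatement();
             ResultSet results = statement.executeQuery(sql)) {
            while (results.next()) { // one iteration per table row
                rules.add(new Rule(
                        results.getString("element_name"),
                        results.getString("table_name"),
                        results.getString("column_name"),
                        results.getString("data_type")));
            }
        }
        return rules;
    }

    // Item 3: loop through the list once, indexing each rule by element name
    // so the parser can use index.get(name) and needs no nested loop.
    static Map<String, Rule> byElementName(List<Rule> rules) {
        Map<String, Rule> index = new HashMap<>();
        for (Rule rule : rules) {
            index.put(rule.elementName, rule);
        }
        return index;
    }

    public static void main(String[] args) {
        List<Rule> rules = new ArrayList<>();
        rules.add(new Rule("Name", "CATEGORY", "CATEGORY_NAME", "STRING"));
        rules.add(new Rule("Id", "CATEGORY", "CATEGORY_ID", "NUMBER"));
        Rule match = byElementName(rules).get("Id");
        System.out.println(match.tableName + "." + match.columnName);
    }
}
```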

    It's very challenging to learn Java and implement it simultaneously! I wonder if this is the norm.
    Flaz Four
    Greenhorn

    Joined: Nov 08, 2013
    Posts: 6

    Winston Gutkowski wrote:
    Sterling Crapser wrote:The process is and will be continuing to use a StAX parser.

    I don't think I've suggested any different. What I've suggested is that once you get down to a manageable piece of XML (hopefully, your "parent"), you use a non-sequential parser to get the data elements.

    I also think that
    (a) eliminating the stuff you know you don't want first (which, from what I can gather, is more than 95% of your original input).
    and
    (b) creating objects to do your "typed data" extraction.
    is probably a more "Object-Oriented" way of looking at the problem. You can still use any number of Collection structures to actually hold those objects.

    However, it also sounds like you may be a victim of unreasonable time constraints, so you're trying to come up with a "Band-Aid" solution (see my signature quote). My main worry is that the next one will be a "Band²-Aid".

    Winston


    I agree with the idea of eliminating useless stuff from the original XML file. To me, the most appropriate way to do that is XSLT templates. That leaves you with simpler XML file(s) to deal with. You can then parse those file(s) and load the data into collections (Maps, Arrays etc.), walk through the collections, compare element values, process them and compose SQL queries.

    From my perspective there are also some drawbacks in this approach.
    It is not easy to implement logic in XSLT. So if there are some complex conditions it is easier to implement it in Java.
    Another drawback is that XSLT templates can become difficult to maintain.
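To make the XSLT idea concrete, here is a small sketch using the `javax.xml.transform` API that ships with the JDK. The element names (`records`, `account`, `id`, `name`) and the `XsltPrune` class are invented for the example. As far as I know, the XSLT 1.0 processor bundled with the JDK builds the whole input in memory before transforming, so for a 2 to 4 GB file this would need testing first.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltPrune {
    // Hypothetical stylesheet: keep each <account>, but only its <id> and <name> children.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:strip-space elements='*'/>"
        + "<xsl:template match='/records'>"
        + "<records><xsl:apply-templates select='account'/></records>"
        + "</xsl:template>"
        + "<xsl:template match='account'>"
        + "<account><xsl:copy-of select='id|name'/></account>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to the input and returns the reduced XML as a string.
    static String prune(String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

For real files, the `StreamSource` and `StreamResult` would wrap file streams instead of strings.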



    Karthik Jayachandran
    Ranch Hand

    Joined: Feb 18, 2009
    Posts: 88

    We encountered a similar scenario: uploading XML data of about 1GB+.

    Please ignore this if it doesn't suit your application. Our application uses the Struts framework with MS SQL Server as the database.

    Doing the parsing in Java is slow and may end in an out-of-memory exception. So we did the following:
    1. Mapped a shared folder on the database server to the application server (it's an intranet web application).
    2. Uploaded the file directly to that shared folder.
    3. Executed a procedure which reads the file and dumps the whole data into a temp table.
    4. Executed another procedure which splits (cleans) the data as per master/detail.
    5. Further processes...

    Letting the database parse the huge file is a bit faster than doing it on the application/web server.
    Winston Gutkowski
    Bartender

    Joined: Mar 17, 2011
    Posts: 8419
        

    Sterling Crapser wrote:I hope what I'm describing is understandable in terms of what I'm trying to do. I would appreciate feedback.

    Sorry for delay in replying, but I've been a bit busy.

    My main feedback is that you're focusing too much right now on the mechanics of the problem, so your solution is based on how you're going to code it, rather than what its design should be.

    Let me see if I've got the essentials right:
  • You have an enormous XML file that you need to parse.
  • You're only interested in 2-3% of the data that it actually contains.
  • The data that you ARE interested in represents data objects that you want to be able to configure externally.
  • The data for those objects is contained in a portion of the XML structure (your "parent" entity).
  • You already have an existing StAX-based program that can read and identify those "parents", but currently does its "dispatch" via large IF statements which are likely to get a lot bigger if you continue with the same design.
  • You are new to Java.
  • You're under a lot of time pressure.

    Let's deal with the last two of those items first:
    1. The problem may simply be beyond your current capabilities. It's a tough thing to admit (I've been there), but sometimes it's the best thing to do and, while you may think of it as "failure", your management might thank you in the long run (or not; management can be remarkably ungrateful bar stewards). Understanding your own limitations - or indeed, those of the people who are asking you to do things - is part and parcel of being a good programmer.
    2. If the above is a non-starter, try to get your bosses to understand that this is NOT a simple problem - refactoring rarely is - and that what they're asking for, or the timeline they've given you to do it, is likely to lead to a bad - or incomplete - solution.

    There's an old chestnut about software that's worth repeating:
    Good...Fast...Cheap - pick any two.
    and you're hamstrung by the "cheap" qualifier; so one of the other two is going to suffer - and I suspect right now, it's going to be the first.

    So let's look at the other points:
    1. Enormous file - Since (I presume) your StAX program can already pick out what you want, why not just use it to eliminate the things you don't want and, instead of those enormous IF statements, have it use "parent" tags stored in an external file. Personally, I'd separate this step out into a completely separate process that simply does that.
    2. Data objects - Again I'm presuming, but if what you want to extract is geared by the "parent", then create a class that is keyed by the parent tag and contains the minor tags (Name, ID, whatever...) that are significant to that parent. You could also add these to your external file.
    3. Extraction - If you did Step 1 right, then I would expect input to be a vastly reduced piece of XML containing ONLY those parents you want. This could be another StAX module, very similar to your original, that simply reads "parents", but since you now know that ALL parents contain data you want, the only thing to do is to work out which data object they represent, and once you have the correct one, you use its minor tags to actually extract the data.
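As a rough illustration of points 2 and 3, the sketch below keys a small spec object by parent tag and uses its minor tags to pull values out with StAX. `ParentSpec`, `ParentExtractor` and the sample tag names are invented here, and the code assumes parents are not nested inside each other and that minor tags contain text only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical: a parent tag plus the minor tags that matter for that parent.
class ParentSpec {
    final String parentTag;
    final Set<String> minorTags;

    ParentSpec(String parentTag, String... minorTags) {
        this.parentTag = parentTag;
        this.minorTags = new HashSet<>(Arrays.asList(minorTags));
    }
}

class ParentExtractor {
    // Returns one map per parent found: "_parent" plus each minor tag and its text.
    static List<Map<String, String>> extract(String xml, Map<String, ParentSpec> specs)
            throws XMLStreamException {
        List<Map<String, String>> records = new ArrayList<>();
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        ParentSpec current = null;          // spec of the parent we are inside, if any
        Map<String, String> record = null;  // values collected for that parent
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (current == null) {
                        current = specs.get(name); // null when this is not a wanted parent
                        if (current != null) {
                            record = new LinkedHashMap<>();
                            record.put("_parent", name);
                        }
                    } else if (current.minorTags.contains(name)) {
                        record.put(name, reader.getElementText()); // text-only assumed
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && current != null
                        && reader.getLocalName().equals(current.parentTag)) {
                    records.add(record); // parent closed: the record is complete
                    current = null;
                    record = null;
                }
            }
        } finally {
            reader.close();
        }
        return records;
    }
}
```

The `specs` map is what could be loaded from the external file or table, so adding a new data element would mean adding a row rather than another IF block.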

    Note that Step 1 doesn't need to write all that XML back out to disk. It can simply be a separate Java class that spits out a "reduced" XML stream that Step 3 then reads, in the same way that a script might pipe data to another script.
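One possible shape for that Step 1 stage is a class that copies only the wanted "parent" subtrees from an input stream to an output stream using the StAX event API. This is a sketch under assumptions: `ParentFilter` and the `reduced` wrapper element are made up, and the set of wanted parent names stands in for the external file.

```java
import java.io.Reader;
import java.io.Writer;
import java.util.Set;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class ParentFilter {
    // Copies every wanted parent element (with its contents) into a <reduced> wrapper.
    static void filter(Reader in, Writer out, Set<String> wantedParents)
            throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        XMLEventFactory events = XMLEventFactory.newInstance();

        // Hypothetical wrapper element so the output has a single root.
        writer.add(events.createStartElement("", "", "reduced"));

        int depth = 0; // greater than 0 while inside a wanted parent
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                if (depth > 0) {
                    depth++; // child element inside a wanted parent
                } else if (wantedParents.contains(
                        event.asStartElement().getName().getLocalPart())) {
                    depth = 1; // entering a wanted parent
                }
            }
            if (depth > 0) {
                writer.add(event); // copy the event through unchanged
            }
            if (event.isEndElement() && depth > 0) {
                depth--; // reaches 0 again when the wanted parent closes
            }
        }

        writer.add(events.createEndElement("", "", "reduced"));
        writer.flush();
        writer.close();
        reader.close();
    }
}
```

Because it works on a `Reader` and a `Writer`, the same method could read the big file and write either to disk or to a pipe that the extraction step reads from.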

    Note also that in coming up with the solution, I haven't even considered Java; I've simply looked at what I need to do.

    Now, it may not be exactly what you want, but I strongly urge you to think before you code.

    Winston
    Sterling Crapser
    Ranch Hand

    Joined: Jun 05, 2006
    Posts: 55
    Thank you Winston (and others) for your feedback.

    Your first two points are spot-on. I am in deep over my head. Everything I touch I need to learn the what, when and why of before I can move forward. There is so much about Java that is simply foreign to me at this stage that I feel like I'm wading through quicksand. I'm acutely aware of my limitations and have no problem declaring them to others...perhaps to a fault. But that's the way I am. The company managers do not comprehend what programming entails. They know I have years of experience working with PowerBuilder and sent me to a 5 day Java seminar. So now they think I can just sit down and knock off a bunch of code effortlessly.

    The company I work for is a not-for-profit financial institution (50 billion in assets). There are roughly 10 experienced Java developers and a half dozen of us legacy developers (PowerBuilder and Visual Basic) making the transition to Java. The IT department has about 100 people on staff. We have a mainframe, web services, a website, an Oracle database, off-site servers, disaster recovery, phone menus tied to applications and telephone customer service...you name it. There's a lot of politics within the company and literally no standards or protocols so what you do and how you do it is in the eye of the beholder. My manager has no programming experience at all. His background is MBA. When I express my challenges, the standard reply I get is, "Nobody else is telling me they are having trouble so you need to hunker down and get it done." What he doesn't realize is that all of us who are making the transition are struggling but I'm the only one who is vocal about it. Privately the others have all confided in me they are at a loss as to what they are doing just as I am. Most of us are using existing code, reverse engineering it, and using what we can to more or less clone solutions. But this project goes way beyond being able to do anything like that.

    I want to learn Java development. I want to understand what I'm doing and why. But the company has no time for such foolishness. I'm supposed to pull a rabbit out of thin air!

    I understand (in layman's terms) what you are suggesting. But it is all over my head at the coding level. To implement any of this stuff I need to learn more and I simply do not have the time. I have a meeting tomorrow and I'm putting this all on the table. They want this thing ready for testing by the end of the month and it is simply not going to happen. At best, all I can do is extend the existing code (which is ridiculous I know) with the time I have. But it is what it is.
    Winston Gutkowski
    Bartender

    Joined: Mar 17, 2011
    Posts: 8419
        

    Sterling Crapser wrote:I'm acutely aware of my limitations and have no problem declaring them to others...perhaps to a fault. But that's the way I am. The company managers do not comprehend what programming entails. They know I have years of experience working with PowerBuilder and sent me to a 5 day Java seminar. So now they think I can just sit down and knock off a bunch of code effortlessly.

    Then you might want to point them to this page by Peter Norvig (or print it out for your meeting). You might also want to quote them the "Good...Fast...Cheap" aphorism - management love soundbytes.

    I've never worked specifically with PowerBuilder, but I have worked with tools like it; and there IS a big initial learning hump in going "back" to a language like Java, where everything (initially) seems to be in bits and bytes.

    After a while, you'll discover that it provides you with the tools to create your own "builder" - by which I mean high-level - objects; and, once you do get over that hump, you'll probably proceed quicker than guys who are completely new to programming, because the development lifecycle isn't that much different. However, that process takes time, which is the one thing that you don't seem to have.

    Sterling Crapser wrote:I understand (in layman's terms) what you are suggesting. But it is all over my head at the coding level. To implement any of this stuff I need to learn more and I simply do not have the time. I have a meeting tomorrow and I'm putting this all on the table. They want this thing ready for testing by the end of the month and it is simply not going to happen.

    Then the only other thing I can suggest is that you lay it out clearly and honestly for them. I'd also suggest that you read the existing program until you understand it backwards, forwards and sideways, since:
    (a) You're likely to have to change it.
    (b) It should give you a lot of pointers as to how to write something similar in Java.
    Also: Break up the problem in your mind as much as you can, and write out the steps in detail in pseudo-code (or plain English) before you write one line of Java code.

    Good luck. It sounds like you're going to need it.

    Winston
    Sterling Crapser
    Ranch Hand

    Joined: Jun 05, 2006
    Posts: 55
    Thanks again for your feedback. I have managed to get the message across that this is new territory for me. I have been given more time and they understand what I'm dealing with. But they still want the same solution ultimately which means I must figure out a way to program it. I also have a couple experienced Java developers who can help. How much they can help depends on what they have worked with.

    I'm focusing on learning what I need to understand for this project only and ignoring things that will not apply. I guess that's how we all learn in the long run. For example, I know absolutely nothing about working with GUI objects or web services right now. What I'm doing is all back-end processing.

    So now I have a question in the "Java in General" section about Iterators if you care to take a look. I never heard of Iterators until Java. Previously it was always an array index in a FOR loop.

    Never a dull moment!
     