Sterling Crapser wrote:I've been tasked with writing code that will parse an XML file that's 2 to 4 gigs in size.
I will be parsing around 170 data elements (out of 4000) and writing them to at least 5 tables, two of which have a master/detail relationship. The existing code is parsing about 12 elements so there are blocks of code (parse, set, get) for each element. I think continuing this approach is primitive. If I added all the additional elements the method would be a mile long.
I'm thinking of a data-driven approach where the element names (nodes), table names, column names and all the variable names would be stored in a table. I would read the table, load all this information into arrays, and use them to drive a loop that parses one data element per cycle.
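As a rough illustration of the table-driven idea described above, here is a minimal sketch, assuming each configuration row carries an element name, a table name and a column name. The class names `ElementMapping` and `MappingRegistry` are hypothetical, and loading the rows from the database is left out. A map keyed by element name is used in place of parallel arrays so that each lookup is a single call.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** One configuration row: which table and column an XML element feeds (hypothetical shape). */
final class ElementMapping {
    final String elementName;
    final String tableName;
    final String columnName;

    ElementMapping(String elementName, String tableName, String columnName) {
        this.elementName = elementName;
        this.tableName = tableName;
        this.columnName = columnName;
    }
}

/** Holds the mappings keyed by element name so each lookup is a single map access. */
final class MappingRegistry {
    private final Map<String, ElementMapping> byElement = new LinkedHashMap<>();

    MappingRegistry(List<ElementMapping> rows) {
        for (ElementMapping row : rows) {
            byElement.put(row.elementName, row);
        }
    }

    /** Returns the mapping for a wanted element, or null if the element should be skipped. */
    ElementMapping lookup(String elementName) {
        return byElement.get(elementName);
    }
}
```

With something like this, adding or removing a data element would mean changing a row in the configuration table rather than adding another parse/set/get block to the code.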
"Leadership is nature's way of removing morons from the productive flow" - Dogbert
Articles by Winston can be found here
Sterling Crapser wrote:Instead of "hard-coding" the names of every data element I'm looking for in the code (over 160), I thought I could have a table...
There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors
Sterling Crapser wrote:Thanks for your feedback. I appreciate this discussion for more reasons than just looking for a coding solution. The code I'm working with was written by someone who, for whatever reason, was unable to parse data out of an XML file and insert it into a table any other way. Perhaps she simply didn't know how.
My company has me on a deadline and I'm brand new to Java, Eclipse, and XML. They threw me into the deep end of the pool and that's it.
They want data from the XML file...not all of it...just select pieces and have that data loaded into several tables where financial analysts retrieve the data using materialized views. They don't know or care about how everything works.
Currently they want me to write the new process in such a way that the financial analysts can ask for new data elements to be added (or removed) from the parsing process without having to go through a lot of change control bureaucracy.
We only need about 2% of the data contained in the XML file and unfortunately it's the only data source we have so we have to work with it.
The approach I'm proposing may not be the best or the simplest, but because I'm so new to all this I cannot tell the difference. I know that hard-coding everything is fundamentally wrong, but that's all.
I've encountered discussions of importing XML into a database, using third-party software, discussions about XML Schemas and XSD files...all sorts of things. It's all Greek to me at this point. So I'm going with the approach of writing a Java method that will parse the data out of the file, build a SQL statement, and insert the record into a table (rinse and repeat).
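The "build a SQL statement" step above could look something like the following. This is only a sketch: `InsertSqlBuilder` is a made-up name, and it assumes the table and column names come from a trusted configuration table rather than from the XML itself. It produces a parameterized statement so the parsed values can be bound through a JDBC `PreparedStatement` instead of being concatenated into the SQL text.

```java
import java.util.List;
import java.util.StringJoiner;

final class InsertSqlBuilder {
    /** Builds an INSERT with one "?" placeholder per column, for use with PreparedStatement. */
    static String build(String table, List<String> columns) {
        if (columns.isEmpty()) {
            throw new IllegalArgumentException("at least one column is required");
        }
        StringJoiner names = new StringJoiner(", ", "(", ")");
        StringJoiner marks = new StringJoiner(", ", "(", ")");
        for (String column : columns) {
            names.add(column);
            marks.add("?");
        }
        return "INSERT INTO " + table + " " + names + " VALUES " + marks;
    }
}
```

For a hypothetical table LOAN with columns LOAN_ID and AMOUNT this yields `INSERT INTO LOAN (LOAN_ID, AMOUNT) VALUES (?, ?)`. The statement can be prepared once per table and reused for every record, binding each value with `setString` (or a typed setter) and using `addBatch`/`executeBatch` to cut down on round trips.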
I had a meeting this morning and someone suggested I consider using something called "Collections".
Sterling Crapser wrote:Sorry to be away from this discussion. But I have read all the replies. I will try to answer some of the questions asking what the background of the situation is.
PS: The reason I suggest DOM for parsing each parent is that then you don't have to worry about the order of your minor tags:
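To illustrate the point about tag order, here is a minimal sketch using the JDK's built-in DOM parser. The helper class `DomLookup` and the element names in the usage note are hypothetical, and it assumes the parent fragment is small enough to hold in memory as a string.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DomLookup {
    /** Parses one small parent fragment into a DOM tree and returns its root element. */
    static Element parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement();
    }

    /** Returns the text of the first descendant with the given tag name, or null if absent. */
    static String childText(Element parent, String tagName) {
        NodeList hits = parent.getElementsByTagName(tagName);
        return hits.getLength() == 0 ? null : hits.item(0).getTextContent();
    }
}
```

Calling `DomLookup.childText(parent, "id")` returns the same value whether `<id>` comes before or after `<amount>` inside the parent, because the lookup is by name rather than by position in the stream.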
nested loop cycles through an array of node names looking to see if one of the stored names matches the start element's name.
Sterling Crapser wrote:The process is and will be continuing to use a StAX parser.
Winston Gutkowski wrote:
Sterling Crapser wrote:The process is and will be continuing to use a StAX parser.
I don't think I've suggested any different. What I've suggested is that once you get down to a manageable piece of XML (hopefully, your "parent"), you use a non-sequential parser to get the data elements.
I also think that
(a) eliminating the stuff you know you don't want first (which, from what I can gather, is more than 95% of your original input)
and
(b) creating objects to do your "typed data" extraction
is probably a more "Object-Oriented" way of looking at the problem. You can still use any number of Collection structures to actually hold those objects.
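A rough sketch of that filtering idea, staying with StAX: skip everything outside the parent element, and inside each parent keep only the wanted children. Note that this swaps the DOM step for a plain `Map` per parent, which gives the same order-independent lookup when the wanted children are simple text elements (an assumption here). The class name `ParentExtractor` and the element names in the usage note are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class ParentExtractor {
    /**
     * Streams through the document and returns one map per parent element,
     * holding only the wanted child elements (name to text).
     */
    static List<Map<String, String>> extract(XMLStreamReader reader,
                                             String parentName,
                                             Set<String> wanted) throws XMLStreamException {
        List<Map<String, String>> records = new ArrayList<>();
        Map<String, String> current = null; // non-null only while inside a parent element
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = reader.getLocalName();
                if (current == null) {
                    if (name.equals(parentName)) {
                        current = new LinkedHashMap<>();
                    }
                } else if (wanted.contains(name)) {
                    // Assumes wanted elements contain text only; getElementText
                    // throws if it meets a nested element.
                    current.put(name, reader.getElementText());
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && current != null
                    && reader.getLocalName().equals(parentName)) {
                records.add(current);
                current = null;
            }
        }
        return records;
    }
}
```

The reader would come from `XMLInputFactory.newInstance().createXMLStreamReader(...)` over the big file. For a multi-gigabyte input you would probably hand each completed map to the insert step as soon as its parent closes, rather than collecting them all in a list as this sketch does.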
However, it also sounds like you may be a victim of unreasonable time constraints, so you're trying to come up with a "Band-Aid" solution (see my signature quote). My main worry is that the next one will be a "Band²-Aid".
Winston
Sterling Crapser wrote:I hope what I'm describing is understandable in terms of what I'm trying to do. I would appreciate feedback.
Sterling Crapser wrote:I'm acutely aware of my limitations and have no problem declaring them to others...perhaps to a fault. But that's the way I am. The company managers do not comprehend what programming entails. They know I have years of experience working with PowerBuilder and sent me to a 5-day Java seminar. So now they think I can just sit down and knock off a bunch of code effortlessly.
I understand (in layman's terms) what you are suggesting. But it is all over my head at the coding level. To implement any of this stuff I need to learn more and I simply do not have the time. I have a meeting tomorrow and I'm putting this all on the table. They want this thing ready for testing by the end of the month and it is simply not going to happen.
With a little knowledge, a cast iron skillet is non-stick and lasts a lifetime.