Hi all, I have to build a parser for XML documents that may run to nearly 100,000 lines. I understand that a DOM parser will cause serious runtime problems at that size, in both memory usage and time, while a SAX parser may not give me the flexibility I want. The tag format of the XML is as follows
I have come up with a solution where a SAX parser retrieves the required child tag for me, and then a DOM parser takes over and provides the required functionality. I would like your comments on my approach, or suggestions for a different one. Devesh [ August 07, 2003: Message edited by: Devesh H Rao ]
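A minimal sketch of the hybrid idea described above, for illustration only: SAX streams through the whole document, and a small DOM tree is built only for each element whose name matches a target tag. The class names `FragmentHandler` and `HybridDemo`, the tag name `child`, and the `name` attribute and `item` elements are all hypothetical, since the actual tag format is not shown in the post.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Streams with SAX and builds a DOM fragment only for each matching element. */
class FragmentHandler extends DefaultHandler {
    private final String targetTag;              // hypothetical tag name, e.g. "child"
    private final Consumer<Element> onFragment;  // called with each completed fragment
    private Document doc;                        // non-null only while inside a target element
    private Node current;                        // DOM node currently being filled

    FragmentHandler(String targetTag, Consumer<Element> onFragment) {
        this.targetTag = targetTag;
        this.onFragment = onFragment;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (doc == null) {
            if (!qName.equals(targetTag)) return;  // outside any fragment: ignore
            try {
                doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            } catch (ParserConfigurationException e) {
                throw new SAXException(e);
            }
            current = doc;
        }
        Element e = doc.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            e.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        current.appendChild(e);
        current = e;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (doc != null) {
            current.appendChild(doc.createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (doc == null) return;
        Node parent = current.getParentNode();
        if (parent == doc) {                       // the fragment root just closed
            onFragment.accept((Element) current);  // hand the small DOM tree to the caller
            doc = null;                            // drop it so it can be garbage collected
            current = null;
        } else {
            current = parent;
        }
    }
}

class HybridDemo {
    /** Returns "name:itemCount" for every <child> element in the input. */
    static List<String> summarize(String xml) throws Exception {
        List<String> out = new ArrayList<>();
        FragmentHandler handler = new FragmentHandler("child", frag ->
                out.add(frag.getAttribute("name") + ":"
                        + frag.getElementsByTagName("item").getLength()));
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return out;
    }
}
```

Only one fragment is held in memory at a time, so peak memory depends on the largest matching element rather than on the size of the whole file. For a real file, the `StringReader` would be replaced by a file-based `InputSource`.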
A hybrid approach like the one you propose is perhaps the best way to strike a balance between functionality and performance. It is clear from your post that performance is an issue; what is unclear is how the parsed document is being used. Here are some thoughts -
If the requirements call for searching deeply embedded nodes, consider restructuring your XML document to see whether you can promote those nodes to a higher level. If the documents are sent by a third party, you can alter the structure to suit efficient parsing by running one or more XSLT transformations before the application parses them.
Consider implementing an "external reader" pattern: read the document to be parsed using a standard FileReader and split it into multiple smaller XML documents. You can then parse these ad hoc documents incrementally.
Finally, question your design: why did you end up with an XML document so large? You mention that the size can be up to 100,000 lines, but how complex is the XML document structure? How deep are your nodes? Are you trying to put too much data into the document? Can you normalize the structure to make it less complex? Can you use an incremental approach?
As you can see, you have to confront this issue from two different angles: arriving at strategies for optimal parsing and, at the same time, revisiting your data architecture. It is very easy to abuse XML in application design, so make sure your reasons for using it are well justified. Cheers,
Open Group Certified Distinguished IT Architect. Open Group Certified Master IT Architect. Sun Certified Architect (SCEA).
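One way the "external reader" suggestion above might look, as a rough sketch under strong assumptions: each record is a non-nested element named by a hypothetical `tag` argument, and its opening and closing tags each sit on their own line. The third-party format is not shown in the thread, so none of this is guaranteed to hold for the real files. The class name `ChunkReader` and the `item` element are also made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ChunkReader {
    /**
     * Returns the next record as a standalone XML string, or null at end of input.
     * Assumes records do not nest, are not self-closing, and that the opening
     * and closing tags are each on their own line.
     */
    static String nextChunk(BufferedReader in, String tag) throws IOException {
        StringBuilder chunk = null;
        String line;
        while ((line = in.readLine()) != null) {
            String t = line.trim();
            if (chunk == null && (t.startsWith("<" + tag + ">") || t.startsWith("<" + tag + " "))) {
                chunk = new StringBuilder();       // a new record starts on this line
            }
            if (chunk != null) {
                chunk.append(line).append('\n');
                if (t.endsWith("</" + tag + ">")) {
                    return chunk.toString();       // record complete
                }
            }
        }
        return null;                               // end of input (an unfinished record is dropped)
    }

    /** Parses each record with DOM, one at a time, and counts its <item> descendants. */
    static List<Integer> itemCounts(Reader source, String tag) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        List<Integer> counts = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String chunk;
            while ((chunk = nextChunk(in, tag)) != null) {
                Document d = builder.parse(new InputSource(new StringReader(chunk)));
                counts.add(d.getDocumentElement().getElementsByTagName("item").getLength());
            }
        }
        return counts;
    }
}
```

For a file on disk, `source` would be something like `new FileReader(path)`. Line-based splitting is fragile compared with a real parser, so it only makes sense when the file layout is known and regular.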
Hi Ajith, thank you for your reply. I agree with you that the document is quite huge, but it cannot be helped: it comes from a third party, and we cannot alter its design or pattern. We have to use it as is, because it is used by others as well. Our team was looking at ways to reduce the burden on the application we are going to write. Your second suggestion is new to us, and we could try it out to see how it works. For now we have come up with the following design:
Application ----> XMLWrapper ----> XMLParser/s ----> Third party XML
We will be trying out various combinations at the parser level, with the wrapper providing the interface to our application, so that any change in how the XML is processed does not affect the application. Thanks for the suggestion. Regards, Devesh
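The wrapper layer in that design could be sketched roughly as follows. The interface name `XmlWrapper`, its single `countElements` method, and the two implementations are hypothetical stand-ins; the real wrapper would expose whatever operations the application needs. The point is only that the application codes against the interface, so the parser behind it can be swapped.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** What the application sees; it never touches a parser directly. */
interface XmlWrapper {
    int countElements(String xml, String tag) throws Exception;
}

/** DOM-backed implementation: simple, but loads the whole tree into memory. */
class DomWrapper implements XmlWrapper {
    @Override
    public int countElements(String xml, String tag) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getElementsByTagName(tag)
                .getLength();
    }
}

/** SAX-backed implementation: streams the input and keeps only a counter. */
class SaxWrapper implements XmlWrapper {
    @Override
    public int countElements(String xml, String tag) throws Exception {
        int[] count = {0};
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        if (qName.equals(tag)) count[0]++;
                    }
                });
        return count[0];
    }
}
```

The application would hold an `XmlWrapper` reference and could be handed either implementation, which makes it possible to try different parser combinations without changing application code.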
I think the SAX approach is definitely your best bet. SAX gives you just about anything you could want; it's really a matter of what -you- do with that data and how you handle it. That of course depends on your needs for retrieving and using the data. If you can elaborate on what you're doing, what data you need, how you need to go about retrieving it, whether you are able to do it incrementally, etc., perhaps I could try to provide some suggestions.
Hi Ken, OK, no doubt SAX is the better approach, but it gives me no way to modify the XML: I won't be able to update or add to it. To give an example: I have a parent metadata XML with some content, and the related child XML file(s) hold data related to the metadata XML. I need control over both, because I need to write to the XML as well as read it. The metadata XML is quite manageable, but the child XML file(s) may run to thousands of lines with n levels of tag nesting (I say n because the tags are dynamically generated and I do not know the depth beforehand). The problem is basically processing the child files because of their size. We are trying to come up with a design that will not burden the system and is still portable enough; I have given a snapshot of the design in my post above. We have thought about processing the data incrementally, but no code has been written for that approach yet, so I can't comment on how well it works. In fact, I would like some pointers on it; any suggestions are welcome. [ August 08, 2003: Message edited by: Devesh H Rao ]
So you have to work with multiple XML files simultaneously that contain data related to each other, and you need to be able to write as well as read? Do you have to write to select parts of the existing file, or are you going to be writing to a new file? Given the format of the file and the use of the data, would it be possible to process it incrementally? I understand you haven't tried, but knowing what the data is, how the file is structured, and what you need to use it for should give you an idea of whether it's even logically viable in the first place.

The thing I would look at is whether you can write some helper classes and create your own data structure that does what you need, and then use the information from that to make changes to the file or write a new file. I'm currently working on a project involving XML in which the files are 'indefinitely' large: I know for sure there will be files with more than 500,000 lines, and I have to plan for 2,000,000 plus. SAX gives you access to just about anything you could want; find some way to store the data so that it meets your needs.

In my case I could handle the data incrementally, so I created my own "Element" class, which simply represents an element and contains not only its data but all the information I need to collect that data and to generate output, including its parent, attributes, children, and location within the file. I simply store these in a collection. When I'm done with that part of the XML I generate output, creating or appending to the new XML file as needed.

I don't know what your situation calls for. Parsing with SAX is a given; how you handle it from there is the question. If you can do it with DOM and that meets your needs, then that's the ticket. If it doesn't, realize that you don't have to use SAX or DOM to store, access, and manipulate the data: you can create your own way of doing that and simply use SAX to get the initial information.
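A custom element class of the kind described above might look something like this. It is only a guess at the shape, not the poster's actual code: the names `Elem` and `ElemHandler` are invented, and using the SAX `Locator` line number as the "location within the file" is an assumption.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

/** Lightweight stand-in for a DOM node, holding only what the application needs. */
class Elem {
    final String name;
    final Elem parent;   // null for the root
    final int line;      // line reported by the parser at the start tag, or -1 if unknown
    final Map<String, String> attributes = new LinkedHashMap<>();
    final List<Elem> children = new ArrayList<>();
    final StringBuilder text = new StringBuilder();

    Elem(String name, Elem parent, int line) {
        this.name = name;
        this.parent = parent;
        this.line = line;
    }
}

/** Fills Elem objects from SAX events. */
class ElemHandler extends DefaultHandler {
    private Locator locator;  // supplied by the parser, if it supports locators
    private Elem current;     // element whose content is currently being read
    Elem root;

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        int line = (locator != null) ? locator.getLineNumber() : -1;
        Elem e = new Elem(qName, current, line);
        for (int i = 0; i < atts.getLength(); i++) {
            e.attributes.put(atts.getQName(i), atts.getValue(i));
        }
        if (current == null) root = e; else current.children.add(e);
        current = e;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) current.text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // An incremental version would generate output for `current` here
        // and detach it from its parent to free memory.
        current = current.parent;
    }

    static Elem parse(String xml) throws Exception {
        ElemHandler handler = new ElemHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.root;
    }
}
```

As written this keeps the whole tree, which would not scale to very large files; it is the processing at `endElement` (emit output, then discard the subtree) that would make it incremental.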
Look into using a locator as well; you should be able to find the part of the file where the data you want to change is, and change it. Anyway, I hope something in this has provided some tidbit of aid. I know I had a lot of trouble finding a solution to my problem, and I tried any number of designs. Nothing worked until I realized I was just going to have to do the work myself, including creating an algorithm to generate output. Surprisingly, it wasn't that difficult, so don't limit yourself. EDIT: And you might take a look at JDOM. I haven't had a chance to look into it in depth, but based on what others have said it might just have the flexibility you need. In fact, here's the quote:
SAX is read-only. Why don't you use JDOM, which only load data into memory when you need them, it can handle large files.
I had already solved my problem at the time and didn't have the time to look into it to see if it would work better (yet anyway) so I have no idea if it will work for you. [ August 08, 2003: Message edited by: Ken Blair ]
Thanks, Ken. We are currently using SAX for the initial loading/caching only, and experimenting with various techniques for further processing. JDOM is something we will look into, along with the FileReader/FileWriter method suggested by Ajith.
Don't expect miracles with JDOM. It does not load data lazily: it builds the whole document tree in memory, just as a DOM parser does (by default it uses a SAX parser to build that tree), and lets you manipulate the structure through standard Java util collections rather than the native DOM interfaces. The DOM Level 3 spec talks about events and making changes to a DOM structure on the fly. If you are using parsers implementing earlier specs (presumably you are), you can add/edit nodes and then serialize the whole document to an external file.
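As an illustration of that last point, here is a small sketch using the standard JAXP APIs: parse into a DOM, add a node, then serialize the whole document with an identity `Transformer`. The class name `DomEdit` and the `item` element are hypothetical, and the example writes to a string for brevity.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DomEdit {
    /** Appends <item>value</item> to the root element and returns the serialized document. */
    static String addItem(String xml, String value) throws Exception {
        // Parse the whole document into memory.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Edit the tree in place.
        Element item = doc.createElement("item");
        item.appendChild(doc.createTextNode(value));
        doc.getDocumentElement().appendChild(item);

        // Serialize the entire tree; a Transformer created with no stylesheet copies its input.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

To write to an external file instead, the `StreamResult` could wrap a `java.io.File`. The whole document is rewritten on every save, which is the cost to weigh against the size of the child files.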