Validating XML file beforehand using SAX

Neeraj Dheer
Ranch Hand

Joined: Mar 30, 2005
Posts: 225
I am passing an InputStream (which is a FileInputStream) to a SAX XML parser. The parser parses the file, breaking it into smaller fragments and passing them on to a JMS publisher. Now, the issue is that if the file is not valid XML (say the closing root tag is missing), the parser still parses the file and throws an Exception only when it reaches the point where the XML is actually invalid, which means that I have already sent some messages to the publisher before realising that the file is invalid. So, to prevent that, we need to know beforehand whether the file is invalid.

Now, I could always break up the file and store the fragments in a List or something and send them all at once after parsing the file, but here are the issues with that:

1. The file is normally HUGE (a gig or more), hence the choice of a SAX parser over a DOM parser.

2. For the above reason, I can't really store all the fragments before sending them out. I run out of memory.

So, the only solution was to parse the file twice: time-consuming, but not as hard on the memory. And hence the need to read from the same stream twice.

I have to use either SAX or DOM, preferably SAX because I have the entire code in it. I can't use any other parser, though I would have loved to use JDOM.

Any workaround for this? Hmmm... and now this post probably doesn't sound right in this forum, does it?

Also posted here
but this sounded like the right forum. Sorry if this constitutes cross-posting.
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12781
Two passes with the SAX parser sounds like the best solution. You could optimize the two passes by turning off validation for the second pass.
The first pass can keep some rudimentary debugging information, such as the most recent startElement and character data, and of course capture the line number information for any thrown exception.
Don't even think about using DOM with a huge file, since a DOM takes several times the size of the file to store, due to all the objects being created, 16-bit Unicode characters, etc.
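A rough sketch of what such a first pass might look like (class and method names here are hypothetical, and this checks well-formedness only, not schema validity):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormednessCheck {

    // Handler that remembers the most recent start tag and the parser's position.
    static class TrackingHandler extends DefaultHandler {
        String lastElement = "(none)";
        Locator locator;
        public void setDocumentLocator(Locator l) { this.locator = l; }
        public void startElement(String uri, String local, String qName, Attributes atts) {
            lastElement = qName;
        }
    }

    /** Returns null if the document is well-formed, otherwise a message
     *  describing where the first pass failed. */
    static String check(byte[] xml) {
        TrackingHandler h = new TrackingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new ByteArrayInputStream(xml)), h);
            return null;
        } catch (Exception e) {
            String pos = (e instanceof SAXParseException)
                ? " at line " + ((SAXParseException) e).getLineNumber() : "";
            return "Failed after <" + h.lastElement + ">" + pos + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(check("<root><a/></root>".getBytes())); // well-formed: prints null
        System.out.println(check("<root><a/>".getBytes()));        // missing </root>: prints a message
    }
}
```

In the real code the byte[] would of course be replaced by the incoming stream; only if check returns null would the second, publishing pass run.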
Bill
Andy Hahn
Ranch Hand

Joined: Aug 31, 2004
Posts: 225
An alternative would be to use Castor. Then you could use the validate() method before you marshal the XML. This will validate the XML against a specified schema. We do this on our project and it works great. Also, Castor uses SAX behind the scenes, so you don't need to worry about the intricacies of the SAX API.
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12781
I was under the impression that Castor always creates a single object from an XML document. Given the huge size of the document in this case, that is not going to help a bit.
The poster's problem is detecting ill-formed XML at minimum cost.
Bill
Neeraj Dheer
Ranch Hand

Joined: Mar 30, 2005
Posts: 225
Bill,

The issue with parsing twice is this:
Since I have already read from the InputStream while validating the file, i.e., during the first pass, there is no way I can re-read from the same InputStream, because there is no way to 'reset' the stream to read from the beginning again.

I have already tried using 'mark' and 'reset', but those methods are not supported (that is a question for another forum).
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12781
Since we are talking about a FileInputStream, just open a new one on the same file after closing the one used in the validation pass. The streams that support mark and reset use in-memory buffers; since we are talking about huge files, this is not practical.
Bill
Neeraj Dheer
Ranch Hand

Joined: Mar 30, 2005
Posts: 225
Bill,

I have tried that as well. Like I said, what my class gets is an InputStream. I don't know the name/path of the file directly. I just pass this InputStream to the parser for parsing.

Since I don't know the name/path of the file, this solution also does not work...
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12781
I don't know the name/path of the file directly.

That is certainly NOT the impression I had from your original post.
Since we seem to have established that there is no way you can keep the entire file in memory, you are forced to create a copy to a file location that you can control. (See for example the java.io.File method createTempFile() that automatically uses the system temp directory.)
You could do this during the first pass of the SAX parser. One way would be by having all of the methods that receive events during parsing write the data to the temporary file. Another approach would be to write your own implementation of InputSource that writes to the temporary file as it provides a character stream to the SAX parser.
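The first of those two approaches might be shaped roughly like this: a stream wrapper that copies every byte the parser reads into a temp file. TeeInputStream is my name for it, not a JDK class, so treat this as a sketch:

```java
import java.io.*;

/** A sketch of an InputStream wrapper that copies everything it reads
 *  to a second stream, so the data can be re-read for a second parse. */
public class TeeInputStream extends FilterInputStream {
    private final OutputStream copy;

    public TeeInputStream(InputStream in, OutputStream copy) {
        super(in);
        this.copy = copy;
    }

    public int read() throws IOException {
        int b = super.read();
        if (b != -1) copy.write(b);
        return b;
    }

    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) copy.write(buf, off, n);
        return n;
    }

    public void close() throws IOException {
        super.close();
        copy.close();
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("xmlcopy", ".xml"); // lands in the system temp dir
        tmp.deleteOnExit();
        InputStream original = new ByteArrayInputStream("<root/>".getBytes());
        InputStream teed = new TeeInputStream(original, new FileOutputStream(tmp));
        // First pass: hand 'teed' to the SAX parser; here we just drain it.
        while (teed.read() != -1) { /* parser would consume here */ }
        teed.close();
        // Second pass would re-open the temp copy with a new FileInputStream.
        System.out.println("copy size: " + tmp.length()); // copy size: 7
    }
}
```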
Bill
Neeraj Dheer
Ranch Hand

Joined: Mar 30, 2005
Posts: 225
Sorry, Bill, if my first post was not clear...

I just get an InputStream, which I send to the parser; I break the XML file up into fragments and pass them on to an MDB that publishes them.

So, apart from writing a temp file, there is no other way now, is there?

Is there no way to clone the InputStream or get the file name and path? I ask because the application will be running in a production environment and I am not sure whether the app will have rights to write a temp file; plus, if there are many simultaneous requests, the amount of hard disk space needed would be huge as well...
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12781
It looks like the java.nio.channels.FileChannel class might provide a way to let you manipulate the underlying file system to reset the reading position of a FileInputStream. I have not tried to use this class so I can't say for sure. Let us know if you find a way to do it.
Bill
Neeraj Dheer
Ranch Hand

Joined: Mar 30, 2005
Posts: 225
Bill,

Can't use NIO; we are on JDK 1.3.x.
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12781
Ok, we are stuck with Java 1.3. I can't think of any option except creating a temp file copy that you can control. If you take a look at the FileInputStream source code, you will see that all of the significant methods are native, and the link to the real operating-system file is hidden inside a FileDescriptor that you can't manipulate.
If anybody complains about the performance, tell them that upgrading to the latest Java would be a BIG improvement.
Bill
Neeraj Dheer
Ranch Hand

Joined: Mar 30, 2005
Posts: 225
Ben,

Looks like another dead end. Will try to find out if I can write to a temp file.

On the side, is JDOM as memory-hungry as DOM, or would it be worth porting the entire code to JDOM (or some other parser)?

Thanks a lot, Ben...
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12781
Any approach that builds an all-in-memory structure to contain the data will be HUGE, potentially 3 or more times the size of the XML document file. Remember, Java characters are 16 bits in memory, and then there will be Java objects for every part of the document.

I think you are stuck with the temp file approach.

Bill
Rajagopal Manohar
Ranch Hand

Joined: Nov 26, 2004
Posts: 183
Originally posted by Neeraj Dheer:
Ben,

Looks like another dead end. Will try to find out if I can write to a temp file.

On the side, is JDOM as memory-hungry as DOM, or would it be worth porting the entire code to JDOM (or some other parser)?

Thanks a lot, Ben...


In my experience with JDOM and Java 1.3, I found the document objects taking up 0.9 times the file size, compared to DOM, which took 3.8 times.

Surprisingly, in Java 1.5 JDOM occupies more memory than DOM (no idea why), the factors being around 6 and 4 respectively.
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12781
Rajagopal - thanks for the real test case information. That certainly is a surprising difference.
Due to Java's "intern"-ing of Strings, I imagine that there would be a big difference between the in-memory size of XML with many repeated values versus a document with largely unique values.
Bill
Neeraj Dheer
Ranch Hand

Joined: Mar 30, 2005
Posts: 225
Bill,

Will the following work satisfactorily?

1. Since I have an InputStream, read the entire file into a StringBuffer, then convert this into a ByteArrayInputStream, which I then pass as an argument to the parsing methods. I'm not sure we can read from the same ByteArrayInputStream twice, but now we have the StringBuffer, which we can read twice.
Pros: goal achieved!
Cons: during the first pass, we consume twice the memory, storing the StringBuffer as well as the ByteArrayInputStream. But before the second pass, we can null the buffer, which means that when the actual parsing takes place, only the ByteArrayInputStream will be held in memory.

2. If the parser accepts a Reader, I can do something like the following:



where fileContent is the original FileInputStream, and then pass this StringReader object to the parser.
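The CODE block in the post above did not survive the archive, but what it describes would look roughly like this (a sketch; slurp is a hypothetical helper name):

```java
import java.io.*;

public class ReusableReaderDemo {

    /** Read the whole stream into a StringBuffer (JDK 1.3 friendly),
     *  so that any number of fresh StringReaders can be created from it. */
    static StringBuffer slurp(InputStream fileContent) throws IOException {
        Reader in = new InputStreamReader(fileContent);
        StringBuffer buf = new StringBuffer();
        char[] chunk = new char[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.append(chunk, 0, n);
        }
        return buf;
    }

    public static void main(String[] args) throws IOException {
        // 'fileContent' stands in for the FileInputStream the class receives.
        InputStream fileContent = new ByteArrayInputStream("<root><a/></root>".getBytes());
        StringBuffer buf = slurp(fileContent);
        Reader pass1 = new StringReader(buf.toString()); // validation pass
        Reader pass2 = new StringReader(buf.toString()); // real parsing pass
        System.out.println(buf.length()); // 17
    }
}
```

Note the cost: the whole document is held in memory as characters, which (as Bill points out below) takes twice the space of the raw bytes.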

(Edited: had not closed the 'CODE' properly.)
[ September 16, 2005: Message edited by: Neeraj Dheer ]
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12781
If you have an InputStream it will read bytes, not characters. This is a GOOD thing, since characters take twice the space.

One way to get the entire InputStream into memory would be to read bytes and write them to a ByteArrayOutputStream. When you have reached the end, you can get a byte[] with the entire contents via the toByteArray method. Unfortunately, due to the way ByteArrayOutputStream handles automatic expansion, by repeatedly reallocating a larger byte[] and copying, you won't be able to get anywhere near full use of memory, and it will be slower due to the multiple copying.

If this were my problem, I would create a huge byte[] at the very start and just fill it from the InputStream. You could select the size by looking at available memory and allowing for the memory used in later processing. This byte[] can be read any number of times by creating a ByteArrayInputStream from it.
Stick with byte[] and stay away from Strings, since they take twice the space.
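A minimal sketch of that pre-allocated-buffer idea (the buffer size and the fill helper are illustrative, not from the thread):

```java
import java.io.*;

public class PreallocatedBuffer {

    /** Fill a caller-sized byte[] from the stream; returns the number of bytes read. */
    static int fill(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        int n;
        while (total < buf.length && (n = in.read(buf, total, buf.length - total)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[16 * 1024 * 1024]; // in practice, sized from available memory
        InputStream source = new ByteArrayInputStream("<root/>".getBytes());
        int len = fill(source, buf);
        // Each pass gets its own ByteArrayInputStream over the same byte[].
        InputStream pass1 = new ByteArrayInputStream(buf, 0, len); // validation pass
        InputStream pass2 = new ByteArrayInputStream(buf, 0, len); // publishing pass
        System.out.println(len); // 7
    }
}
```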
Bill
Rajagopal Manohar
Ranch Hand

Joined: Nov 26, 2004
Posts: 183
Since I have an InputStream, read the entire file into a StringBuffer

Basically you are trying to hold a 1 GB file in memory.

I presume preventing exactly that scenario is why you are avoiding parsers like DOM and JDOM...

I am not sure about DOM, but I think JDOM also internally holds the document as a StringBuffer.

Regards,
Rajagopal
Neeraj Dheer
Ranch Hand

Joined: Mar 30, 2005
Posts: 225
Rajagopal,

I was trying to avoid reading the entire file into memory, but now that seems to be the only option left.

By reading the entire file into memory, I am accepting the worst-case scenario in terms of time and memory. Going by the posts above and the discussion of JDOM, the best/worst-case scenarios depend on the contents of the file being parsed, in which case I may be in for nasty surprises.

If the memory consumed (which is the primary factor here) using JDOM versus reading the file into memory is almost the same, I would want to stick with reading the file into memory and persist with my existing code, because that code contains the logic for whatever I need to do while actually parsing the file and has been tested thoroughly. Changing the code to JDOM without much benefit would mean rewriting the entire code and going through the test cycles again.

Hence I wouldn't want to touch that part of the code (parsing the file using SAX) unless there is some really huge advantage to it.
Rajagopal Manohar
Ranch Hand

Joined: Nov 26, 2004
Posts: 183
Originally posted by Neeraj Dheer:
Rajagopal,

I was trying to avoid reading the entire file into memory, but now that seems to be the only option left.

By reading the entire file into memory, I am accepting the worst-case scenario in terms of time and memory. Going by the posts above and the discussion of JDOM, the best/worst-case scenarios depend on the contents of the file being parsed, in which case I may be in for nasty surprises.

If the memory consumed (which is the primary factor here) using JDOM versus reading the file into memory is almost the same, I would want to stick with reading the file into memory and persist with my existing code, because that code contains the logic for whatever I need to do while actually parsing the file and has been tested thoroughly. Changing the code to JDOM without much benefit would mean rewriting the entire code and going through the test cycles again.

Hence I wouldn't want to touch that part of the code (parsing the file using SAX) unless there is some really huge advantage to it.


The only hitch is that you will have to parse a HUGE file twice with your approach. Once you use a tree model, you need to parse only once.
Neeraj Dheer
Ranch Hand

Joined: Mar 30, 2005
Posts: 225
Rajagopal,

Yes, I understand that I will have to parse twice, which I certainly wanted to avoid.

If I use a tree model, then the size of the tree will depend on the structure/contents of the XML file. Although the structure is fixed, the contents will vary, and hence the size of the tree will vary. Also, I would have to change tried-and-tested code in production.

Because of the above, I am, at the moment, willing to 'waste' the time required to parse the file once, since I do have control over the other parameters, unless of course, as I have previously mentioned, there is a significant improvement in time/memory.

Bill, until I find anything else, I plan to stick with the above approach of reading the file first and parsing it twice, since that makes the most sense to me. What say?
Guru Radhakrishnan
Greenhorn

Joined: Sep 13, 2005
Posts: 7
I had a similar problem, and I chose to let the system behave the way it does, i.e. let it throw the exception. Say you are publishing fragmented messages 1, 2 and 3 of the same document. The receiving MDB application looks for an end-of-fragments marker to close the logical message group; the sending application publishes the exception itself as the last fragment, and the validator is applied at the receiving end by regrouping the XML messages. Obviously the regrouped document will fail validation, and hence the message is known to be invalid. I do not know if this would help in your current situation or not.
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12781
Bill, until I find anything else, I plan to stick with the above approach of reading the file first and parsing it twice, since that makes the most sense to me. What say?

I think that is the approach you will have to take. I bet you will find that the initial validating SAX pass is a lot faster than you expect. In any case, please let us know how it works out.
Bill
Gernot Greimler
Greenhorn

Joined: Sep 08, 2005
Posts: 6
I didn't read the whole thread, but here is how I do the validation before inserting stuff into the database:

file is a String with the absolute path + filename
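The code from this post did not survive the archive. Given a file path, a pre-insert check along these lines is one plausible shape (a sketch with hypothetical names; this one checks well-formedness and escalates recoverable errors, rather than validating against a schema):

```java
import java.io.File;
import java.io.FileWriter;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class ValidateBeforeInsert {

    /** Parse the file once with a handler that turns warnings and
     *  recoverable errors into exceptions; fatal errors already throw. */
    static boolean isValid(String file) { // absolute path + filename
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser parser = factory.newSAXParser();
            parser.parse(new File(file), new DefaultHandler() {
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void warning(SAXParseException e) throws SAXException { throw e; }
            });
            return true;
        } catch (Exception e) {
            return false; // parse failed: do not insert
        }
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("demo", ".xml");
        f.deleteOnExit();
        FileWriter w = new FileWriter(f);
        w.write("<root><a/></root>");
        w.close();
        System.out.println(isValid(f.getAbsolutePath())); // true
    }
}
```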

Neeraj Dheer
Ranch Hand

Joined: Mar 30, 2005
Posts: 225
Guru,

That is what I had originally proposed: sending some additional information along with each message fragment. But at the time that led to other design problems, and the idea was discarded. In my case, the receiving MDB inserts each message fragment into the database as soon as it gets it, without waiting for the entire file or anything.

Also, like I have mentioned, I have control only over the parsing class and can't change the MDB code unless I have a very valid reason to. (I do have one in this case, but that would mean changing an entire system already in production, which is the very last option.)
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12781
We are talking about a "huge" document here, "a gig or more"; building any in-memory DOM is very likely to take multiple gigabytes of memory. Remember, the Unicode representation takes 2 bytes per character; see Rajagopal's posts on his experiences with the memory required.
Bill
Weihui Qiu
Greenhorn

Joined: Sep 01, 2005
Posts: 1
Think about this solution (multi-threaded):

1. In the first SAX parser's handler, write the same XML to a PipedOutputStream during parsing/validation.
(Or write the XML fragments to several PipedOutputStreams?)
2. Connect the PipedOutputStream to a PipedInputStream.
3. Pass the PipedInputStream to the next SAX parser.

Good Luck!
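The piped-stream idea above can be sketched like this (roundTrip is a hypothetical helper; a real version would run the two SAX passes instead of the plain copy loops):

```java
import java.io.*;

public class PipedTwoPass {

    /** Copy 'data' through a pipe: the writer thread stands in for the first
     *  (validating) SAX pass, and the caller's read loop for the second pass. */
    static byte[] roundTrip(final byte[] data) throws Exception {
        final PipedOutputStream out = new PipedOutputStream();
        PipedInputStream in = new PipedInputStream(out);

        Thread firstPass = new Thread(new Runnable() {
            public void run() {
                try {
                    out.write(data); // first pass re-emits the XML as it parses
                    out.close();     // signals end-of-stream to the second pass
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        });
        firstPass.start();

        // The current thread stands in for the second parser reading from the pipe.
        ByteArrayOutputStream received = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) received.write(b);
        firstPass.join();
        return received.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] copy = roundTrip("<root><a/></root>".getBytes());
        System.out.println(new String(copy)); // <root><a/></root>
    }
}
```

One caveat for this thread's problem: the pipe's internal buffer is small, so the second pass runs nearly in lockstep with the first, and a late well-formedness error would still surface only after the second pass has consumed most of the document.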
 