Dealing with bad XML

Michael Zalewski
Ranch Hand

Joined: Apr 23, 2002
Posts: 168
I have a web server (which is supplied by an outside vendor), which outputs responses in XML. The problem is that the XML is bad, and occasionally cannot be parsed by my application, which uses Xerces.
The bad XML happens for two reasons.
1) PCDATA content can come from a database and can contain embedded & and <> characters. The server should wrap this content in CDATA sections or use entities, but it doesn't.
2) The XML header produced by the server claims the document is UTF-8, but it is really ISO-8859-1. So if the XML contains any non-ASCII characters, which it occasionally does, the parser gets messed up.
My question: What can I do?
My idea is that I should be able to catch the input stream before it gets to Xerces and fix it up. For example, I could take the header <?xml .. encoding="UTF-8"?> and replace it with <?xml .. encoding="ISO-8859-1"?>. Since the element tags are all known and only 3 levels deep, I can escape the content of those elements that should contain entities ('&' becomes '&amp;', etc.)
But how do I do that? Or is there some other way?
(I know the best way would be to fix the web server that sends the bad XML. But there really is no good way to do that. I don't own the source to the servlet, and the template that this server uses does not contain very flexible string processing capabilities).
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12682
I think you are right - you will have to filter it before it gets to Xerces. Fortunately the java.io classes and XML parsers lend themselves to fooling around with stream content.
You should be able to create a new Reader to handle the unicode conversion...
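A minimal sketch of that Reader idea, assuming the response bytes really are ISO-8859-1 as described above. The class and method names are made up for illustration. Note that when a parser is handed a character stream instead of a byte stream, it reads the characters directly and disregards the encoding declaration in the header, so the header may not need rewriting at all.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Xerces or the vendor package.
class Latin1ReaderSketch {

    // Decode the raw bytes as ISO-8859-1 regardless of what the XML header claims.
    static Reader latin1Reader(InputStream rawBytes) {
        return new InputStreamReader(rawBytes, StandardCharsets.ISO_8859_1);
    }

    // Drain a Reader into a String so the text can be inspected or patched.
    static String readAll(Reader reader) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Simulated server response: header says UTF-8, bytes are ISO-8859-1.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><n>caf\u00e9</n>";
        byte[] raw = xml.getBytes(StandardCharsets.ISO_8859_1);

        String text = readAll(latin1Reader(new ByteArrayInputStream(raw)));
        System.out.println(text.contains("caf\u00e9")); // true
    }
}
```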
Bill


Java Resources at www.wbrogden.com
Roseanne Zhang
Ranch Hand

Joined: Nov 14, 2000
Posts: 1953
Use regular expressions to do the text processing; they are fast and safe. Of course, you need to know what you are doing.
Another thing you can do is output the Xerces error messages to a file for further analysis. That will help you get the text processing right.
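As a rough sketch of the regular-expression approach, here is one way to escape a bare & while leaving existing entity and character references alone. The class and method names are hypothetical, and the pattern only covers ampersands. Stray < and > characters are harder, because the fix has to know which tags are real markup, so they are left out here.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class AmpersandFixSketch {

    // An '&' that is NOT the start of a named entity (&amp;), a decimal
    // character reference (&#38;) or a hex character reference (&#x26;).
    private static final Pattern BARE_AMP = Pattern.compile(
            "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    static String escapeBareAmpersands(String xml) {
        return BARE_AMP.matcher(xml).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escapeBareAmpersands("<name>Tom & Jerry</name>"));
        // <name>Tom &amp; Jerry</name>
        System.out.println(escapeBareAmpersands("<name>a &amp; b &#38; c</name>"));
        // <name>a &amp; b &#38; c</name>
    }
}
```

One caveat: database text that happens to look like an entity reference will be left untouched, so the parser may still reject it as an undefined entity.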
Michael Zalewski
Ranch Hand

Joined: Apr 23, 2002
Posts: 168
Originally posted by William Brogden:
I think you are right - you will have to filter it before it gets to Xerces. Fortunately the java.io classes and XML parsers lend themselves to fooling around with stream content.
You should be able to create a new Reader to handle the unicode conversion...
Bill

I have no problem writing my own Reader. In fact, I'd *love* to use Apache HttpClient to retrieve this XML, only because HttpClient has an effective timeout mechanism. But how do I hook my reader up to Xerces?
Here is the problem. The vendor who supplied the server that writes the bad XML also supplied a Java package to read and parse this XML. The Java package will create a rather long URL based on. The code supplied by the vendor, which I do have access to, does this:

Can I tell the DOMParser where to stick my reader?
[ November 08, 2002: Message edited by: Michael Zalewski ]
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12682
I have been thinking about that....
In the latest JAXP we have DocumentBuilder. DocumentBuilder wants an InputSource, so we need to look at that class. Behold! There is a constructor:
InputSource( java.io.Reader rd )
Cool!
My memory is a little vague on DOMParser - isn't there a parse() that takes a Reader?
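Something along these lines should work with JAXP. It is only a sketch with made-up class and method names, and it uses a StringReader to stand in for whatever filtering Reader gets written. As far as I recall, the Xerces DOMParser also has a parse(InputSource) method, so the same InputSource could be handed to it.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
class ReaderToDomSketch {

    // Hand any Reader (for example a filtering one) to the parser via InputSource.
    static Document parse(Reader reader) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(reader));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // The header still says UTF-8, but the parser reads characters, not bytes,
        // so the declared encoding is not used for decoding.
        Reader fixed = new StringReader(
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?><n>caf\u00e9 &amp; bar</n>");
        Document doc = parse(fixed);
        String text = doc.getDocumentElement().getTextContent();
        System.out.println(text.equals("caf\u00e9 & bar")); // true
    }
}
```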
Bill
 