| Author |
XML parsers, encoding and byte order marks
|
Kelly Dolan
Ranch Hand
Joined: Jan 08, 2002
Posts: 103
|
|
I have an XML file that contains the following declaration preceded by a BOM (byte order mark) representing UTF-8:

    (BOM)<?xml version="1.0" encoding="UTF-8"?>...

I need to run this file through an XML parser without modifying the file, and I am currently using xerces.jar (v2.6.2). When I attempt the following, I get the exception shown below. If I uncomment the commented-out line (the getBOMEncoding(bis) call), the parser succeeds. Basically, the getBOMEncoding(bis) method moves the input stream to the first byte *after* the BOM (i.e., it skips it).

My assumption: the parser doesn't recognize or like the existence of the BOM before the XML declaration.

My questions: Am I doing something wrong? Is there a parser/version that recognizes a BOM (although I really need to use the one I'm using)? Is it documented somewhere that the parser I'm using does not support this? Any suggestions are welcome!

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setValidating(validate);
    DocumentBuilder builder = factory.newDocumentBuilder();
    BufferedInputStream bis = new BufferedInputStream(new FileInputStream(input), 5);
    // getBOMEncoding(bis); // this will skip over the BOM if present; our parsers do not handle the existence of a BOM.
    InputSource is = new InputSource(new InputStreamReader(bis, encoding));
    is.setSystemId(input.getParentFile().toURL().toString());
    result = builder.parse(is);

Exception:

    org.xml.sax.SAXParseException: The markup in the document preceding the root element must be well-formed.
        at org.apache.xerces.framework.XMLParser.reportError(XMLParser.java:1067)
        at org.apache.xerces.framework.XMLDocumentScanner.reportFatalXMLError(XMLDocumentScanner.java:626)
        at org.apache.xerces.framework.XMLDocumentScanner$XMLDeclDispatcher.dispatch(XMLDocumentScanner.java:809)
        at org.apache.xerces.framework.XMLDocumentScanner.parseSome(XMLDocumentScanner.java:381)
        at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:952)
        at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:172)
        at scratch.FileOpsNewParser.parseContent(FileOpsNewParser.java:151)
        at scratch.FileOpsNewParser.main(FileOpsNewParser.java:386)

The same exception and stack trace are printed five times in total; the copies differ only in the last frame (FileOpsNewParser.java:386, 387, 391, 392 and 396).
|
 |
Madhav Lakkapragada
Ranch Hand
Joined: Jun 03, 2000
Posts: 5040
|
|
    // getBOMEncoding(bis); // this will skip over the BOM if present; our parsers do not handle the existence of a BOM.

I believe this is an in-house method for your ContentHandler. Did I understand that correctly? Thanks. - m [ September 30, 2004: Message edited by: Madhav Lakkapragada ]
|
|
 |
Madhav Lakkapragada
Ranch Hand
Joined: Jun 03, 2000
Posts: 5040
|
|
Unless I am missing something, your best bet is to do what the commented-out getBOMEncoding(bis) line already does. Having any characters before the prolog violates a well-formedness constraint, hence the fatal exception. Alternatively (academically speaking), you could override the fatalError method itself:

http://java.sun.com/j2se/1.4.2/docs/api/org/xml/sax/helpers/DefaultHandler.html#fatalError(org.xml.sax.SAXParseException)

Thanks. - m [ September 30, 2004: Message edited by: Madhav Lakkapragada ]
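For readers who want to try the skip-the-BOM approach, here is a minimal sketch of one way to do it. It is not the poster's getBOMEncoding method (its source is not shown in the thread); the class name BomSkipper and the method skipUtf8Bom are hypothetical, and the sketch only handles the UTF-8 signature EF BB BF.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper, not part of Xerces or JAXP.
final class BomSkipper {
    private static final int BOM_LENGTH = 3; // UTF-8 BOM is EF BB BF

    /** Returns a stream positioned after a leading UTF-8 BOM, if one is present. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, BOM_LENGTH);
        byte[] head = new byte[BOM_LENGTH];

        // read() may return fewer bytes than requested, so loop until we
        // have three bytes or hit end of stream.
        int n = 0;
        while (n < head.length) {
            int r = pin.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }

        boolean isBom = n == BOM_LENGTH
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;

        // Not a BOM: push the bytes back so the caller sees the stream unchanged.
        if (!isBom && n > 0) {
            pin.unread(head, 0, n);
        }
        return pin;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        InputStream s = skipUtf8Bom(new ByteArrayInputStream(data));
        System.out.println((char) s.read()); // first byte after the BOM
    }
}
```

Assuming the variables from the original post, the returned stream could then be wrapped as before, e.g. new InputSource(new InputStreamReader(BomSkipper.skipUtf8Bom(bis), encoding)).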
|
 |
Kelly Dolan
Ranch Hand
Joined: Jan 08, 2002
Posts: 103
|
|
Thanks for the reply. I looked at the XML spec (http://www.w3.org/TR/2004/REC-xml-20040204/) and what it says about well-formed XML documents, and you are correct when you say a document must start with the prolog. Unfortunately, I can no longer find the original web page that I was reading about BOMs, Unicode and XML files.

However, what do you make of Appendix F of the XML spec? This section talks about auto-detection of encoding and mentions BOMs. Would I be correct in saying that the XML working group recognizes the use of BOMs, and that they would precede the prolog, but that supporting them is not a required feature, which is most likely why the parsers I've been playing with don't take them into account?
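To make the Appendix F idea concrete, here is a rough sketch of BOM-based encoding detection. It covers only the cases where a BOM is present, and it is an illustration under stated assumptions rather than what any particular parser does: BomSniffer and encodingFromBom are hypothetical names, and the UCS-4 signatures from Appendix F are mapped to the Java charset names UTF-32BE and UTF-32LE as an assumption.

```java
// Hypothetical helper illustrating BOM-based detection in the spirit of
// Appendix F of the XML 1.0 spec; not taken from any parser's source.
final class BomSniffer {

    /** Returns an encoding name guessed from a leading BOM, or null if none is found. */
    static String encodingFromBom(byte[] head) {
        // Four-byte signatures must be tested before the two-byte ones,
        // because FF FE 00 00 also begins with the UTF-16LE signature FF FE.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null; // no BOM: fall back to other detection or a default
    }

    private static boolean startsWith(byte[] data, int... signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] head = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<'};
        System.out.println(encodingFromBom(head));
    }
}
```

Note that this kind of detection has to happen on raw bytes. Once the bytes have been decoded by a Reader, the signature is no longer available as bytes, which may be relevant to the behaviour described in the original post.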
|
 |
Kelly Dolan
Ranch Hand
Joined: Jan 08, 2002
Posts: 103
|
|
I just learned that the following works. If I simply pass the File object into the parse() method (vs. going through stream and reader objects so that I could specify the encoding), it recognizes the BOM and successfully parses the document.

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setValidating(validate);
    DocumentBuilder builder = factory.newDocumentBuilder();
    result = builder.parse(input);
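A likely explanation, offered as an assumption rather than something confirmed in this thread: when the parser is handed a File or a raw byte stream, it performs its own encoding detection and can consume the BOM, whereas a Reader delivers already-decoded characters, so the BOM arrives as the character U+FEFF ahead of the XML declaration. The sketch below parses a BOM-prefixed document from a byte-stream InputSource using whatever JAXP parser is the default; ByteStreamParseDemo and its methods are hypothetical names, and behaviour on older parser versions may differ.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses only standard JAXP and SAX APIs.
final class ByteStreamParseDemo {

    /** Encodes the XML text as UTF-8 and prepends the UTF-8 BOM (EF BB BF). */
    static byte[] bomPrefixed(String xml) {
        byte[] body = xml.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[body.length + 3];
        out[0] = (byte) 0xEF;
        out[1] = (byte) 0xBB;
        out[2] = (byte) 0xBF;
        System.arraycopy(body, 0, out, 3, body.length);
        return out;
    }

    /** Parses from a byte stream, leaving encoding detection to the parser. */
    static Document parseBytes(byte[] data) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // InputSource(InputStream) sets a byte stream, not a character stream.
        return builder.parse(new InputSource(new ByteArrayInputStream(data)));
    }

    public static void main(String[] args) throws Exception {
        byte[] data = bomPrefixed("<?xml version=\"1.0\" encoding=\"UTF-8\"?><root/>");
        Document doc = parseBytes(data);
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

If a system ID is still needed for resolving relative references, InputSource.setSystemId can be called on the byte-stream InputSource just as in the original post.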
|
 |
Madhav Lakkapragada
Ranch Hand
Joined: Jun 03, 2000
Posts: 5040
|
|
Glad you mentioned it. Thanks for the tip on Appendix F; I had never looked closely at the appendices before. I learnt something new today. - m
|