XML parsers, encoding and byte order marks

Kelly Dolan
Ranch Hand

Joined: Jan 08, 2002
Posts: 109
I have an XML file that contains the following declaration, preceded by a BOM (byte order mark) representing UTF-8:

(BOM)<?xml version="1.0" encoding="UTF-8"?>...

I need to run this file through an XML parser without modifying the file and am currently using xerces.jar (v2.6.2).

When I attempt the following, I get the exception shown below. If I uncomment the 4th line (the getBOMEncoding(bis) call), the parser succeeds. Basically, the getBOMEncoding(bis) method advances the input stream to the first byte *after* the BOM (i.e., it skips it). My assumption: the parser doesn't recognize, or doesn't like, the existence of the BOM before the XML declaration.

My questions: Am I doing something wrong? Is there a parser or version that recognizes a BOM (although I really need to use the one I'm using)? Is it documented somewhere that the parser I'm using does not support this?

Any suggestions are welcome!

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(validate);
DocumentBuilder builder = factory.newDocumentBuilder();
BufferedInputStream bis = new BufferedInputStream(new FileInputStream(input), 5);
// getBOMEncoding(bis); // this will skip over the BOM if present; our parsers do not handle the existence of a BOM.
InputSource is = new InputSource(new InputStreamReader(bis, encoding));
is.setSystemId(input.getParentFile().toURL().toString());
result = builder.parse(is);

Exception:

org.xml.sax.SAXParseException: The markup in the document preceding the root element must be well-formed.
at org.apache.xerces.framework.XMLParser.reportError(XMLParser.java:1067)
at org.apache.xerces.framework.XMLDocumentScanner.reportFatalXMLError(XMLDocumentScanner.java:626)
at org.apache.xerces.framework.XMLDocumentScanner$XMLDeclDispatcher.dispatch(XMLDocumentScanner.java:809)
at org.apache.xerces.framework.XMLDocumentScanner.parseSome(XMLDocumentScanner.java:381)
at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:952)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:172)
at scratch.FileOpsNewParser.parseContent(FileOpsNewParser.java:151)
at scratch.FileOpsNewParser.main(FileOpsNewParser.java:386)

(The same exception and stack trace are reported four more times; only the final frame differs, at FileOpsNewParser.java:387, 391, 392 and 396.)
Madhav Lakkapragada
Ranch Hand

Joined: Jun 03, 2000
Posts: 5040

// getBOMEncoding(bis); // this will skip over the BOM if present; our parsers do not handle the existence of a BOM.


I believe this is an in-house method for your ContentHandler. Did I understand that correctly?
Thanks.

- m
[ September 30, 2004: Message edited by: Madhav Lakkapragada ]

Madhav Lakkapragada
Ranch Hand

Joined: Jun 03, 2000
Posts: 5040
Unless I am missing something, your best bet is to do what line 4 is already doing. Having any characters before the prolog violates a well-formedness constraint, hence the fatal exception.
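
For readers who want to try the skip-the-BOM approach, here is a minimal sketch of what such a helper might look like. The source of getBOMEncoding(bis) is not shown in this thread, so the class name BomSkipper and the method skipUtf8Bom are hypothetical and only handle the UTF-8 BOM (bytes EF BB BF). It relies on java.io.PushbackInputStream to put back any bytes that turn out not to be a BOM.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper, not the poster's getBOMEncoding(): UTF-8 BOM only.
final class BomSkipper {
    private BomSkipper() {}

    /**
     * Wraps the stream and consumes a leading UTF-8 BOM (EF BB BF) if present.
     * Any other leading bytes are pushed back so the caller sees them unchanged.
     */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until 3 bytes or EOF
        while (n < 3) {
            int r = pin.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean isBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!isBom && n > 0) {
            pin.unread(head, 0, n); // not a BOM: restore what was read
        }
        return pin;
    }
}
```

The returned stream could then be wrapped in an InputStreamReader with the desired encoding, as in the original snippet, so the reader never sees the BOM bytes.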

Alternatively (academically speaking), you could override the fatalError method itself.

http://java.sun.com/j2se/1.4.2/docs/api/org/xml/sax/helpers/DefaultHandler.html#fatalError(org.xml.sax.SAXParseException)


Thanks.

- m
[ September 30, 2004: Message edited by: Madhav Lakkapragada ]
Kelly Dolan
Ranch Hand

Joined: Jan 08, 2002
Posts: 109
Thanks for the reply.

I looked at the XML spec (http://www.w3.org/TR/2004/REC-xml-20040204/) and what it says about well-formed XML documents, and you are correct when you say a document must start with the prolog.

Unfortunately, I can no longer find the original web page that I was reading about BOMs, Unicode and XML files. However, what do you make of Appendix F of the XML spec? This section talks about auto-detection of encoding and mentions BOMs. Would I be correct in saying that the XML working group recognizes the use of BOMs, and that they would precede the prolog, but that supporting them is not a required feature, which is most likely why the parsers I've been playing with don't take them into account?
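
To make the Appendix F idea concrete, here is a rough sketch of BOM-based encoding detection from the first bytes of a document. It covers only the BOM cases; Appendix F also describes guessing from the bytes of "<?xml" when no BOM is present, which is left out here. The class name XmlEncodingSniffer and the method fromBom are hypothetical, and the returned strings are Java charset names chosen for illustration (Appendix F itself speaks of UCS-4 rather than UTF-32).

```java
// Hypothetical sketch of BOM-only detection in the spirit of XML 1.0 Appendix F.
final class XmlEncodingSniffer {
    private XmlEncodingSniffer() {}

    /** Returns a charset name if the bytes start with a known BOM, otherwise null. */
    static String fromBom(byte[] head) {
        // The 4-byte BOMs are checked first because FF FE 00 00 also starts with FF FE.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null; // no BOM: fall back to the XML declaration or a default
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```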
Kelly Dolan
Ranch Hand

Joined: Jan 08, 2002
Posts: 109
I just learned that the following works. If I simply pass the File object to the parse() method (rather than going through stream and reader objects so that I could specify the encoding), the parser recognizes the BOM and successfully parses the document.

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(validate);
DocumentBuilder builder = factory.newDocumentBuilder();
result = builder.parse(input);
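
A likely explanation for the difference: with a File (or any byte stream) the parser sees the raw bytes and can do its own encoding detection, whereas an InputStreamReader hands it already-decoded characters, and Java's UTF-8 decoder passes the BOM through as the character U+FEFF. Assuming that is the cause, the sketch below shows a variant that keeps the system ID from the first snippet but gives the parser bytes instead of a Reader. The class ByteStreamParseDemo, its methods and the sample document are hypothetical; it has been reasoned through against the parser bundled with a current JDK, not Xerces 2.6.2.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo: let the parser detect the encoding from the byte stream.
final class ByteStreamParseDemo {
    private ByteStreamParseDemo() {}

    static Document parseBytes(InputStream in, String systemId) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        InputSource source = new InputSource(in); // byte stream, not a Reader
        if (systemId != null) {
            source.setSystemId(systemId); // used to resolve relative references
        }
        return builder.parse(source);
    }

    /** Builds a small UTF-8 document preceded by the BOM bytes EF BB BF. */
    static byte[] sampleWithBom() {
        byte[] xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root/>"
                .getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[xml.length + 3];
        out[0] = (byte) 0xEF;
        out[1] = (byte) 0xBB;
        out[2] = (byte) 0xBF;
        System.arraycopy(xml, 0, out, 3, xml.length);
        return out;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseBytes(new ByteArrayInputStream(sampleWithBom()), null);
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

The trade-off is that the caller no longer forces an encoding; the parser relies on the BOM and the XML declaration instead.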
Madhav Lakkapragada
Ranch Hand

Joined: Jun 03, 2000
Posts: 5040
Glad you mentioned it. Thanks for the tip on Appendix F; I had never looked closely at the appendices before. I learnt something new today.

- m