I have an XML file that contains the following declaration, preceded by a BOM (byte order mark) for UTF-8:
(BOM)<?xml version="1.0" encoding="UTF-8"?>...
I need to run this file through an XML parser without modifying the file and am currently using xerces.jar (v2.6.2).
When I attempt the following, I get the exception that follows. If I uncomment the 4th line, the parser succeeds. Basically, the getBOMEncoding(bis) method moves the input stream to the first byte *after* the BOM (i.e., it skips it). My assumption: the parser doesn't recognize, or doesn't like, the existence of the BOM before the XML declaration.
My questions are: Am I doing something wrong? Is there a parser or version that recognizes a BOM (although I really need to use the one I'm using)? Is it documented somewhere that the parser I'm using does not support this?
Unfortunately, I can no longer find the original web page I was reading about BOMs, Unicode and XML files. However, what do you make of Appendix F of the XML spec? That section talks about auto-detection of encoding and mentions BOMs. Would I be correct in saying that the XML working group recognizes the use of BOMs, and that they would precede the prolog, but that supporting them is not a required feature, which is most likely why the parsers I've been playing with don't take them into account?
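For illustration, skipping a UTF-8 BOM before wrapping the stream in a reader could look roughly like the sketch below. This is not the getBOMEncoding(bis) method mentioned above, which is not shown here; the class name BomSkipper and the method skipUtf8Bom are hypothetical, and the sketch only handles the UTF-8 BOM (bytes EF BB BF), not the UTF-16 or UTF-32 variants.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper: illustrative only, not part of Xerces or the JDK.
final class BomSkipper {

    /**
     * Returns a stream positioned just after a leading UTF-8 BOM
     * (EF BB BF) if one is present; otherwise the stream is returned
     * with all of its bytes still available.
     */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];

        // Read up to three bytes; a single read() may return fewer.
        int n = 0;
        while (n < 3) {
            int r = pin.read(head, n, 3 - n);
            if (r < 0) {
                break; // end of stream
            }
            n += r;
        }

        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;

        // No BOM: push back whatever was read so nothing is lost.
        if (!hasBom && n > 0) {
            pin.unread(head, 0, n);
        }
        return pin;
    }
}
```

The returned stream could then be wrapped in an InputStreamReader with an explicit "UTF-8" charset, so the reader never sees the BOM as a stray U+FEFF character ahead of the XML declaration.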
I just learned that the following works. If I simply pass the File object into the parse() method (rather than going through stream and reader objects so that I can specify the encoding), the parser recognizes the BOM and successfully parses the document.
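A minimal sketch of that approach, assuming the standard JAXP API (DocumentBuilderFactory and DocumentBuilder.parse(File)) rather than whatever exact call was used originally; the class name ParseFromFile is hypothetical. Presumably this works because the parser gets the raw bytes and does its own encoding detection, whereas a Reader hands it already-decoded characters with the BOM still in front of the declaration.

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical wrapper: illustrative only.
final class ParseFromFile {

    /**
     * Parses the file directly, leaving encoding detection
     * (including any BOM) to the parser itself.
     */
    static Document parse(File xmlFile) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(xmlFile);
    }
}
```

Which parser implementation JAXP picks up depends on the classpath, so behavior with a specific xerces.jar version should be checked rather than assumed.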