I am trying to validate an XML file using Apache xerces-2_7_1. The encoding I am using in the XML file is UTF-8. When I have French characters in the file, I get an "Invalid byte 2 of 2-byte UTF-8 sequence" error message. If I change the encoding to "ISO-8859-1", validation works fine, but the customer wants to use the UTF-8 encoding.
When I tested the same file with XMLSpy, it validated fine with UTF-8 encoding.
Can anyone tell me what I can do or what the cause is?
Here is the snippet of the code:
=====================================
<?xml version="1.0" encoding="UTF-8"?>
<Submission xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="layout.xsd">
=====================================
Putting <?xml version="1.0" encoding="UTF-8"?> at the start of your file says it's encoded in UTF-8, but that doesn't actually cause it to be encoded in UTF-8. The process that creates the file has to write the file in that encoding. If it produces some other encoding, it should specify that encoding in the prolog. That isn't happening in your case.
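To make the point concrete, here is a minimal sketch (hypothetical document content, not from the thread) showing that the bytes actually written depend entirely on the charset used to encode the string; the `encoding="UTF-8"` claim in the prolog has no influence on them:

```java
public class DeclarationVsBytes {
    public static void main(String[] args) throws Exception {
        // The prolog *claims* UTF-8, but the string itself is just characters.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><a>\u00C9</a>";

        // Encoded as ISO-8859-1, the E-acute becomes a single byte (0xC9).
        // A UTF-8 parser reading these bytes will reject that byte sequence.
        byte[] latin1 = xml.getBytes("ISO-8859-1");

        // Encoded as UTF-8, the same character becomes two bytes (0xC3 0x89),
        // so the UTF-8 output is one byte longer for this document.
        byte[] utf8 = xml.getBytes("UTF-8");

        System.out.println("ISO-8859-1 length: " + latin1.length);
        System.out.println("UTF-8 length:      " + utf8.length);
    }
}
```

Whichever process writes the file must apply the encoding named in the prolog, or a conforming parser such as Xerces will correctly reject the document.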
Thanks for your quick reply. The character it is complaining about is "É" (I checked the character value and it is 201). Also, XMLSpy validates this character correctly.
This is something people continually get wrong with XML. How the file is written and read matters. The first line of the XML file, containing the declaration with the version and charset, is strictly ASCII; I don't remember the exact spec wording, but basically you are limited to the 7-bit range of ASCII. All bytes after that first line must be in the declared character set. That means if you are dealing with richer character sets, you have to create the file the way the spec requires, or things will break.
I suspect that right now what you have is a file with a single byte for either the extended-ASCII (131) or Unicode (201) encoding of the acute capital E (É).
Originally posted by Suresh Kanagalingam: Hi Paul,
I checked the program to make sure it is writing the standard character set to the file. I even used TextPad to type the French characters, using TextPad's "ANSI Character" listing.
Can you please confirm that for the letter "É" to be validated with UTF-8, it has to have a hex value of 201?
To reiterate what Reid said: if you're seeing a single byte with value 201 in your file, then the file isn't encoded in UTF-8. And if you used the "standard character set" to write the file, that almost certainly wouldn't be UTF-8 anyway.
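A quick sketch of what the bytes for É actually look like in each encoding (values you can verify with a hex editor; the example itself is not from the thread):

```java
public class EByteValues {
    public static void main(String[] args) throws Exception {
        String e = "\u00C9"; // É: Unicode code point 201 (0xC9)

        // In ISO-8859-1, É is a single byte, 0xC9 (decimal 201).
        byte[] latin1 = e.getBytes("ISO-8859-1");

        // In UTF-8, É is a two-byte sequence: 0xC3 0x89.
        // A lone 0xC9 byte is what triggers "Invalid byte 2 of 2-byte UTF-8 sequence".
        byte[] utf8 = e.getBytes("UTF-8");

        System.out.printf("ISO-8859-1: %02X%n", latin1[0]);
        System.out.printf("UTF-8:      %02X %02X%n", utf8[0], utf8[1]);
    }
}
```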
The easiest way to get your XML encoding right in Java is to use the standard XML software (whatever's built in to your JRE, or Xerces or Xalan or Saxon or some other open-source product) and to provide an output stream (not a Writer) for it to write to. The software will take care of the encoding.
Or if you're writing XML to a file with your own ad-hoc code, then encode it in UTF-8 like this:
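The code sample that originally followed this post was not preserved in the thread; a minimal sketch of the idea, with a hypothetical file name and content, might look like:

```java
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class WriteUtf8Xml {
    public static void main(String[] args) throws Exception {
        // Wrap the raw byte stream in a Writer that encodes as UTF-8,
        // so the bytes on disk actually match the encoding declared in the prolog.
        Writer out = new OutputStreamWriter(
                new FileOutputStream("submission.xml"), "UTF-8");
        try {
            out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            out.write("<Submission>\u00C9l\u00E8ve</Submission>\n");
        } finally {
            out.close();
        }
    }
}
```

The key point is never to use `FileWriter` directly, since it silently uses the platform default encoding rather than the one declared in the file.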
Reid M. Pinchback
And although it's not an issue for UTF-8, for any character set that doesn't include 7-bit ASCII as a single-byte subset, you have to deal with two encodings, not just the single encoding shown above: first you output the declaration line in the required encoding, then everything else in the other encoding. Not something I've had to do, but I suspect it comes up with Asian character sets, maybe UTF-16?
Like Paul said, doing something in a tool that understands this, like serializing DOM, is generally just much safer.
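For example, serializing a DOM through the standard JAXP `Transformer` while handing it an `OutputStream` (a sketch with hypothetical element content; the serializer picks the bytes and writes a matching declaration itself):

```java
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class SerializeDom {
    static String serialize() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("Submission");
        root.setTextContent("\u00C9l\u00E8ve");
        doc.appendChild(root);

        // Hand the serializer an OutputStream (not a Writer): it encodes
        // the bytes itself and emits an encoding declaration that matches.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        t.transform(new DOMSource(doc), new StreamResult(bos));
        return bos.toString("UTF-8");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize());
    }
}
```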
[ January 12, 2006: Message edited by: Reid M. Pinchback ]