XML Validation

Suresh Kanagalingam
Ranch Hand

Joined: Aug 17, 2001
Posts: 82
Hello,

I am trying to validate an XML file using Apache xerces-2_7_1. The encoding declared in the XML file is UTF-8. When the file contains French characters, I get an "Invalid byte 2 of 2-byte UTF-8 sequence" error message. If I change the encoding to "ISO-8859-1", validation works fine, but the customer wants to use UTF-8.

When I tested the same file with XMLSpy, it validated fine with UTF-8 encoding.

Can anyone tell me what I can do or what the cause is?

Here is a snippet of the file:
=====================================
<?xml version="1.0" encoding="UTF-8"?>
<Submission xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="layout.xsd">
=====================================

Thanks
Suresh
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541
    
Putting <?xml version="1.0" encoding="UTF-8"?> at the start of your file says it's encoded in UTF-8, but that doesn't actually cause it to be encoded in UTF-8. The process that creates the file has to write the file in that encoding. If it produces some other encoding, it should specify that encoding in the prolog. That isn't happening in your case.
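One way to check whether a file's bytes really are UTF-8, independent of what the prolog claims, is to run them through a strict decoder. This is only a minimal sketch: the class name `Utf8Check` and the sample byte arrays are made up for illustration, and it uses `StandardCharsets`, which needs Java 7 or later.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            // REPORT makes the decoder throw instead of silently substituting U+FFFD.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A single 0xC9 byte (ISO-8859-1 style) followed by '<': 0xC9 announces a
        // two-byte UTF-8 sequence, but '<' is not a valid second byte.
        byte[] latin1Style = {(byte) 0xC9, '<'};
        // The proper two-byte UTF-8 form of the same character, followed by '<'.
        byte[] utf8Style = {(byte) 0xC3, (byte) 0x89, '<'};

        System.out.println(isValidUtf8(latin1Style)); // false
        System.out.println(isValidUtf8(utf8Style));   // true
    }
}
```

To test a real file you would read its bytes (for example with `java.nio.file.Files.readAllBytes`) and pass them to `isValidUtf8`.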
Suresh Kanagalingam
Ranch Hand

Joined: Aug 17, 2001
Posts: 82
Hi Paul,

Thanks for your quick reply. The character it is complaining about is "É" (I checked the ASCII value for this and it is 201). XMLSpy also validates this character correctly.

Any thoughts?

Thanks
Suresh
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541
    
Any thoughts beyond the thoughts I already posted? No. What about you? Have you reviewed the process that produces that file?
Suresh Kanagalingam
Ranch Hand

Joined: Aug 17, 2001
Posts: 82
Hi Paul,

I checked the program to make sure it is writing the standard character set to the file. I even used TextPad to type French characters using the TextPad "ANSI Character" listing.

Can you please confirm that for letter "É" to be validated with UTF-8, it has to have hex value of 201?

Thanks
Suresh
Reid M. Pinchback
Ranch Hand

Joined: Jan 25, 2002
Posts: 775
Yes and no. It is Unicode decimal 201 according to:

this table

but the UTF-8 version is two bytes (hex C3 89).

This is something people continually get wrong
with XML. How the file is written and read
matters. The first line of the XML file
containing the declaration/version/charset is
strictly ASCII. I don't remember the exact
spec wording, but basically you are limited
to the 7-bit hunk of ASCII. All bytes after
that first line are strictly in the desired
character set. That means if you are dealing
with richer character sets you have to create
a file in the way required by the spec, or
things will break.

I suspect that right now what you may have is
a file with a single byte for acute capital E,
either an extended-ASCII code page value (131
in Mac Roman, for example) or the Unicode /
ISO-8859-1 value (201).
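To make the one-byte versus two-byte difference concrete, here is a small sketch that prints the bytes Java produces for acute capital E in each encoding. The class and method names are just for illustration.

```java
import java.nio.charset.StandardCharsets;

public class EAcuteBytes {
    /** Formats bytes as space-separated upper-case hex, e.g. "C3 89". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String eAcute = "\u00C9"; // U+00C9, decimal 201

        System.out.println(hex(eAcute.getBytes(StandardCharsets.ISO_8859_1))); // C9
        System.out.println(hex(eAcute.getBytes(StandardCharsets.UTF_8)));      // C3 89
    }
}
```

A file that declares UTF-8 but contains the lone C9 byte is what a strict parser rejects.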


Reid - SCJP2 (April 2002)
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541
    
    8

Originally posted by Suresh Kanagalingam:
Hi Paul,

I checked the program to make sure it is writing the standard character set to the file. I even used TextPad to type French characters using the TextPad "ANSI Character" listing.

Can you please confirm that for letter "É" to be validated with UTF-8, it has to have hex value of 201?

Thanks
Suresh


To reiterate what Reid said, if you're seeing a single byte with the value 201 (hex C9) for that character in your file, then the file isn't encoded in UTF-8. And if you used the "standard character set" to write to the file, that almost certainly wouldn't be UTF-8 anyway.

The easiest way to get your XML encoding right in Java is to use the standard XML software (whatever's built in to your JRE, or Xerces or Xalan or Saxon or some other open-source product) and to provide an output stream (not a Writer) for it to write to. The software will take care of the encoding.
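As a rough sketch of that approach using the JAXP classes built in to the JRE: the document is built in memory and a `Transformer` serializes it to an `OutputStream`, so the serializer does the byte encoding itself. The class name, the `sample()` helper and the text content are assumptions for illustration; only the `Submission` element name comes from the file in the question.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class WriteXmlToStream {
    /** Serializes the document to the stream; the serializer encodes the bytes. */
    static void write(Document doc, OutputStream out) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        // Passing an OutputStream (not a Writer) lets the serializer pick the
        // byte encoding that matches the declaration it writes.
        t.transform(new DOMSource(doc), new StreamResult(out));
    }

    /** Builds a tiny illustrative document containing a French character. */
    static Document sample() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("Submission");
        root.setTextContent("\u00C9cole");
        doc.appendChild(root);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        write(sample(), out);
        System.out.println(out.size() + " bytes written");
    }
}
```

In real use you would pass a `FileOutputStream` instead of the `ByteArrayOutputStream` used here to keep the example self-contained.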

Or if you're writing XML to a file with your own ad-hoc code, then encode it in UTF-8 like this:
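A minimal sketch of what such ad-hoc code could look like (not the code from the original post): wrap the output stream in an `OutputStreamWriter` with an explicit UTF-8 charset rather than relying on the platform default. The class name and the XML content are assumptions for illustration, and the try-with-resources syntax needs Java 7 or later.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class HandWrittenXml {
    /** Writes a small XML document, encoding every character as UTF-8. */
    static void write(OutputStream out) throws IOException {
        // The explicit charset is the important part: a FileWriter or an
        // OutputStreamWriter without a charset uses the platform default.
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            w.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            w.write("<Submission>\u00C9cole</Submission>\n");
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        write(out);
        System.out.println(out.size() + " bytes written");
    }
}
```

For a real file, pass `new FileOutputStream(...)` with your own path to `write`.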
Reid M. Pinchback
Ranch Hand

Joined: Jan 25, 2002
Posts: 775
And although not an issue for UTF-8, for any character set that doesn't
include 7-bit ASCII as a single-byte subset, you have to deal with
both encodings, not just a single encoding as shown above. First you
output the first line in the required encoding, then everything else
in the other encoding. Not something I've had to do, but I suspect it
comes up with Asian character sets, maybe UTF-16?

Like Paul said, doing something in a tool that understands this,
like serializing DOM, is generally just much safer.

[ January 12, 2006: Message edited by: Reid M. Pinchback ]
 
 