XML Validation

 
Suresh Kanagalingam
Ranch Hand
Posts: 82
Hello,

I am trying to validate an XML file using Apache xerces-2_7_1. The encoding declared in the XML file is UTF-8. When the file contains French characters, I get an "Invalid byte 2 of 2-byte UTF-8 sequence" error message. If I change the encoding to "ISO-8859-1", validation works fine, but the customer wants to use UTF-8.

When I tested the same file with XMLSpy, it validated fine with UTF-8 encoding.

Can anyone tell me what I can do or what the cause is?

Here is a snippet of the file:
=====================================
<?xml version="1.0" encoding="UTF-8"?>
<Submission xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="layout.xsd">
=====================================

Thanks
Suresh
 
Paul Clapham
Sheriff
Posts: 20185
Putting <?xml version="1.0" encoding="UTF-8"?> at the start of your file says it's encoded in UTF-8, but that doesn't actually cause it to be encoded in UTF-8. The process that creates the file has to write the file in that encoding. If it produces some other encoding, it should specify that encoding in the prolog. That isn't happening in your case.
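To make the mismatch concrete, here is a minimal sketch that feeds the same document to the JDK's built-in parser twice: once as real UTF-8 bytes and once as ISO-8859-1 bytes under a UTF-8 declaration. The class name, the sample document, and the `parses` helper are assumptions made up for illustration; they are not from the original poster's code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

public class DeclaredVsActualEncoding {
    // Hypothetical sample document; the prolog claims UTF-8.
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?><Submission>\u00C9</Submission>";

    // Returns true if the JDK's built-in parser accepts the given bytes.
    static boolean parses(byte[] bytes) {
        try {
            DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
            return true;
        } catch (SAXException | IOException e) {
            // Malformed byte sequences surface as one of these two types.
            return false;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Bytes really are UTF-8, matching the declaration.
        System.out.println(parses(XML.getBytes(StandardCharsets.UTF_8)));
        // Bytes are ISO-8859-1, but the declaration still says UTF-8.
        System.out.println(parses(XML.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

The second call should fail because the single ISO-8859-1 byte for É looks like the start of a two-byte UTF-8 sequence, and the byte that follows it is not a valid continuation byte.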
 
Suresh Kanagalingam
Ranch Hand
Posts: 82
Hi Paul,

Thanks for your quick reply. The character it is complaining about is "É" (I checked the ASCII value for this and it is 201). Also, XMLSpy validates this character correctly.

Any thoughts?

Thanks
Suresh
 
Paul Clapham
Sheriff
Posts: 20185
Any thoughts beyond the thoughts I already posted? No. What about you? Have you reviewed the process that produces that file?
 
Suresh Kanagalingam
Ranch Hand
Posts: 82
Hi Paul,

I checked the program to make sure it is writing the standard character set to the file. I even used TextPad to type French characters using the TextPad "ANSI Character" listing.

Can you please confirm that for letter "É" to be validated with UTF-8, it has to have hex value of 201?

Thanks
Suresh
 
Reid M. Pinchback
Ranch Hand
Posts: 775
Yes and no. It is Unicode decimal 201 according to this table,
but the UTF-8 version is two bytes.

This is something people continually get wrong
with XML. How the file is written and read
matters. The first line of the XML file
containing the declaration/version/charset is
strictly ASCII. I don't remember the exact
spec wording, but basically you are limited
to the 7-bit hunk of ASCII. All bytes after
that first line are strictly in the desired
character set. That means if you are dealing
with richer character sets you have to create
a file in the way required by the spec, or
things will break.

I suspect that right now what you may have is
a file with a single byte for either the
ASCII (131) or Unicode (201) encoding of
acute capital E.
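
A quick way to see the difference is to print the bytes Java produces for capital E acute (U+00C9, decimal 201) in each encoding. This is only an illustrative sketch; the class and helper names are made up.

```java
import java.nio.charset.StandardCharsets;

public class EAcuteBytes {
    // Formats a byte array as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String eAcute = "\u00C9"; // capital E acute, code point 201

        // One byte in ISO-8859-1: prints "C9" (decimal 201).
        System.out.println(hex(eAcute.getBytes(StandardCharsets.ISO_8859_1)));

        // Two bytes in UTF-8: prints "C3 89".
        System.out.println(hex(eAcute.getBytes(StandardCharsets.UTF_8)));
    }
}
```

So a lone byte with value 201 in the file means the file is not UTF-8, whatever the declaration says.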
 
Paul Clapham
Sheriff
Posts: 20185
Originally posted by Suresh Kanagalingam:
Hi Paul,

I checked the program to make sure it is writing standard character set to the file. I even used TextPad to type French characters using TextPad "ANSI Character" listing.

Can you please confirm that for letter "É" to be validated with UTF-8, it has to have hex value of 201?

Thanks
Suresh


To reiterate what Reid said, if you're seeing a single byte with the value 201 (hex C9) for that character in your file, then the file isn't encoded in UTF-8. And if you used the "standard character set" to write to the file, that almost certainly wouldn't be UTF-8 anyway.

The easiest way to get your XML encoding right in Java is to use the standard XML software (whatever's built in to your JRE, or Xerces or Xalan or Saxon or some other open-source product) and to provide an output stream (not a Writer) for it to write to. The software will take care of the encoding.
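
As a minimal sketch of that first approach, assuming the JAXP classes bundled with the JRE, the following builds a tiny DOM and lets a Transformer serialize it to an OutputStream. The class name, method name, and `Submission` content are illustrative only.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class SerializeToStream {
    // Builds a one-element document and serializes it to the given stream.
    static void write(String text, OutputStream out) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .newDocument();
        Element root = doc.createElement("Submission");
        root.setTextContent(text);
        doc.appendChild(root);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // The serializer writes the declaration and encodes the bytes to match.
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        // StreamResult wraps a byte stream, so no Writer picks a charset for us.
        transformer.transform(new DOMSource(doc), new StreamResult(out));
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        write("\u00C9", buffer);
        // Decode as UTF-8 only to display what was written.
        System.out.println(new String(buffer.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

Because the serializer is handed raw bytes to fill, the declared encoding and the actual encoding cannot drift apart.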

Or if you're writing XML to a file with your own ad-hoc code, then encode it in UTF-8 like this:
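
A minimal sketch of that approach (not necessarily the code from the original post): wrap the output stream in an OutputStreamWriter with an explicit UTF-8 charset instead of relying on the platform default. The class name, method name, and the `submission.xml` file name are assumptions for illustration.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class WriteUtf8Xml {
    // Writes a tiny document, encoding every character as UTF-8.
    // Note: text is written as-is; real code must escape &, < and >.
    static void writeXml(OutputStream out, String text) throws IOException {
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            writer.write("<Submission>" + text + "</Submission>\n");
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical output file name.
        try (OutputStream out = new FileOutputStream("submission.xml")) {
            writeXml(out, "\u00C9");
        }
    }
}
```

The key point is the explicit charset argument: a plain FileWriter would use the platform default encoding, which on many Windows systems is a single-byte code page.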
 
Reid M. Pinchback
Ranch Hand
Posts: 775
And although not an issue for UTF-8, for any character set that doesn't
include 7-bit ascii as a single-byte subset, you have to deal with
both encodings, not just a single coding as shown above. First you
output the first line in the required encoding, then everything else
in the other encoding. Not something I've had to do, but suspect it
comes up with Asian character sets, maybe UTF-16?

Like Paul said, doing something in a tool that understands this,
like serializing DOM, is generally just much safer.

[ January 12, 2006: Message edited by: Reid M. Pinchback ]
 