Encoding Type Of XML Document

greg philpott
Ranch Hand

Joined: Nov 10, 2000
Posts: 73
<?xml version="1.0" encoding="UTF-8"?>
I have been working with XML and have been quite happy being blissfully ignorant and sticking with the encoding type above, thinking that it was unicode and was what Java used and so it was ok.
However, when I came across a document with the French letter è (\u00E8), the parser threw an Exception because of the encoding type.
So I changed the encoding to "ISO-8859-1" and it works.
1. What good resources are there on the topic of XML encoding ?
2. Why doesn't UTF-8 contain this letter ?
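
One possible explanation for the symptom above, offered only as a sketch: UTF-8 does contain è, but the file may have been saved as ISO-8859-1 while its declaration said UTF-8. The class and method names below are hypothetical, and the "saved as ISO-8859-1" part is an assumption about the original file, not something stated in the post. The example encodes the same text as ISO-8859-1 bytes twice, once declared as UTF-8 and once declared as ISO-8859-1, and feeds both to the JDK's built-in parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class EncodingMismatchDemo {

    // Builds a tiny document with the given declared encoding, then encodes
    // it as ISO-8859-1 bytes (the assumed real encoding of the file on disk).
    static byte[] latin1Bytes(String declaredEncoding) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + declaredEncoding + "\"?>"
                + "<word>tr\u00E8s</word>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Returns the root element's text, or null if the parser rejects the input.
    static String parse(byte[] bytes) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(bytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Declared UTF-8, but the lone byte 0xE8 is not valid UTF-8: rejected.
        System.out.println(parse(latin1Bytes("UTF-8")) == null);
        // Declaration matches the actual bytes: parses normally.
        System.out.println("tr\u00E8s".equals(parse(latin1Bytes("ISO-8859-1"))));
    }
}
```

If this is what happened, changing the declaration to ISO-8859-1 worked because it made the declaration match the bytes, not because UTF-8 lacks the letter.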
Mapraputa Is
Leverager of our synergies
Sheriff

Joined: Aug 26, 2000
Posts: 10065
Greg, maybe this will help http://www.w3schools.com/xml/xml_encoding.asp


Uncontrolled vocabularies
"I try my best to make *all* my posts nice, even when I feel upset" -- Philippe Maquet
Kevin Williams
Greenhorn

Joined: Jan 03, 2001
Posts: 16
Greg,
This one threw me the first time I ran into it too. UTF-8, as it turns out, only covers the lower 128 of the ASCII character set - in other words, the section of the 256-character ASCII set you're probably more familiar with that contains all the accented, umlauted, and so on characters is not part of UTF-8. The two encodings that every parser is obligated to recognize are UTF-8 and UTF-16 (Unicode), so if you want to use UTF-16 you should be all right.
BTW, somebody correct me if I'm misspeaking here - it's been a while since I've done any international stuff...
- Kevin
------------------
Kevin Williams
Senior System Architect, Equient Corporation
author of: Professional XML Databases
Ajith Kallambella
Sheriff

Joined: Mar 17, 2000
Posts: 5782
AFAIK, XML can only contain ASCII. XSL, on the other hand, can specify an encoding scheme to be used for the output file.

------------------
Ajith Kallambella M.
Sun Certified Programmer for the Java2 Platform.


Open Group Certified Distinguished IT Architect. Open Group Certified Master IT Architect. Sun Certified Architect (SCEA).
greg philpott
Ranch Hand

Joined: Nov 10, 2000
Posts: 73
Only ASCII???
Are you sure?
What about this declaration: <?xml version="1.0" encoding="UTF-8"?>
Surely you can specify different encoding types here.
Am I wrong, or are you misinforming me, Ajith?
christine goodwin
Greenhorn

Joined: Jan 25, 2001
Posts: 2
greg,
check out this link below: http://www.unicode.org/
xml parsers support UTF-8, which can encode the characters of almost every character set that has been defined. however, just because a parser supports UTF-8 doesn't mean it knows what encoding the data it is given actually uses. in other words, while it can handle UTF-8, it does not convert incoming data from some other encoding to UTF-8 on the fly.
so, if you are parsing an xml file, the parser must be made aware of the character encoding of the data.
if you are using the latest versions of the parsers that are out there (sun and ibm) they have convenience methods for checking, getting, and setting the character encoding. using these methods, you can inform the parser of the character encoding of the data and it will parse without error.
otherwise, you must set the character encoding of the data in your application.
so, if you were in a java environment and you were reading in data from a web client via a JSP, you could grab the encoding from the request header and use that to set the character encoding of the incoming data that you would be passing to the parser.
or, you could have a method server side that set all data to a default such as UTF-8.
-christine
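
A minimal sketch of the "set the character encoding in your application" approach described above, using the standard JAXP and SAX classes. The class name, method name, and sample document are hypothetical. The idea is that when the application already knows the charset (for instance from a request header), it can decode the bytes itself and hand the parser a character stream, so the parser never has to guess.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ExplicitEncodingDemo {

    // Decodes the bytes with the charset the application already knows,
    // then parses the resulting characters. Returns the root element's text.
    static String parseAs(byte[] bytes, Charset charset) {
        try {
            Reader reader =
                    new InputStreamReader(new ByteArrayInputStream(bytes), charset);
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(reader));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        // No encoding declaration at all; the bytes are ISO-8859-1.
        byte[] bytes = "<?xml version=\"1.0\"?><word>tr\u00E8s</word>"
                .getBytes(StandardCharsets.ISO_8859_1);
        String text = parseAs(bytes, StandardCharsets.ISO_8859_1);
        System.out.println("tr\u00E8s".equals(text));
    }
}
```

For byte streams, `InputSource.setEncoding` serves a similar purpose. The reader-based form is used here because the SAX documentation states that a character stream is read directly.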
Originally posted by greg philpott:
<?xml version="1.0" encoding="UTF-8"?>
I have been working with XML and have been quite happy being blissfully ignorant and sticking with the encoding type above, thinking that it was unicode and was what Java used and so it was ok.
However, when I came across a document with the French letter è (\u00E8), the parser threw an Exception because of the encoding type.
So I changed the encoding to "ISO-8859-1" and it works.
1. What good resources are there on the topic of XML encoding ?
2. Why doesn't UTF-8 contain this letter ?

Junaid Bhatra
Ranch Hand

Joined: Jun 27, 2000
Posts: 213
Kevin,
As far as I remember, UTF-8 is a variable-width encoding, and it supports all character sets. Depending on the character, it can use 8, 16 or even 24 bits, so it covers almost all of the character sets in the world.
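
The variable-width point can be checked with a few lines of Java. This is a small illustration with a hypothetical class name, not part of the original discussion. It counts the bytes UTF-8 uses for an ASCII letter, for the è from the original question, and for the euro sign.

```java
import java.nio.charset.StandardCharsets;

class Utf8WidthDemo {

    // Number of bytes UTF-8 needs to encode the given text.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("e"));       // 1: plain ASCII letter
        System.out.println(utf8Length("\u00E8"));  // 2: e with grave accent
        System.out.println(utf8Length("\u20AC"));  // 3: euro sign
        // The same accented letter is a single byte (0xE8) in ISO-8859-1,
        // which is why the two encodings are not interchangeable.
        System.out.println("\u00E8".getBytes(StandardCharsets.ISO_8859_1).length); // 1
    }
}
```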
Originally posted by Kevin Williams:
Greg,
This one threw me the first time I ran into it too. UTF-8, as it turns out, only covers the lower 128 of the ASCII character set - in other words, the section of the 256-character ASCII set you're probably more familiar with that contains all the accented, umlauted, and so on characters is not part of UTF-8. The two encodings that every parser is obligated to recognize are UTF-8 and UTF-16 (Unicode), so if you want to use UTF-16 you should be all right.
BTW, somebody correct me if I'm misspeaking here - it's been a while since I've done any international stuff...
- Kevin
