
Encoding Type Of XML Document

 
greg philpott
Ranch Hand
Posts: 73
<?xml version="1.0" encoding="UTF-8"?>
I have been working with XML, blissfully ignorant and sticking with the encoding declaration above, on the assumption that it was Unicode, which is what Java uses, so it would be OK.
However, when I came across a document containing the French letter è (\u00E8), the parser threw an exception because of the encoding.
So I changed the encoding to "ISO-8859-1" and it works.
1. What good resources are there on the topic of XML encoding?
2. Why doesn't UTF-8 contain this letter?
 
Mapraputa Is
Leverager of our synergies
Sheriff
Posts: 10065
Greg, maybe this will help http://www.w3schools.com/xml/xml_encoding.asp
 
Kevin Williams
Greenhorn
Posts: 16
Greg,
This one threw me the first time I ran into it too. UTF-8, as it turns out, only covers the lower 128 of the ASCII character set - in other words, the section of the 256-character ASCII set you're probably more familiar with that contains all the accented, umlauted, and so on characters is not part of UTF-8. The two encodings that every parser is obligated to recognize are UTF-8 and UTF-16 (Unicode), so if you want to use UTF-16 you should be all right.
BTW, somebody correct me if I'm misspeaking here - it's been a while since I've done any international stuff...
- Kevin
------------------
Kevin Williams
Senior System Architect, Equient Corporation
author of: Professional XML Databases
 
Ajith Kallambella
Sheriff
Posts: 5782
AFAIK, XML can only contain ASCII. XSL, on the other hand, can specify an encoding scheme to be used for the output file.

------------------
Ajith Kallambella M.
Sun Certified Programmer for the Java2 Platform.
 
greg philpott
Ranch Hand
Posts: 73
only ASCII ???
Are you sure?
What about this head tag: <?xml version="1.0" encoding="UTF-8"?>
Surely you can specify different encoding types here.
Am I wrong, or are you misinforming me, Ajith?
 
christine goodwin
Greenhorn
Posts: 2
greg,
check out this link below: http://www.unicode.org/
xml parsers support UTF-8, which can encode every unicode character and so covers almost all character sets that are defined. however, just because a parser supports UTF-8 doesn't mean it knows how the data being fed to it is actually encoded. in other words, while it can read UTF-8, it does not convert incoming data from some other encoding to UTF-8 on the fly.
so, if you are parsing an xml file, the parser must be made aware of the character encoding of the data.
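a minimal java sketch of that point, assuming the JAXP parser that ships with the JDK. the class and helper names (DeclaredEncodingCheck, parses, latin1Document) are made up for illustration. the same ISO-8859-1 bytes are parsed twice, once with a declaration that does not match the bytes and once with one that does:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

class DeclaredEncodingCheck {
    // Returns true if the parser accepts the bytes as well-formed XML.
    static boolean parses(byte[] xmlBytes) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return true;
        } catch (SAXException | IOException e) {
            // A byte sequence that is invalid for the declared encoding ends up here.
            return false;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Builds a document whose bytes are always ISO-8859-1,
    // whatever the declaration claims.
    static byte[] latin1Document(String declaredEncoding) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + declaredEncoding + "\"?>"
                + "<word>tr\u00E8s</word>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(parses(latin1Document("UTF-8")));      // false
        System.out.println(parses(latin1Document("ISO-8859-1"))); // true
    }
}
```

the first call fails because the single byte E8 is not a complete UTF-8 sequence. the parser may also print a fatal-error line to stderr before returning.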
if you are using the latest versions of the parsers that are out there (sun and ibm), they have convenience methods for checking, getting, and setting the character encoding. using these methods, you can inform the parser of the character encoding of the data and it will parse without error.
otherwise, you must set the character encoding of the data in your application.
so, if you were in a java environment and you were reading in data from a web client via a JSP, you could grab the encoding from the request header and use that to set the character encoding of the incoming data that you would be passing to the parser.
or, you could have a method server side that converted all data to a default such as UTF-8.
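one way to do the "set it in your application" option is to decode the bytes yourself and hand the parser a character stream. this is only a sketch, assuming the standard JAXP and SAX classes in the JDK. ExplicitEncodingParse and parseRootText are hypothetical names, and the charset is hard-coded here where a servlet or JSP would typically take it from something like request.getCharacterEncoding():

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ExplicitEncodingParse {
    // Decodes the bytes with the given charset, parses the result,
    // and returns the text content of the root element.
    static String parseRootText(byte[] xmlBytes, Charset charset) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // The application, not the parser, turns bytes into characters here.
            Reader reader = new InputStreamReader(
                    new ByteArrayInputStream(xmlBytes), charset);
            Document doc = builder.parse(new InputSource(reader));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a charset name taken from a request header.
        Charset fromHeader = Charset.forName("ISO-8859-1");
        byte[] body = "<word>tr\u00E8s</word>".getBytes(fromHeader);
        System.out.println(parseRootText(body, fromHeader));
    }
}
```

because the parser is given characters rather than bytes, it does no decoding of its own, so the charset passed to the reader has to be the one the data was really written in.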
-christine

 
Junaid Bhatra
Ranch Hand
Posts: 213
Kevin,
As far as I remember, UTF-8 is a variable-width encoding that supports all character sets. Depending on the character, it takes one to four bytes (8 to 32 bits), so it can represent every Unicode character, accented letters included.
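To make the byte widths concrete, here is a small Java sketch that prints how è (U+00E8) is encoded in three charsets. The class and helper names (EGraveBytes, hex) are made up for illustration. It also suggests what most likely happened to Greg: a file saved as ISO-8859-1 stores è as the single byte E8, and that byte on its own is not valid UTF-8, so a parser told to expect UTF-8 rejects it.

```java
import java.nio.charset.StandardCharsets;

class EGraveBytes {
    // Formats a byte array as upper-case hex, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String eGrave = "\u00E8";
        System.out.println("ISO-8859-1: "
                + hex(eGrave.getBytes(StandardCharsets.ISO_8859_1))); // E8
        System.out.println("UTF-8:      "
                + hex(eGrave.getBytes(StandardCharsets.UTF_8)));      // C3A8
        System.out.println("UTF-16BE:   "
                + hex(eGrave.getBytes(StandardCharsets.UTF_16BE)));   // 00E8
    }
}
```

So UTF-8 does contain the letter; it just needs two bytes for it instead of one.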

 