Why UTF-16 is needed for XML?

ankur rathi
Ranch Hand

Joined: Oct 11, 2004
Posts: 3830
Almost all the XML files I have seen start with this line:

<?xml version='1.0' encoding='UTF-16'?>

And as far as I know, UTF-16 is the most memory-consuming encoding scheme because it supports the largest number of languages.

But we usually have only English characters in XML files (localized characters live in properties files), so why do we store XML files in UTF-16?

Thanks.
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541

You don't have to use UTF-16 to store XML. You can use any encoding at all, provided the parser can deal with it. It is extremely common to use UTF-8. So your question is asked in the wrong place. If you want to know why a certain XML file was encoded in UTF-16, you should ask the person who chose that encoding.
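To illustrate the point about the parser, here is a minimal sketch, assuming the standard JAXP parser that ships with the JDK. The class name, method name, and sample document are made up for the example. It encodes the same small document in three different ways and lets the parser work out the encoding from the XML declaration (and the byte order mark, in the UTF-16 case).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class AnyEncodingDemo {

    // Builds a tiny document declaring the given encoding, turns it into
    // bytes using the matching charset, and parses it from the raw bytes.
    // The parser is given no hint other than the bytes themselves.
    static String parseText(String encodingName, Charset charset) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"" + encodingName + "\"?>"
                + "<word>caf\u00E9</word>";
        byte[] bytes = xml.getBytes(charset);
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // StandardCharsets.UTF_16 writes a byte order mark when encoding,
        // which the parser uses to detect the byte order.
        System.out.println(parseText("UTF-8", StandardCharsets.UTF_8));
        System.out.println(parseText("UTF-16", StandardCharsets.UTF_16));
        System.out.println(parseText("ISO-8859-1", StandardCharsets.ISO_8859_1));
    }
}
```

All three calls should hand back the same text, because the encoding only affects how the characters are stored as bytes, not what the document contains.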
ankur rathi
Ranch Hand

Joined: Oct 11, 2004
Posts: 3830
Originally posted by Paul Clapham:
You don't have to use UTF-16 to store XML. You can use any encoding at all, provided the parser can deal with it. It is extremely common to use UTF-8. So your question is asked in the wrong place. If you want to know why a certain XML file was encoded in UTF-16, you should ask the person who chose that encoding.


Okay, but why even UTF-8?
Why not use ASCII if the XML file contains only English characters?

Thanks.
Dave Lenton
Ranch Hand

Joined: Jan 20, 2005
Posts: 1241
Originally posted by rathi ji:
Why not use ASCII if the XML file contains only English characters?
The trouble with limiting yourself to a small character set is that it may not be flexible enough if you need to add extra characters in the future. OK, so you may be able to just change the encoding declared at the top of the XML file, but this may have knock-on effects on other parts of the system. It may be better to start off with something reasonable like UTF-8 or UTF-16, and make the system work with that.

I guess it depends on what you're storing in your XML file, and how likely that kind of data is to change in the future.


There will be glitches in my transition from being a saloon bar sage to a world statesman. - Tony Banks
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541
    
    8

Originally posted by rathi ji:
Okay, but why even UTF-8?
Why not use ASCII if the XML file contains only English characters?
ASCII is a subset of UTF-8. So if your file really contains only unaccented Latin letters, it's going to look identical whether it's encoded in ASCII or UTF-8. (Except for the prolog where you declare the encoding, of course.)

And as soon as somebody uses an accented letter in their data, the code that writes the ASCII version has to know to change it to a numeric character reference (such as &#233; for é) in the output. The standard Java classes do know this, of course, but many people don't use the built-in classes and prefer to write their own code that may not know it.
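As a rough sketch of what "the standard Java classes do know this" looks like in practice, the snippet below runs a document through the JDK's identity Transformer with the output encoding set to US-ASCII. The class name, method name, and sample document are invented for the example, and the exact form of the output (declaration attributes, decimal versus hexadecimal reference) is left to the serializer.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class AsciiXmlWriter {

    // Copies the input document to a byte array, asking the serializer to
    // produce US-ASCII. Characters outside ASCII cannot be written as raw
    // bytes, so the serializer has to replace them with character references.
    static byte[] writeAsAscii(String xml) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        identity.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = writeAsAscii("<name>Andr\u00E9</name>");
        // Every byte is plain ASCII; the accented letter shows up as a
        // numeric character reference instead of a raw character.
        System.out.println(new String(bytes, StandardCharsets.US_ASCII));
    }
}
```

Hand-written code that simply does something like `text.getBytes("US-ASCII")` would silently turn the é into a question mark instead, which is the kind of mistake being warned about.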

Basically UTF-8 can represent any character at all, including ASCII characters, and it doesn't cost anything extra to use it for ASCII characters. So it just makes sense to use UTF-8. (Or UTF-16 if your data contains a large percentage of CJK characters.)
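The size claims above are easy to check with a few lines of Java. This is only a sketch with made-up sample strings. It uses UTF_16BE rather than UTF_16 so that the two-byte byte order mark does not muddy the counts.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingSizes {

    // Number of bytes needed to store the text in the given encoding.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String english = "<name>Granny</name>";      // 19 ASCII characters
        String cjk = "\u65E5\u672C\u8A9E";           // three CJK characters

        // For pure ASCII text the ASCII and UTF-8 bytes are identical.
        System.out.println(Arrays.equals(
                english.getBytes(StandardCharsets.US_ASCII),
                english.getBytes(StandardCharsets.UTF_8)));              // true

        // ASCII text: one byte per character in UTF-8, two in UTF-16.
        System.out.println(byteLength(english, StandardCharsets.UTF_8));    // 19
        System.out.println(byteLength(english, StandardCharsets.UTF_16BE)); // 38

        // These CJK characters: three bytes each in UTF-8, two in UTF-16.
        System.out.println(byteLength(cjk, StandardCharsets.UTF_8));        // 9
        System.out.println(byteLength(cjk, StandardCharsets.UTF_16BE));     // 6
    }
}
```

So UTF-16 doubles the size of mostly-English XML, while it can come out smaller than UTF-8 when the content is mostly CJK text.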
 