Why UTF-16 is needed for XML?

ankur rathi
Ranch Hand

Joined: Oct 11, 2004
Posts: 3830
Almost all the XML files I have seen start with this line:

<?xml version='1.0' encoding='UTF-16'?>

And as far as I know, UTF-16 is the most memory-consuming encoding scheme because it supports the largest number of languages.

But we usually have only English characters in XML files (localized text lives in properties files), so why do we store XML files in UTF-16?

Thanks.
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18675

You don't have to use UTF-16 to store XML. You can use any encoding at all, provided the parser can deal with it. It is extremely common to use UTF-8. So your question is asked in the wrong place. If you want to know why a certain XML file was encoded in UTF-16, you should ask the person who chose that encoding.
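
As a rough illustration of the point that the encoding is the author's choice, the following Java sketch serializes one small document in three encodings and parses each with the JDK's built-in DOM parser. The class name EncodingChoice, the helper parseRootText, and the sample element are made up for this example. It only assumes the parser in use supports the named encodings (conforming XML parsers are required to accept at least UTF-8 and UTF-16).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class EncodingChoice {
    // Hypothetical helper: encodes a tiny document in the named encoding,
    // parses the resulting bytes, and returns the text of the root element.
    static String parseRootText(String encodingName) {
        String xml = "<?xml version='1.0' encoding='" + encodingName + "'?>"
                + "<greeting>Hello</greeting>";
        byte[] bytes = xml.getBytes(Charset.forName(encodingName));
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Parse failed for " + encodingName, e);
        }
    }

    public static void main(String[] args) {
        // The declared encoding differs each time; the parsed content should not.
        for (String name : new String[] {"US-ASCII", "UTF-8", "UTF-16"}) {
            System.out.println(name + " -> " + parseRootText(name));
        }
    }
}
```

The parser reads the encoding declaration (and any byte order mark) for itself, so the calling code never has to say which encoding the file uses.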
ankur rathi
Ranch Hand

Joined: Oct 11, 2004
Posts: 3830
Originally posted by Paul Clapham:
You don't have to use UTF-16 to store XML. You can use any encoding at all, provided the parser can deal with it. It is extremely common to use UTF-8. So your question is asked in the wrong place. If you want to know why a certain XML file was encoded in UTF-16, you should ask the person who chose that encoding.


Okay, but why even UTF-8?
Why not use ASCII if the XML file contains only English characters?

Thanks.
Dave Lenton
Ranch Hand

Joined: Jan 20, 2005
Posts: 1241
Originally posted by rathi ji:
Why not use ASCII if the XML file contains only English characters?
The trouble with limiting yourself to a small character set is that it leaves no room if you need to add extra characters in the future. Granted, you may be able to just change the encoding declared at the top of the XML file, but that may have knock-on effects on other parts of the system. It may be better to start off with something reasonable like UTF-8 or UTF-16 and make the system work with that.

I guess it depends on what you're storing in your XML file, and how likely that kind of data is to change in the future.


Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18675

Originally posted by rathi ji:
Okay, but why even UTF-8?
Why not use ASCII if the XML file contains only English characters?
ASCII is a subset of UTF-8. So if your file really contains only unaccented Latin letters, it's going to look identical whether it's encoded in ASCII or UTF-8. (Except for the prolog where you declare the encoding, of course.)
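
The subset relationship can be checked directly. This is a minimal Java sketch; the class name AsciiVsUtf8 and the method sameBytes are invented for the illustration, and the sample strings are arbitrary.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiVsUtf8 {
    // Returns true when the text encodes to exactly the same bytes
    // in US-ASCII and in UTF-8.
    static boolean sameBytes(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(ascii, utf8);
    }

    public static void main(String[] args) {
        // Plain ASCII content: the two encodings agree byte for byte.
        System.out.println(sameBytes("<greeting>Hello</greeting>")); // prints true
        // An accented letter (\u00e9 is e-acute) cannot be encoded in
        // US-ASCII, so getBytes substitutes '?' and the byte arrays differ.
        System.out.println(sameBytes("<name>caf\u00e9</name>")); // prints false
    }
}
```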

And as soon as somebody uses an accented letter in their data, the code that writes the ASCII version has to know to change it to a numeric character reference (such as &#233; for é) in the output. The standard Java classes do know this, of course, but many people don't use the built-in classes and prefer to write their own code, which may not know it.
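
Here is a hedged sketch of what "the standard Java classes know this" can look like in practice, using the JAXP Transformer with its output encoding set to US-ASCII. The class name AsciiXmlWriter, the helper serializeAsAscii, and the name element are made up for the example. With the serializer bundled in the JDK I would expect the accented letter to come out as the character reference &#233;, but another TransformerFactory implementation might format the reference differently.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class AsciiXmlWriter {
    // Hypothetical helper: wraps the text in a <name> element and
    // serializes the document with US-ASCII as the output encoding.
    static String serializeAsAscii(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element name = doc.createElement("name");
            name.setTextContent(text);
            doc.appendChild(name);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return new String(out.toByteArray(), StandardCharsets.US_ASCII);
        } catch (Exception e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }

    public static void main(String[] args) {
        // \u00e9 is e-acute, which has no US-ASCII representation.
        System.out.println(serializeAsAscii("caf\u00e9"));
    }
}
```

Hand-rolled code that simply concatenates strings and writes them out as ASCII has no such safety net, which is the risk described above.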

Basically, UTF-8 can represent any character at all, including ASCII characters, and it doesn't cost anything extra to use it for ASCII characters: each one still takes a single byte. So it just makes sense to use UTF-8. (Or UTF-16 if your data contains a large percentage of CJK characters, which typically take three bytes each in UTF-8 but two in UTF-16.)
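
To put numbers on the size trade-off, this small Java sketch counts encoded bytes. The class name EncodedSize and its two methods are invented for the example, and UTF-16BE is used so that no byte order mark is added to the count.

```java
import java.nio.charset.StandardCharsets;

public class EncodedSize {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the text occupies as UTF-16 (big-endian, no BOM).
    static int utf16Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String english = "Hello";
        // Three CJK characters, written as escapes so the source stays ASCII.
        String japanese = "\u65e5\u672c\u8a9e";

        System.out.println(utf8Bytes(english));   // prints 5
        System.out.println(utf16Bytes(english));  // prints 10
        System.out.println(utf8Bytes(japanese));  // prints 9
        System.out.println(utf16Bytes(japanese)); // prints 6
    }
}
```

For English-only XML, UTF-16 doubles the size. For text that is mostly CJK, UTF-16 comes out about a third smaller, although the ASCII markup around the text still costs two bytes per character.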
 