multiple language support in one XML

Yan Zhou
Ranch Hand

Joined: Sep 02, 2003
Posts: 137

Is it possible to support multiple languages in one XML document, e.g., Japanese and Chinese text in the same file? If so, how would I specify the encoding in that XML? Simply UTF-8?

The reason for this question is that I ran into issues dealing with Chinese text in my Java program. I use the latest JAXB as my XML parser, and store the text in a Unicode (UTF-8) PostgreSQL database (latest version).

When I type in Chinese text, JAXB has no problem marshalling it into an XML string with the encoding set to UTF-8, and my code successfully saves the XML text into the database. But when I read it back, the JAXB Unmarshaller gives the error "invalid byte 2 of 3 byte UTF-8 character" on the XML string I just read from the DB.
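
One way this kind of error can arise, offered as an assumption since the post does not show the code: somewhere between marshalling and unmarshalling, the text is turned into bytes with a non-UTF-8 charset (for example the platform default) while the XML declaration still says UTF-8. The sketch below uses the JDK's built-in JAXP DOM parser rather than JAXB so that it stays self-contained. The class and method names are hypothetical, and the GB2312 bytes are hard-coded so the example does not depend on that charset being installed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

public class EncodingMismatchDemo {
    // The declaration always claims UTF-8, whatever bytes follow it.
    private static final String HEAD = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><msg>";
    private static final String TAIL = "</msg>";

    // Surrounds already-encoded text bytes with the ASCII-only markup above.
    static byte[] wrap(byte[] textBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] head = HEAD.getBytes(StandardCharsets.US_ASCII);
        byte[] tail = TAIL.getBytes(StandardCharsets.US_ASCII);
        out.write(head, 0, head.length);
        out.write(textBytes, 0, textBytes.length);
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }

    // Returns true if the parser accepts the bytes, false if it rejects them.
    static boolean parses(byte[] xmlBytes) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // U+4F60 U+597D ("ni hao") encoded as UTF-8: matches the declaration.
        byte[] asUtf8 = "\u4F60\u597D".getBytes(StandardCharsets.UTF_8);
        // The same two characters encoded as GB2312: does not match the declaration.
        byte[] asGb2312 = {(byte) 0xC4, (byte) 0xE3, (byte) 0xBA, (byte) 0xC3};

        System.out.println("UTF-8 bytes parse:  " + parses(wrap(asUtf8)));
        System.out.println("GB2312 bytes parse: " + parses(wrap(asGb2312)));
    }
}
```

If this is what is happening, the place to look is every String-to-bytes and bytes-to-String conversion on the way to and from the database (for instance calls to getBytes() or new String(bytes) without an explicit charset), rather than the XML declaration itself.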

My first question is: if both my XML and my DB specify UTF-8 as the encoding, why am I still having problems parsing the XML text?

Someone mentioned that I have to tell the parser the character set I used, which is "GB2312": just because JAXB supports UTF-8 does not mean it knows how to convert Chinese text into UTF-8. Once I changed the encoding to GB2312, the program worked, and reading the XML text back was no problem.

However, my question remains: if I need to support both Japanese and Chinese text in the same XML, how do I specify the encoding, since I would now have two different encodings? Do I have to convert my text into UTF-8 myself and set the XML encoding to UTF-8?
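
On the mixed-language point: UTF-8 is an encoding of Unicode, which covers both Chinese and Japanese characters, so a single document declared as UTF-8 can hold both. A minimal sketch, assuming the JDK's built-in JAXP DOM parser instead of JAXB to keep it self-contained. The class name, element names, and sample text are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class MixedLanguageXml {
    // Builds one UTF-8 document containing Chinese and Japanese text,
    // parses it, and returns the text of <zh> and <ja> in that order.
    static String[] parseMixed() {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<greetings>"
                + "<zh>\u4F60\u597D</zh>"                   // Chinese: "ni hao"
                + "<ja>\u3053\u3093\u306B\u3061\u306F</ja>" // Japanese: "konnichiwa"
                + "</greetings>";
        // Java strings are Unicode internally; the charset matters only
        // when they are converted to bytes, so it is named explicitly here.
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            String zh = doc.getElementsByTagName("zh").item(0).getTextContent();
            String ja = doc.getElementsByTagName("ja").item(0).getTextContent();
            return new String[] {zh, ja};
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        String[] texts = parseMixed();
        System.out.println("zh round-trips: " + texts[0].equals("\u4F60\u597D"));
        System.out.println("ja round-trips: " + texts[1].equals("\u3053\u3093\u306B\u3061\u306F"));
    }
}
```

The text is written with Unicode escapes only so that the example does not depend on the encoding of the source file itself.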

Another question: what is the relationship between UTF-8 and all the other character sets (GB2312, Big5, etc.)?
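
As a rough illustration of that relationship: UTF-8, UTF-16, GB2312, and Big5 are all ways of writing characters as bytes. The UTF encodings cover all of Unicode, while GB2312 and Big5 each cover a regional subset. In Java the String itself holds Unicode characters, and a charset is applied only when converting to or from bytes. The sketch below is hypothetical (class and method names are invented), and the GB2312 line is guarded because that charset is not guaranteed to be present in every Java runtime.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetVsUnicode {
    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes the text as UTF-8 and returns the bytes as hex.
    static String utf8Hex(String text) {
        return hex(text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String text = "\u4F60\u597D"; // two Chinese characters, U+4F60 and U+597D

        // Same characters, different byte sequences per encoding.
        System.out.println("UTF-8:    " + utf8Hex(text));
        System.out.println("UTF-16BE: " + hex(text.getBytes(StandardCharsets.UTF_16BE)));
        if (Charset.isSupported("GB2312")) {
            System.out.println("GB2312:   " + hex(text.getBytes(Charset.forName("GB2312"))));
        }
    }
}
```

Seen this way, the encoding named in the XML declaration only has to match the bytes actually written; it does not have to correspond to the language of the text.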

My reasoning so far: since an XML file must be in one of the languages, it must use one of the character sets, and in turn the encoding attribute in the XML must name that character set, NOT "UTF-8" (since the parser does not know how to convert characters into UTF-8 without knowing the character set in use). If so, when would we ever use "UTF-8" as the encoding in our XML?

Yan Zhou
Another issue I do not understand: if UTF-8 should not be used when I am inputting Chinese text (and GB2312 should be used instead), why does JAXB not report an error when marshalling the text, but only when unmarshalling it?
