Is it possible to support multiple languages in one XML document, e.g., Japanese and Chinese text in the same file? If so, how would I specify the encoding in that XML: simply UTF-8?
The reason for this question is that I ran into issues handling Chinese text in my Java program. I use the latest JAXB for XML marshalling and unmarshalling, and store the text in a Unicode (UTF-8) PostgreSQL database (latest version).
When I type in Chinese text, JAXB has no problem marshalling it into an XML string with the encoding set to UTF-8, and my code successfully saves the XML text into the database. But when I read it back, the JAXB Unmarshaller fails on the XML string I just read from the DB with the error: "invalid byte 2 of 3 byte UTF-8 character".
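To make the error message concrete, here is a minimal sketch of what "invalid byte" means at the byte level. It does not use JAXB or PostgreSQL; the class name Utf8Check and the hand-written GB2312 bytes are assumptions for illustration only. It shows that bytes produced in one encoding are rejected when they are decoded strictly as UTF-8, which is one possible way such an error can arise somewhere between writing and reading.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of JAXB or any library.
class Utf8Check {
    /** Returns true if the bytes are well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            // A freshly created decoder reports malformed input
            // instead of silently replacing it.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Two Chinese characters (U+4E2D U+6587) encoded as UTF-8: 3 bytes each.
        byte[] utf8 = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        // The same two characters as GB2312 bytes, written out by hand
        // so the sketch does not depend on GB2312 being installed.
        byte[] gb2312 = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

        System.out.println("UTF-8 bytes valid as UTF-8:  " + isValidUtf8(utf8));
        System.out.println("GB2312 bytes valid as UTF-8: " + isValidUtf8(gb2312));
    }
}
```

If the string read back from the database fails a check like this, the bytes were changed somewhere between the marshaller and the unmarshaller, for example by a conversion that used a different charset than the one declared in the XML.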
My first question is: if both my XML and my DB specify UTF-8 as the encoding, why do I still have a problem parsing the XML text?
Someone mentioned that I have to tell the parser which character set I used, which is "GB2312": just because JAXB supports UTF-8 does not mean it knows how to convert Chinese text into UTF-8. Once I changed the encoding to GB2312, the program worked, and reading the XML text back was no longer a problem.
However, my question remains: if I need to support both Japanese and Chinese text in the same XML, how do I specify the encoding, since I would now have two different encodings? Do I have to convert my text into UTF-8 myself and set the XML encoding to UTF-8?
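For reference, the scenario asked about can be sketched with the JDK's built-in DOM parser instead of JAXB. The class name MixedLanguageXml and the element names doc, zh and ja are made up for this example. The sketch puts Chinese and Japanese text in one document declared as UTF-8, encodes it as UTF-8 bytes, and parses it back.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical example class; element names are illustrative only.
class MixedLanguageXml {
    /**
     * Builds a small XML document declared as UTF-8, encodes it as UTF-8
     * bytes, parses it, and returns the two text values that were read back.
     * The inputs must not contain XML special characters such as '<' or '&'.
     */
    static String[] roundTrip(String chinese, String japanese) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<doc><zh>" + chinese + "</zh><ja>" + japanese + "</ja></doc>";
        // The declared encoding and the actual byte encoding must agree.
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
        return new String[] {
            doc.getElementsByTagName("zh").item(0).getTextContent(),
            doc.getElementsByTagName("ja").item(0).getTextContent()
        };
    }

    public static void main(String[] args) throws Exception {
        String chinese = "\u4e2d\u6587";                                  // Chinese sample
        String japanese = "\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8"; // Japanese sample
        String[] result = roundTrip(chinese, japanese);
        System.out.println("Chinese survived:  " + chinese.equals(result[0]));
        System.out.println("Japanese survived: " + japanese.equals(result[1]));
    }
}
```

The strings are written with \u escapes so the result does not depend on the encoding of the source file itself.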
Another question: what is the relationship between UTF-8 and all the character sets (GB2312, Big5, etc.)?
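One way to look at this relationship is to print the bytes that different encodings produce for the same character. The sketch below is only an illustration; the class name EncodingBytes and the helper hex are made up here. It encodes the character U+4E2D as UTF-8 and as UTF-16BE, and also as GB2312 if the running JVM happens to support that charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical example class, not part of any library.
class EncodingBytes {
    /** Encodes the text with the given charset and returns the bytes as hex. */
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d"; // one Chinese character, code point U+4E2D
        System.out.println("UTF-8:    " + hex(zhong, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE: " + hex(zhong, StandardCharsets.UTF_16BE));
        // GB2312 is not guaranteed to be present in every Java runtime.
        if (Charset.isSupported("GB2312")) {
            System.out.println("GB2312:   " + hex(zhong, Charset.forName("GB2312")));
        }
    }
}
```

In Java the String itself holds Unicode characters; a byte encoding such as UTF-8 or GB2312 only comes into play when the String is turned into bytes or read back from bytes.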
My understanding is this: since an XML file must be in one of the languages, it must use one of the character sets, and in turn the encoding attribute in the XML must name that character set, NOT "UTF-8" (since the parser does not know how to convert characters into UTF-8 without knowing the character set in use). If so, when would we ever use "UTF-8" as the encoding in our XML?
Another thing I do not understand: if UTF-8 should not be used when I am entering Chinese text (and GB2312 should be used instead), why does JAXB not report an error when marshalling the text, but only when unmarshalling it?