GeeCON Prague 2014*
String, UTF-8, Unicode and DB

Rudy Simon Yeung
Greenhorn

Joined: Jun 06, 2003
Posts: 15
Is a Java String a Unicode string? If so, and I create a database whose codeset is UTF-8, should I be able to read back exactly the same string I wrote to the database, regardless of the encodings I specify when constructing the string? For example:
If I put an apple and an orange defined below into the UTF-8 database, will I get back the apple and orange since they are unicode strings?
String apple = new String(unicodeString.getBytes("CP937"), "CP037");
String orange = new String(unicodeString.getBytes("CP037"), "CP937");
a young
Greenhorn

Joined: Aug 05, 2003
Posts: 11
A Java String is Unicode. This may help you in the future:
http://java.sun.com/docs/books/tutorial/native1.1/implementing/string.html
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
If I put an apple and an orange defined below into the UTF-8 database, will I get back the apple and orange since they are unicode strings?
Not necessarily. The problem is, it will only work if all the Unicode chars in the "apple" and "orange" are chars that can be encoded in both CP937 and CP037. I'm not familiar with those encodings, so I don't know. But I know that, for example, if you use most Western encodings and try to encode chars from Asian languages, those chars cannot be encoded, and you will likely get a '?' instead. If something like this happens, you will be unable to reconstruct the original apple or orange.
There are some encodings which are guaranteed to be able to represent all Unicode chars - or at least, all that are used in Java. (Long story.) UTF-8 and UTF-16 are the best-known examples of this; there are probably others.
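To make the difference concrete, here is a minimal sketch of an encode/decode round trip. It uses ISO-8859-1 as a stand-in for "an encoding that cannot represent every char", because CP937 and CP037 are not guaranteed to be installed on every JRE; the class and method names are just made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encode the text to bytes in the given charset, then decode those bytes with the same charset.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // ASCII letters plus one Chinese char (U+6A59), written as an escape
        // so the encoding of the source file itself does not matter.
        String text = "apple \u6a59";

        // UTF-8 and UTF-16 can represent every char, so the string survives.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text));  // true
        System.out.println(roundTrip(text, StandardCharsets.UTF_16).equals(text)); // true

        // ISO-8859-1 has no mapping for U+6A59, so getBytes() substitutes '?'.
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1));          // apple ?
    }
}
```

Once the '?' has been substituted, the original char is gone; no later conversion can bring it back.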


"I'm not back." - Bill Harding, Twister
Rudy Yeung
Ranch Hand

Joined: Dec 27, 2000
Posts: 183
A database with UTF-8 codeset is actually a unicode database, and also a java string is a unicode string. If that is the case, should not I get back the same orange and apple unicode strings though I instantiate the strings using both 'CP937' and 'CP037'? Actually, I experimented with this before, using some Unicode text containing Chinese characters, and I really did get back the exact Unicode text from the database without any '?'. My only concern is whether some Chinese characters still cannot be converted, resulting in '?' instead.
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
A database with UTF-8 codeset is actually a unicode database, and also a java string is a unicode string. If that is the case, should not I get back the same orange and apple unicode strings though I instantiate the strings using both 'CP937' and 'CP037'?
What does the database have to do with CP937 and CP037? The database uses UTF-8, so it won't lose any chars by converting them into '?' or something similar. But if you convert your strings to or from other encodings, such as CP937 and CP037, you may indeed lose chars. Why are you using those encodings anyway?
It seems that both those encodings are for Chinese predominantly, so I'm guessing that if you just use Chinese chars, you will not experience any '?'. Also you can probably use most Western ASCII chars; they only take up a few spaces out of all the Chinese chars used, so they probably made space for them. However, if you try encoding chars from other alphabets, you will probably find that CP937 and CP037 do not support them. Try these:

Of course, you may not actually need any of these chars anyway - I'm just trying to demonstrate the range of possibilities.
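One way to run this kind of check yourself is to ask a CharsetEncoder whether it can encode a given char. The sketch below is only an illustration: the sample chars are my own picks from a few alphabets, and it defaults to ISO-8859-1 because CP937 and CP037 may not be present on every JRE. Pass a charset name such as Cp937 as the first argument to test that one instead, assuming your JRE supports it.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class EncodableCheck {
    // Returns true if every char in the text has a mapping in the given charset.
    static boolean canEncode(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        // Default to ISO-8859-1; Charset.forName throws if the requested name is not supported.
        Charset charset = args.length > 0
                ? Charset.forName(args[0])
                : StandardCharsets.ISO_8859_1;

        // Hypothetical sample chars, written as escapes.
        String[] samples = {
            "A",       // Latin capital A
            "\u00e9",  // Latin small e with acute
            "\u03a9",  // Greek capital omega
            "\u044f",  // Cyrillic small ya
            "\u4e2d"   // a common Chinese char
        };

        for (String sample : samples) {
            // Print the code point rather than the char, so console encoding does not confuse the result.
            System.out.printf("U+%04X encodable in %s: %b%n",
                    (int) sample.charAt(0), charset.name(), canEncode(sample, charset));
        }
    }
}
```

Any char reported as not encodable is one that would come back as '?' after a getBytes() call with that charset.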
Rudy Yeung
Ranch Hand

Joined: Dec 27, 2000
Posts: 183
It seems that both those encodings are for Chinese predominantly, so I'm guessing that if you just use Chinese chars, you will not experience any '?'. Also you can probably use most Western ASCII chars; they only take up a few spaces out of all the Chinese chars used, so they probably made space for them.
Yes, our application only deals with Chinese and English characters. CP937 is EBCDIC Chinese, whereas CP037 is EBCDIC English. Our front end application performs a double encoding from Unicode to CP037 and then to CP937, and then sends this CP937-encoded string to the back-end AS/400, which uses the CP937 codepage, for processing. The AS/400 acknowledges by sending the message back. Our front end then performs a double decoding from CP937 to CP037 and then back to Unicode. We log both the sent and received CP937 encoded messages into our UTF-8 database.
From your reply, it seems that our application should be safe and should not have any '?'.
Rudy
[ August 11, 2003: Message edited by: Rudy Yeung ]
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
We log both the sent and received CP937 encoded messages into our UTF-8 database.
As what data type? BLOB? RAW? VARCHAR? The DB should only be using UTF-8 for character data, e.g. VARCHAR, so if you're using BLOB or RAW, the UTF-8 isn't relevant. If you're using VARCHAR, though, then this whole thing sounds pretty convoluted, with a message triple-encoded in CP037, CP937, and UTF-8. I don't really know enough about how the first two work to say if this is OK or not; I guess you just have to test it. Sounds pretty weird though. I'd think you'd want to decode the message before saving it as character data.
From your reply, it seems that our application should be safe and should not have any '?'.
Ummm, no promises. You'd better test carefully, I think.
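A rough sketch of what such a test could look like: "wrap" a string by encoding it in one charset and decoding the bytes as another (the same shape as the apple above), then reverse the steps and compare with the original. UTF-8 and ISO-8859-1 are used here only as stand-ins for CP937 and CP037, and the wrap/unwrap names are invented for this example; substitute the real charsets and real message samples when testing the actual application.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DoubleEncodingDemo {
    // Encode with `inner`, then read the bytes as if they were `outer`.
    // Same shape as: new String(s.getBytes("CP937"), "CP037")
    static String wrap(String text, Charset inner, Charset outer) {
        return new String(text.getBytes(inner), outer);
    }

    // Reverse of wrap: turn the chars back into bytes via `outer`, then decode as `inner`.
    static String unwrap(String wrapped, Charset inner, Charset outer) {
        return new String(wrapped.getBytes(outer), inner);
    }

    public static void main(String[] args) {
        Charset inner = StandardCharsets.UTF_8;      // stand-in for CP937 (assumption)
        String original = "order \u6a59";            // English text plus one Chinese char

        // ISO-8859-1 maps every byte value 0x00-0xFF to a distinct char,
        // so the bytes survive the detour and the original comes back.
        Charset outer = StandardCharsets.ISO_8859_1; // stand-in for CP037 (assumption)
        String restored = unwrap(wrap(original, inner, outer), inner, outer);
        System.out.println(restored.equals(original)); // true

        // US-ASCII has no chars for bytes >= 0x80, so those bytes are lost during wrap.
        Charset lossy = StandardCharsets.US_ASCII;
        String broken = unwrap(wrap(original, inner, lossy), inner, lossy);
        System.out.println(broken.equals(original));   // false
    }
}
```

The round trip only works when the outer charset can represent every byte the inner charset produces, which is exactly the property worth verifying for the real pair of codepages.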
 