JavaRanch » Java Forums » Java » Java in General
Determine String encoding

Trish Wu
Ranch Hand

Joined: Oct 09, 2002
Posts: 34
Hello folks,
I have a problem with converting data to UTF-8.
My task involves an Oracle database table with a long field.
First of all, the data in this table can be Chinese/Japanese/English/Korean encoded.
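So when I retrieve the data, I will need to invoke:
String localStr = new String(rs.getBytes(CONTENT_INDEX),"SJIS");
String utf8Str = new String (localStr.getBytes("SJIS"), "UTF8");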
where rs is java.sql.ResultSet
Then I will need to store the utf8 String to a new database table.
Since the default encoding is "ISO8859_1" and I have no idea whether the data is
Chinese/Japanese/English/Korean encoded, how can I make the proper conversion?
The data is in a long field, so I will have to use getBytes() to get the data and
convert it to the local encoding.
So I am asking if there is any way that I can determine what these bytes'
original encoding was?
Is there anything in the Character class that I can make use of?
Please help.
Michael Borgwardt
Greenhorn

Joined: Dec 06, 2002
Posts: 9

First of all, the data in this table can be Chinese/Japanese/English/Korean encoded.
So when I retrieve the data, I will need to invoke:
String localStr = new String(rs.getBytes(CONTENT_INDEX),"SJIS");
String utf8Str = new String (localStr.getBytes("SJIS"), "UTF8");
where rs is java.sql.ResultSet
Then I will need to store the utf8 String to a new database table.

You're doing something horribly wrong there. First you take the bytes and interpret them as a SJIS-encoded string. Then you reverse the operation and re-interpret the bytes as a UTF-8 String. Then you store it into a DB using god knows what encoding. The result is very likely incorrect.


Since the default encoding is "ISO8859_1" and I have no idea whether the data is
Chinese/Japanese/English/Korean encoded, how can I make the proper conversion?

Simply put, you cannot. Different encodings may well encode different legal character sequences into the same byte sequence.
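
To make that concrete, here is a small sketch showing one pair of bytes that is perfectly legal in three different encodings and means something different in each. The class and method names are made up for illustration, and it assumes the JDK in use ships the Shift_JIS charset (standard JDK builds do).

```java
import java.nio.charset.Charset;

// Hypothetical demo class: the same bytes decoded under different charsets.
class AmbiguousBytes {
    // Decode raw bytes using the named charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] raw = { (byte) 0xC3, (byte) 0xA9 };
        // UTF-8: one character, U+00E9 (e with acute accent)
        System.out.println(decode(raw, "UTF-8").length());
        // ISO-8859-1: two characters, U+00C3 and U+00A9
        System.out.println(decode(raw, "ISO-8859-1").length());
        // Shift_JIS: two half-width katakana, U+FF83 and U+FF69
        System.out.println(decode(raw, "Shift_JIS").length());
    }
}
```

Nothing in the bytes themselves says which of the three readings was intended, which is why the encoding has to come from somewhere outside the data.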
Trish Wu
Ranch Hand

Joined: Oct 09, 2002
Posts: 34
Thanks for your reply, Michael.
I am doing this encoding conversion so that my client can upgrade to a newer version of the content management system, which is only capable of interpreting UTF8 data.
Actually, the Japanese (SJIS) to UTF8 conversion works fine. My problem is how to determine from the byte array that the data was originally SJIS encoded, because there is a lot of data and it is very hard for me to find out which rows are Chinese/Japanese/Korean encoded.
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
I once worked on a project similar to this; the database was set up so that one of the columns was a language code. The DBAs had set up rules so that by looking up the language code for a record, you could determine what encoding to use for the content (which was stored in the DB as raw bytes). I think you need to talk to the DBA for this database and have them explain how encoding is determined; if they don't have a solution for you, have them shot. There's no excuse for putting multilingual data in a database without knowing what encoding is being used. If the original DBA is no longer available (perhaps because he's already been shot) then study the fields of the table yourself to see if there's some column which correlates to the encoding used. Good luck.
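
A rough sketch of the language-code idea, in case it helps. Everything here is an assumption for illustration: the class name, the language codes, and above all the code-to-encoding table, which would have to come from the DBA's actual rules (Chinese data, for instance, could just as well be Big5 as GB2312).

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: picks a charset from a per-row language code.
class LanguageCharsets {
    // Assumed mapping only; replace with whatever rules the DBA confirms.
    private static final Map<String, String> BY_LANGUAGE = new HashMap<>();
    static {
        BY_LANGUAGE.put("ja", "Shift_JIS");
        BY_LANGUAGE.put("ko", "EUC-KR");
        BY_LANGUAGE.put("zh", "GB2312");
        BY_LANGUAGE.put("en", "ISO-8859-1");
    }

    // Decode the raw column bytes using the charset tied to the language code.
    static String decode(byte[] raw, String languageCode) {
        String charsetName = BY_LANGUAGE.get(languageCode);
        if (charsetName == null) {
            // Fail loudly rather than guess an encoding.
            throw new IllegalArgumentException(
                    "No encoding rule for language code: " + languageCode);
        }
        return new String(raw, Charset.forName(charsetName));
    }
}
```

Throwing on an unknown code is deliberate: silently falling back to a default encoding is how data like this gets corrupted without anyone noticing.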


"I'm not back." - Bill Harding, Twister
Michael Borgwardt
Greenhorn

Joined: Dec 06, 2002
Posts: 9
Trish, you are not doing a SJIS-to-UTF8 conversion, at least not with the code you posted. That code first converts from bytes to characters and back again using SJIS, which is a waste of time, then converts the bytes to characters using UTF-8, which is simply incorrect if the bytes are in SJIS.
If it works, then some other code is reversing the errors yours is creating.
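
For comparison, a minimal sketch of what a real SJIS-to-UTF-8 conversion looks like, assuming the bytes really are Shift_JIS: decode once with the source encoding to get a String, then encode that String with the target encoding. The class and method names are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: converts Shift_JIS bytes to UTF-8 bytes.
class SjisToUtf8 {
    static byte[] convert(byte[] sjisBytes) {
        // Step 1: bytes -> characters, using the encoding the bytes are actually in.
        String text = new String(sjisBytes, Charset.forName("Shift_JIS"));
        // Step 2: characters -> bytes, using the encoding you want to end up with.
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

A Java String has no encoding of its own, so there is no such thing as a "UTF-8 String". If the target column is written through JDBC with setString(), step 2 is normally the driver's job and only step 1 belongs in application code.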
Trish Wu
Ranch Hand

Joined: Oct 09, 2002
Posts: 34
Thank you Jim and Michael for your advice.
So I will have to assume certain rows of data are SJIS encoded while others are ISO8859_1, do the conversion and then tell my clients to verify with their own eyes.
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
I should add that I completely agree with Michael's comments - the conversion you're showing doesn't make any real sense. Actually I'm surprised it doesn't throw some sort of RuntimeException sometimes, as it seems like you'd find some SJIS byte sequences that are not legal in UTF-8. It may look OK for Roman alphabet characters, since they're probably encoded similarly in UTF-8 and SJIS - but for other chars, at least one of those strings is horribly wrong.
In fact, it seems like if you can't get any better indicator of which encoding to use, you might be able to make use of the fact that some byte sequences are illegal in UTF-8. First assume the input is UTF-8 and try to interpret it. If an exception is thrown, catch it and assume that means the encoding is really SJIS. If an exception is not thrown, well, hopefully that means it really was UTF-8. Or it might just mean the input was short, and didn't happen to contain any of the illegal sequences. So you've just got to make your best guess - it's not your fault if the DB doesn't contain the info you need to decode properly.
I'm not 100% sure that SJIS contains sequences which would be illegal in UTF-8 - perhaps SJIS has some properties that make this impossible. Or perhaps your data never happens to use the characters that would correspond to illegal UTF-8 sequences. But it's also possible that you simply haven't noticed the phenomenon yet - either because all the data you've seen so far is really UTF-8, or because somewhere you've got a catch block which is catching an exception without making you aware of the problem it signifies. (Which is of course a very, very bad idea.) If it turns out that SJIS bytes can always be interpreted as legal UTF-8 (even though the characters are gibberish), please let us know - it would be useful to know this in the future.
Note that if you take the check-for-encoding-error approach it would probably be a good idea to use a java.nio.charset.CharsetDecoder, which actually documents the exceptions it may throw - unlike the String() constructor which evidently considers encoding exceptions as RuntimeExceptions which need not be documented. Good luck.
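
The try-UTF-8-first idea could be sketched roughly as follows with CharsetDecoder. The class and method names are hypothetical, and the fallback to Shift_JIS is only the guess discussed in this thread, not a general detection method. As far as I know, current JDKs do not throw from the String constructor on bad bytes at all; they quietly substitute a replacement character, which is one more reason to use a decoder set to REPORT.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict decoding with a fallback guess.
class StrictDecode {
    // Returns the decoded text, or null if the bytes are not legal in the charset.
    static String tryDecode(byte[] raw, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            return null; // illegal byte sequence for this charset
        }
    }

    // Assume UTF-8 first; if that fails, guess Shift_JIS (may still return null).
    static String decodeUtf8OrSjis(byte[] raw) {
        String text = tryDecode(raw, StandardCharsets.UTF_8);
        if (text != null) {
            return text;
        }
        return tryDecode(raw, Charset.forName("Shift_JIS"));
    }
}
```

On the open question above: some SJIS data is certainly illegal as UTF-8. The SJIS bytes 93 FA 96 7B 8C EA start with 0x93, which can only be a continuation byte in UTF-8, so the strict UTF-8 decode rejects them. The reverse does not hold in general, so a successful UTF-8 decode is still only a best guess.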
JS Shirah
Greenhorn

Joined: Jan 23, 2003
Posts: 3
Just a word to say that there has been a *lot* of work in Java and among the various DBMSes for i18n and National Language Support ( NLS ). As your system is currently designed, you will always have to do an enormous amount of work ( even if there is some column that says "This is Korean" ), and end up with a database that is good for nothing but your application or others that mimic it. In other words, non-portable data.
If you need to handle multiple languages in the database, the answer, just as in the Java language, is Unicode. As far as I know, Oracle supports UTF-8, and your tables should be set up that way. Then you have no conversions to do. As a bonus, your application will run faster because no conversion code, even into Java, needs to be done. From what I understand of your app, there are probably no issues, other than removing lots of code, to stop you from recreating the database in Unicode even at this point.

Joe Sam
Joe Sam Shirah - http://www.conceptgo.com
conceptGO - Consulting/Development/Outsourcing
Java Filter Forum: http://www.ibm.com/developerworks/java/
Just the JDBC FAQs: http://www.jguru.com/faq/JDBC
Going International? http://www.jguru.com/faq/I18N
Que Java400? http://www.jguru.com/faq/Java400
 