Japanese character set

Anjali S Sharma
Ranch Hand

Joined: Jun 29, 2005
Posts: 279
What problems might one encounter when using a Japanese character set, and how can they be solved?

I have learned that the Japanese character set is a 2-byte character set. Will that have any effect on our application, which we intend to develop in Japanese?
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 39436
Not got a lot of time to reply at present, but it means some characters (those outside the Basic Multilingual Plane) need two chars per code point in a String. I posted about code points on this site in the last week; please search for that post, which I can't repeat for copyright reasons.

Don't know what else you will have to change.
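
For anyone following along, here is a minimal sketch of how chars and code points relate in a Java String. The class and method names are made up for illustration. As far as I know, common kana and kanji each fit in a single char, while rarer kanji from the supplementary planes (U+20BB7 is used here as an example) take a surrogate pair, that is, two chars.

```java
public class CodePointDemo {
    // Number of 16-bit char units in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "Nihongo" written with three BMP kanji, given as escapes so the
        // source file's own encoding does not matter.
        String common = "\u65E5\u672C\u8A9E";
        // U+20BB7, a supplementary-plane kanji, written as a surrogate pair.
        String rare = "\uD842\uDFB7";

        System.out.println(charCount(common));      // 3
        System.out.println(codePointCount(common)); // 3
        System.out.println(charCount(rare));        // 2
        System.out.println(codePointCount(rare));   // 1
    }
}
```

So code that walks a String char by char is fine for most Japanese text, but should iterate by code point if rare characters can turn up.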
Anjali S Sharma
Ranch Hand

Joined: Jun 29, 2005
Posts: 279
Originally posted by Campbell Ritchie:
Not got a lot of time to reply at present, but it means some characters (those outside the Basic Multilingual Plane) need two chars per code point in a String. I posted about code points on this site in the last week; please search for that post, which I can't repeat for copyright reasons.

Don't know what else you will have to change.


Thanks for replying. Does that mean I have to use UTF-16, or will UTF-8 work as well?
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 39436
You're welcome. Got a bit more time now, and have found my old post, here. The quote I posted suggests you probably will have to use UTF-16.
Anjali S Sharma
Ranch Hand

Joined: Jun 29, 2005
Posts: 279
Originally posted by Campbell Ritchie:
You're welcome. Got a bit more time now, and have found my old post, here. The quote I posted suggests you probably will have to use UTF-16.


Thanks for the post.

This is what I have come to understand. If there is anything to correct in the list or add to it, please let me know.


We can use either UTF-8 or UTF-16 without any problems. UTF-8 is the better choice if there is plenty of Western text too; otherwise it becomes less efficient, often using 3 bytes and sometimes 4 per character. The only things one needs to watch out for when using Japanese characters are:

1. Files should be read and written using a Reader/Writer (which handles the character encoding) and not a raw InputStream/OutputStream.
2. Any other software being used (such as XML parsers) should use the same encoding (UTF-8 or UTF-16) with which the file to be parsed was created.
3. The database encoding must also be set correctly.
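
Point 1 might look something like the sketch below. It assumes Java 7 or later (for StandardCharsets and try-with-resources) and names the charset explicitly instead of relying on the platform default. The class name, method names and temp file are only for illustration.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class EncodingRoundTrip {
    // Write text to the file as UTF-8, whatever the platform default is.
    static void write(Path file, String text) throws IOException {
        try (Writer out = new OutputStreamWriter(
                Files.newOutputStream(file), StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    // Read the whole file back, decoding it as UTF-8.
    static String read(Path file) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new InputStreamReader(
                Files.newInputStream(file), StandardCharsets.UTF_8)) {
            char[] buf = new char[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("japanese", ".txt");
        String text = "\u3053\u3093\u306B\u3061\u306F"; // "konnichiwa" in hiragana
        write(file, text);
        System.out.println(text.equals(read(file))); // true
        Files.delete(file);
    }
}
```

The same pattern should work for UTF-16 by swapping in StandardCharsets.UTF_16 on both sides; what matters is that the reader and the writer agree.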
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 39436
Don't know any more about it, but I think your 3 points are all correct. Not certain, however.
Guido Sautter
Ranch Hand

Joined: Dec 22, 2004
Posts: 142
What's the difference between UTF-8 and UTF-16 with regard to which characters (code points) they can encode? I used to think both were (slightly different) ways of representing Unicode characters (code points) as bytes ...
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18675

Originally posted by Campbell Ritchie:
Don't know any more about it, but I think your 3 points are all correct. Not certain, however.


Number 1 is correct provided that the encoding of the file is the same as your system's default encoding. (This is unlikely to be the case if the file's encoding is UTF-8 or UTF-16.) Or provided you use an InputStreamReader or OutputStreamWriter which specifies the correct encoding.

Number 2: if your XML document is encoded in UTF-8 or UTF-16, then the standard XML parsers will be able to detect that and work correctly. (That's required by the XML spec.) This is provided you give them the chance. So pass them an InputStream or a File or a URL and they will deal with it. If you pass a Reader, then it's your responsibility to get the encoding right, so it's best not to do that.

Number 3: definitely.
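
A rough sketch of the point about Number 2, using the JDK's built-in DOM parser: the same document is encoded once as UTF-8 and once as UTF-16, and the parser is handed raw bytes through an InputStream so it can work out the encoding itself. The class name, helper names and sample document are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class XmlEncodingDemo {
    // Build a tiny XML document and encode it with the given charset.
    static byte[] encode(String text, String declaredName, Charset cs) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + declaredName + "\"?>"
                + "<greeting>" + text + "</greeting>";
        return xml.getBytes(cs);
    }

    // Parse from an InputStream so the parser detects the encoding itself.
    static String rootText(byte[] xmlBytes) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (InputStream in = new ByteArrayInputStream(xmlBytes)) {
            Document doc = builder.parse(in);
            return doc.getDocumentElement().getTextContent();
        }
    }

    public static void main(String[] args) throws Exception {
        String text = "\u3053\u3093\u306B\u3061\u306F"; // "konnichiwa" in hiragana
        byte[] utf8 = encode(text, "UTF-8", StandardCharsets.UTF_8);
        // StandardCharsets.UTF_16 writes a byte order mark when encoding.
        byte[] utf16 = encode(text, "UTF-16", StandardCharsets.UTF_16);

        System.out.println(text.equals(rootText(utf8)));  // true
        System.out.println(text.equals(rootText(utf16))); // true
    }
}
```

Wrapping the stream in a Reader first would take that detection away from the parser, which is why passing the InputStream (or a File or URL) is the safer habit.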
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18675

Originally posted by Guido Sautter:
What's the difference between UTF-8 and UTF-16, with regard to what characters (code points) they can encode?
They can both encode all Unicode characters.
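
A small sketch to illustrate: the same strings survive a round trip through either encoding, and only the byte counts differ. UTF_16BE is used here so that no byte order mark is counted; the class and method names are just for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8VsUtf16 {
    // Number of bytes the string occupies in the given encoding.
    static int byteLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    // True if encoding and then decoding gives back the original string.
    static boolean roundTrips(String s, Charset cs) {
        return s.equals(new String(s.getBytes(cs), cs));
    }

    public static void main(String[] args) {
        String japanese = "\u65E5\u672C\u8A9E"; // three kanji
        String western = "Java";

        System.out.println(roundTrips(japanese, StandardCharsets.UTF_8));    // true
        System.out.println(roundTrips(japanese, StandardCharsets.UTF_16BE)); // true

        System.out.println(byteLength(japanese, StandardCharsets.UTF_8));    // 9
        System.out.println(byteLength(japanese, StandardCharsets.UTF_16BE)); // 6
        System.out.println(byteLength(western, StandardCharsets.UTF_8));     // 4
        System.out.println(byteLength(western, StandardCharsets.UTF_16BE));  // 8
    }
}
```

So the choice between the two is about size and compatibility with other tools, not about which characters can be represented.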
Guido Sautter
Ranch Hand

Joined: Dec 22, 2004
Posts: 142
Originally posted by Paul Clapham:
They can both encode all Unicode characters.


Then why did you answer the question of whether UTF-8 would work or UTF-16 had to be used by saying that UTF-16 was the one to use? I thought UTF-8 would work just as well. That confused me a little ...
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18675

Originally posted by Guido Sautter:
Then why did you answer the question of whether UTF-8 would work or UTF-16 had to be used by saying that UTF-16 was the one to use? I thought UTF-8 would work just as well. That confused me a little ...
I'm not sure what you are saying I said, but I don't see anywhere I said anything like what you seem to be saying I said. I'm confused.
Guido Sautter
Ranch Hand

Joined: Dec 22, 2004
Posts: 142
Sorry Paul, Campbell was the one who said what confused me ...
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 39436
I'm sorry about that.