How to display Cyrillic in Java

Thomas Kiersted
Greenhorn

Joined: Feb 24, 2007
Posts: 22

Not really sure where to put this. I thought it would be easy enough, but I can't figure out how. Basically I need to be able to display Cyrillic as well as Roman characters, but right now I can only display Roman ones, and I can't figure out how to get Cyrillic. I'm importing from a Unicode text file using Scanner, and I think it's being read in as Unicode, but am not sure. The problem is, whenever I print it out, all I get is gobbledygook. This happens whether I print to the console (tested on both Windows Command Prompt and OSX Terminal) or label a Button with the Cyrillic text (using the Abstract Windowing Toolkit to create the button).

If it were just the command prompt that had a problem, I'd say it's Windows and OSX that are at fault, but since it fails even when I try to display it using the AWT I'm thinking I need to change something within Java. Do I need an international edition? If so, where would I download it (Google didn't turn one up)? Or do I just need to import some package to make Unicode magically work? Or is Scanner not the way to read in Unicode non-Roman characters?

Platform is Windows XP SP3. I'd prefer if it worked on OSX as well, but I'd settle for just Windows. The Cyrillic text is displayed correctly in Notepad, and I have chosen "Unicode" encoding (other options are Unicode big endian and UTF-8; UTF-8 causes an exception and big endian results in slightly different gobbledygook).

Thanks - never thought such a seemingly simple task would be so difficult!
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 39548
Java has full support for Unicode - there's no need to install anything else.

Terminals and consoles are not good testbeds for Unicode text, since most of them only support the ISO-8859-1 character range.

How are you constructing the Scanner? You need to tell it which encoding the text is in; there are several Scanner constructors that take an encoding as an additional parameter.
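As a rough illustration of the constructor point, here is a minimal sketch that decodes the same bytes twice, once with the right charset name and once with the wrong one. The class name ScannerEncodingSketch and the helper firstLine are made up for this example, and an in-memory byte array stands in for the real file.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical demo class, not code from this thread.
class ScannerEncodingSketch {

    // Reads the first line from the stream, decoding its bytes with the named charset.
    static String firstLine(InputStream in, String charsetName) {
        try (Scanner scanner = new Scanner(in, charsetName)) {
            return scanner.hasNextLine() ? scanner.nextLine() : "";
        }
    }

    public static void main(String[] args) {
        // Six Cyrillic letters, written as escapes so the source file stays ASCII,
        // then encoded as UTF-8 to play the role of the file contents.
        byte[] utf8 = "\u041f\u0440\u0438\u0432\u0435\u0442".getBytes(StandardCharsets.UTF_8);

        String right = firstLine(new ByteArrayInputStream(utf8), "UTF-8");
        String wrong = firstLine(new ByteArrayInputStream(utf8), "ISO-8859-1");

        System.out.println(right.length()); // 6: one char per Cyrillic letter
        System.out.println(wrong.length()); // 12: every byte became its own char
    }
}
```

With a real file you would presumably pass a FileInputStream (or use the Scanner(File, String) constructor) together with whatever charset the file was actually saved in.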

Lastly, you can check which characters have been read by iterating through the resulting String, and printing out the Unicode values by calling String.codePointAt(int).
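A minimal sketch of that check, assuming the text has already been read into a String. The class name CodePointDump and the method dump are invented for illustration; only String.codePointAt(int) and Character.charCount(int) come from the standard library.

```java
// Hypothetical helper, not code from this thread.
class CodePointDump {

    // Returns every code point of the text in U+XXXX form, separated by spaces.
    static String dump(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", codePoint));
            // Step over one or two chars, depending on whether this was a surrogate pair.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Cyrillic capital Zhe followed by Latin 'a'.
        System.out.println(dump("\u0416a")); // U+0416 U+0061
    }
}
```

If the file was decoded correctly, Russian letters should show up in the U+0410 to U+044F range; values such as U+00D0 or U+00D1 in their place would suggest the bytes were decoded with the wrong charset.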


Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 36508
Java is supposed to support the whole of Unicode, but the command prompt on Windows only seems to support ASCII; it won't even print � correctly for me.

Suggest you scan a little of the file, then split your text into chars with the String#toCharArray method then print each char using a %x tag so it comes out in hex.
Then compare the values with the Unicode (I think they will run from 0410 to 044f); if they are correct then you can presume the Java is reading the text correctly.
The bit about (i & 7) == 0 ? '\n' : '\0' inserts a newline every 8 places. You can see it works nicely on a Linux console.
campbell@linux-pgix:~/java> java RussianPrinter

0410=А0411=Б0412=В0413=Г0414=Д0415=Е0416=Ж0417=З
0418=И0419=Й041a=К041b=Л041c=М041d=Н041e=О041f=П
0420=Р0421=С0422=Т0423=У0424=Ф0425=Х0426=Ц0427=Ч
0428=Ш0429=Щ042a=Ъ042b=Ы042c=Ь042d=Э042e=Ю042f=Я
0430=а0431=б0432=в0433=г0434=д0435=е0436=ж0437=з
0438=и0439=й043a=к043b=л043c=м043d=н043e=о043f=п
0440=р0441=с0442=т0443=у0444=ф0445=х0446=ц0447=ч
0448=ш0449=щ044a=ъ044b=ы044c=ь044d=э044e=ю044f=я
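A program along these lines would produce a table like the one above. This is a reconstruction under assumptions, not necessarily the original RussianPrinter source: the table method is invented, and the only parts taken from the post are the class name, the 0410 to 044f range, and the (i & 7) == 0 ? '\n' : '\0' expression.

```java
// Sketch of a RussianPrinter-style program; details are assumed.
class RussianPrinter {

    // Builds "hex=char" pairs for U+0410..U+044F, eight pairs per row.
    static String table() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0x0410; i <= 0x044f; i++) {
            // When the low three bits are zero (every 8th value) start a new row;
            // otherwise a NUL character is written, which consoles normally do not show.
            char separator = (i & 7) == 0 ? '\n' : '\0';
            sb.append(String.format("%c%04x=%c", separator, i, (char) i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(table());
    }
}
```

Because 0x0410 itself has its low three bits clear, the output starts with an empty line, which matches the blank line before the first row above.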
Try with an IDE like Eclipse or NetBeans which are written in Java and ought to support Unicode for their displays.

[edit]Change 0x043f to 0x044f[/edit]
[ September 25, 2008: Message edited by: Campbell Ritchie ]
Thomas Kiersted
Greenhorn

Joined: Feb 24, 2007
Posts: 22

Right now I'm using the standard Scanner construction:



I'm guessing I need this one:

public Scanner(InputStream source, String charsetName)

but I'm not sure which charset to use. The online documentation lists these:

Charset      Description
US-ASCII     Seven-bit ASCII, a.k.a. ISO646-US, a.k.a. the Basic Latin block of the Unicode character set
ISO-8859-1   ISO Latin Alphabet No. 1, a.k.a. ISO-LATIN-1
UTF-8        Eight-bit UCS Transformation Format
UTF-16BE     Sixteen-bit UCS Transformation Format, big-endian byte order
UTF-16LE     Sixteen-bit UCS Transformation Format, little-endian byte order
UTF-16       Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark

Would I use UTF-16? And create scanner like this:



I actually tried NetBeans and got gobbledygook there as well, which rather surprised me.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 39548
Most Unicode files come as UTF-8. The byte order mark, if present in the file, might also provide a clue.
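To make the byte order mark hint concrete, here is a small sketch that inspects the first bytes of a file's contents. BomSniffer and charsetFromBom are hypothetical names; the byte patterns are the standard marks for UTF-8 (EF BB BF), UTF-16 big-endian (FE FF) and UTF-16 little-endian (FF FE). UTF-32 marks are deliberately ignored here.

```java
// Hypothetical helper, not code from this thread.
class BomSniffer {

    // Returns the charset name suggested by a leading byte order mark,
    // or null when the bytes do not start with one of the marks checked here.
    static String charsetFromBom(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }

    public static void main(String[] args) {
        // FF FE followed by U+0410 in little-endian order (10 04).
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 0x10, 0x04};
        System.out.println(charsetFromBom(sample)); // UTF-16LE
    }
}
```

A null result only means no mark was found; the file could still be UTF-8 without a mark or a legacy single-byte encoding, so the name returned here is a hint to try first rather than a guarantee.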
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 36508
Try the Scanner without the encoding and print out the hex values of each char using the %x formatting tag and (int) casts on the chars. If that produces nonsense, then you know you need to change the encoding, and, as Ulf suggests, try UTF-8 first.
Thomas Kiersted
Greenhorn

Joined: Feb 24, 2007
Posts: 22

Thanks! Finally got this working - seems there were two main problems.

One was that the NetBeans console, it seems, really does not support Unicode - javax.swing does, however, so I'll just need to brush up my Swing skills.

The other was the encoding. With %x I was able to get the hex values well enough, but I was getting ASCII decimal values and thus hex values that were far too low. I couldn't find the byte-order mark, and a variety of charsets were failing, so I eventually decided maybe Notepad's Unicode was subpar. Sure enough, when I saved my file in Microsoft Word, things started working almost right away. Both the UTF-8 and Windows Cyrillic charsets worked perfectly in Java once it was Word that I was editing the file in.

So it looks like as long as I use Word as my text editor I should be okay from here on out - thanks for the help!
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 36508
Well done, and thank you for telling us what worked.
 