Not really sure where to put this - thought it would be easy enough, but I can't figure out how. Basically I need to be able to display Cyrillic as well as Roman characters, but right now I can only display Roman ones, and I can't figure out how to get Cyrillic. I'm importing from a Unicode text file using Scanner, and I think it's being read in as Unicode, but I'm not sure. The problem is that whenever I print it out, all I get is gobbledygook. This happens whether I print to the console (tested on both the Windows Command Prompt and the OS X Terminal) or label a Button with the Cyrillic text (using the Abstract Window Toolkit to create the button).
If it were just the command prompt that had a problem, I'd say Windows and OS X were at fault, but since it fails even when I try to display it using AWT, I'm thinking I need to change something within Java. Do I need an international edition? If so, where would I download it (Google didn't turn one up)? Or do I just need to import some package to make Unicode magically work? Or is Scanner not the way to read in non-Roman Unicode characters?
Platform is Windows XP SP3. I'd prefer it to work on OS X as well, but I'd settle for just Windows. The Cyrillic text displays correctly in Notepad, and I have chosen the "Unicode" encoding (the other options are Unicode big endian and UTF-8 - UTF-8 causes an exception, and big endian results in slightly different gobbledygook).
Thanks - never thought such a seemingly simple task would be so difficult!
Java is supposed to support the whole of Unicode, but the command prompt on Windows only seems to support ASCII; it won't even print � correctly for me.
Suggest you scan a little of the file, split your text into chars with the String#toCharArray method, and then print each char using a %x tag so it comes out in hex. Then compare the values with the Unicode code points (I think they will run from 0410 to 044F); if they are correct, then you can presume Java is reading the text correctly. The bit about (i & 7) == 0 ? '\n' : '\0' inserts a newline every 8 places. You can see it works nicely on a Linux console.
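The original snippet isn't quoted, but a minimal sketch of that check might look like this (the sample string stands in for a line scanned from the file, and I append the separator after each value, so the test becomes (i & 7) == 7 rather than 0):

```java
public class HexDump {
    // Format each char of s as a 4-digit hex code point, 8 values per line.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            sb.append(String.format("%04x", (int) chars[i]));
            sb.append((i & 7) == 7 ? '\n' : ' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "АБВГ" stands in for scanned Cyrillic text; the capital letters
        // should come out as 0410, 0411, 0412, 0413.
        System.out.println(dump("АБВГ"));
    }
}
```

If the hex values land in the 0410-044F range, the Scanner side is fine and the problem is purely in the output device.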
Right now I'm using the standard Scanner construction:
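The code itself isn't shown, but the standard construction presumably looks something like this (a sketch; the example writes a small temp file first so it can run on its own):

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Scanner;

public class DefaultScanner {
    // Read the first line of a file using the platform default charset.
    static String readFirstLine(File f) throws IOException {
        Scanner sc = new Scanner(f);  // no charset argument: default encoding
        try {
            return sc.nextLine();
        } finally {
            sc.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // A throwaway sample file stands in for the real Unicode text file.
        File f = File.createTempFile("sample", ".txt");
        FileWriter w = new FileWriter(f);
        w.write("hello\n");
        w.close();
        System.out.println(readFirstLine(f));  // prints "hello"
        f.delete();
    }
}
```

With no charset argument, Scanner decodes the bytes with the platform default encoding, which is exactly why a UTF-16 file comes out as garbage.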
I'm guessing I need this one:
public Scanner(InputStream source, String charsetName)
but I'm not sure which charset to use. The online documentation lists these:
Charset - Description
US-ASCII - Seven-bit ASCII, a.k.a. ISO646-US, a.k.a. the Basic Latin block of the Unicode character set
ISO-8859-1 - ISO Latin Alphabet No. 1, a.k.a. ISO-LATIN-1
UTF-8 - Eight-bit UCS Transformation Format
UTF-16BE - Sixteen-bit UCS Transformation Format, big-endian byte order
UTF-16LE - Sixteen-bit UCS Transformation Format, little-endian byte order
UTF-16 - Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark
Would I use UTF-16? And create scanner like this:
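The construction being asked about would presumably look something like this (a sketch; the example writes its own UTF-16 file, BOM included, so it can run on its own):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.Scanner;

public class Utf16Scanner {
    // Read the first line using an explicit charset; "UTF-16" honors
    // an optional byte-order mark at the start of the stream.
    static String readFirstLine(File f) throws IOException {
        Scanner sc = new Scanner(new FileInputStream(f), "UTF-16");
        try {
            return sc.nextLine();
        } finally {
            sc.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Write sample Cyrillic text as UTF-16 (Java's encoder adds a BOM).
        File f = File.createTempFile("cyr", ".txt");
        Writer w = new OutputStreamWriter(new FileOutputStream(f), "UTF-16");
        w.write("Привет\n");
        w.close();
        System.out.println(readFirstLine(f));  // prints "Привет"
        f.delete();
    }
}
```

Since Notepad's "Unicode" option is UTF-16 little-endian with a BOM, the BOM-aware "UTF-16" charset is the natural first thing to try for such a file.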
I actually tried NetBeans and got gobbledygook there as well, which rather surprised me.
Try the Scanner without the encoding and print out the hex values of each char using the %x formatting tag and (int) casts on the chars. If that produces nonsense, then you know you need to change the encoding; as Ulf suggests, try UTF-8 first.
Thanks! Finally got this working - seems there were two main problems.
One was that the NetBeans console, it seems, really does not support Unicode - javax.swing does, however, so I'll just need to brush up on my Swing skills.
The other was the encoding - with %x I was able to get the hex values well enough, but I was getting ASCII decimal values, and thus hex values that were way too low. I couldn't find the byte-order mark, and a variety of charsets were failing, so I eventually decided maybe Notepad's Unicode output was subpar. Sure enough, when I saved my file from Microsoft Word, things started working almost right away. Both the UTF-8 and Windows Cyrillic charsets worked perfectly in Java once it was Word that I was editing the file in.
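For reference, a byte-order mark can be checked by looking at the first raw bytes of the file - a sketch, with detectBom being my own helper name:

```java
public class BomCheck {
    // Identify a byte-order mark from the leading bytes of a file.
    static String detectBom(byte[] head) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE
                && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF
                && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return "none";
    }

    public static void main(String[] args) {
        // Notepad's "Unicode" option writes a UTF-16LE BOM: FF FE.
        byte[] notepadUnicode = {(byte) 0xFF, (byte) 0xFE, 0x1F, 0x04};
        System.out.println(detectBom(notepadUnicode));  // prints "UTF-16LE"
    }
}
```

Reading the first two or three bytes of the real file into such a method (e.g. via FileInputStream) would have shown which encoding Notepad actually wrote.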
So it looks like as long as I use Word as my text editor I should be okay from here on out - thanks for the help!