Mans Usman wrote:why utf-8 not working?
Campbell Ritchie wrote:Welcome to the Ranch
Campbell Ritchie wrote:
Your question is too difficult for the “Beginning” forum, so I shall move you.
Why are you reading your file with classes with names ending in Stream? For text files you should use a FileReader and BufferedReader (old style) or a Scanner. You should also use try with resources to close the reader, but don't close anything reading System.in. You will find more about the newer style of reading files in the Java™ Tutorials.
Matthew Bendford wrote:
Mans Usman wrote:why utf-8 not working?
Because even the modern command line interpreter cmd.exe doesn't know UTF-8, only the old code pages from the DOS era. If you want proper Unicode support, please use PowerShell or build your own console using a GUI toolkit like Swing.
Stephan van Hulst wrote:Open text files this way:
The ISO 8859-1 encoding is just an example. You can pick any encoding you want. If you want to use UTF-8, you can just omit the entire second argument, because UTF-8 is the default.
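The code from this post did not survive in the thread, so here is a minimal sketch of what such a call could look like, assuming `Files.newBufferedReader` is the method meant. The class and method names (`TextFileReading`, `readLines`) are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class TextFileReading {
    // Hypothetical helper: reads all lines, decoding the bytes with the given charset.
    static List<String> readLines(Path file, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        // try-with-resources closes the reader even if readLine() throws
        try (BufferedReader reader = Files.newBufferedReader(file, charset)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

Calling `readLines(path, StandardCharsets.ISO_8859_1)` matches the example encoding above; `Files.newBufferedReader(path)` without a charset decodes as UTF-8.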
The secret of how to be miserable is to constantly expect things are going to happen the way that they are "supposed" to happen.
You can have faith, which carries the understanding that you may be disappointed. Then there's being a willfully-blind idiot, which virtually guarantees it.
Campbell Ritchie wrote:Try using classes designed for text files. Look at the Java™ Tutorials link I gave you earlier.
The read() method, which I myself avoid like the plague, doesn't read a char. So you can expect incorrect results whenever you have characters with a Unicode value ≥ 0x80 (=128).
Many text file reading classes have constructors accepting the encoding. Look at this Scanner(...) constructor.
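As a rough illustration of that kind of constructor, the sketch below uses `Scanner(Path, String)`, which takes the charset name as its second argument. The helper name `ScannerReading.readLines` is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerReading {
    // Hypothetical helper: reads all lines with a Scanner that decodes using charsetName.
    static List<String> readLines(Path file, String charsetName) throws IOException {
        List<String> lines = new ArrayList<>();
        // The Scanner owns the file here, so it is safe to close it (unlike System.in)
        try (Scanner scanner = new Scanner(file, charsetName)) {
            while (scanner.hasNextLine()) {
                lines.add(scanner.nextLine());
            }
        }
        return lines;
    }
}
```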
Tim Holloway wrote:As was mentioned earlier, the code page in effect for the output device(s) you plan to target is critical. It has to be compatible with the code page you are outputting under.
I'm not sure about Windows command-line windows. Windows web browsers used to use CP-1252(?) and I think the command-line windows did as well. That's for English-language Windows distros.
Mans Usman wrote:Your tip didn't work
...
still doesn't recognize the cyrillic text
Stephan van Hulst wrote:
Then the file you are reading from is not UTF-8 encoded. The encoding that you use to read the file must match the encoding of the file. I suspect your file is encoded using either KZ-1048 or CP-1251. Try reading the file with any of the following three:
Charset.forName("KZ-1048")
Charset.forName("CP-1251")
Charset.defaultCharset()
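One way to compare the candidates side by side is to decode the same bytes strictly with each charset and look at the results. This is only a sketch: `CharsetProbe` and `tryDecode` are made-up names, the file path comes from a command-line argument, and `windows-1251` is added to the list as an assumption in case the JVM does not accept the other spellings. Names the JVM does not support are skipped rather than passed to `Charset.forName`.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Optional;

class CharsetProbe {
    // Strict decode: returns empty if the bytes are not valid in the given charset.
    static Optional<String> tryDecode(byte[] bytes, Charset charset) {
        try {
            return Optional.of(charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString());
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        String[] names = {"KZ-1048", "CP-1251", "windows-1251", Charset.defaultCharset().name()};
        for (String name : names) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not supported by this JVM");
                continue;
            }
            String text = tryDecode(bytes, Charset.forName(name)).orElse("<not valid in this charset>");
            System.out.println(name + ": " + text);
        }
    }
}
```

A single-byte charset such as Windows-1251 accepts almost any byte sequence, so a successful decode there only means the output is worth reading by eye. A failed strict UTF-8 decode, on the other hand, is a fairly reliable sign that the file is not UTF-8.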
Stephan van Hulst wrote:Try the Charset.defaultCharset() option anyway, I'm curious what happens.
Stephan van Hulst wrote:Try the Charset.defaultCharset() option anyway . . .
My JShell wrote:jshell> System.out.println(Charset.defaultCharset());
UTF-8
Stephan van Hulst wrote:Usman, could you run System.out.println(Charset.defaultCharset()); like Campbell did? I'd like to know what the encoding we were looking for was.
Stephan van Hulst wrote:In your previous issue, we saw that the existing file content is not encoded as UTF-8, but as Windows-1251, even though you said you made sure the file was encoded as UTF-8. Couldn't the same issue be at play here?
I'm guessing that whatever text editor you're using to look at the file, it first looks at a bit of your existing text which was encoded as Windows-1251, and then tries to interpret the entire file as Windows-1251. The text that you append to the file is being encoded as UTF-8, which is then wrongly interpreted and mangled by the text editor.
Restore the text file to what it was before you ran your code, and then try running the same code a second time, except replace StandardCharsets.UTF_8 on line 10 with Charset.forName("windows-1251").
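The original code with its "line 10" is not shown in the thread, so the following is only a sketch of the idea: append with the same charset the existing content uses. `AppendWithCharset.appendLine` is a hypothetical helper, and the use of `Files.newBufferedWriter` is an assumption about how the file is written.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class AppendWithCharset {
    // Hypothetical helper: appends one line, encoding it with the given charset.
    static void appendLine(Path file, String line, Charset charset) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(
                file, charset, StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            writer.write(line);
            writer.newLine(); // platform line separator
        }
    }
}
```

With `Charset.forName("windows-1251")` as the third argument, the appended bytes use the same encoding as the existing text, so an editor that guesses Windows-1251 should display the whole file consistently.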
Stephan van Hulst wrote:I'll check it tomorrow. Right now I can only tell you one more thing, but it sadly won't help you with your problem. The following lines:
do absolutely nothing.
You're first converting a string (which is a sequence of abstract characters) to a sequence of UTF-8 code units. Then you convert the sequence of UTF-8 code units back to a string. You end up with the exact same string you had before.
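The lines in question are missing from the thread, but the round trip described here can be sketched as follows. `RoundTrip.roundTrip` is a made-up name, and the sketch assumes the string is well formed (no unpaired surrogates).

```java
import java.nio.charset.StandardCharsets;

class RoundTrip {
    // Encodes to UTF-8 bytes and decodes them straight back: a no-op for well-formed strings.
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }
}
```

If the string was already mangled when it was read, this leaves it just as mangled. The fix has to happen at the point where bytes are first decoded.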
Matthew Bendford wrote:@OP
Again - as already mentioned - CMD.exe DOES NOT KNOW unicode/utf-8 - you just CAN NOT set proper utf-8 as codepage - IT DOES NOT WORK!
Please use PowerShell! It's the only option on windows to use proper unicode unless you write your very own CLI.
Why your text file ends up garbage:
String.getBytes("UTF-8");
The problem is that the string you try to read in as UTF-8 IS NOT proper UTF-8, and hence the encoding from char to byte produces the wrong bytes. Have a look at what String.toCharArray() gets you and print it like this:
You will see that what you read in as proper 16bit unsigned chars doesn't match utf-8 but some old dos-era codepage - which could be something like CP855.
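The snippet from this post is missing, so here is one possible way to print the chars, as a sketch only. `CharDump.dump` is a hypothetical helper that shows each 16-bit char as a hexadecimal code unit.

```java
class CharDump {
    // Formats every char of the string as U+XXXX, separated by spaces.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) c));
        }
        return sb.toString();
    }
}
```

Correctly decoded Cyrillic letters fall in the U+0400 to U+04FF block. If the dump of the text you read shows values elsewhere, the bytes were most likely decoded with the wrong charset before they ever became a String.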
TLDR: The root cause of your problem is using CMD and trying to set it to unicode - which isn't supported. M$ never bothered to implement it properly as they introduced powershell with NT6/Vista.