JavaRanch » Java Forums » Java » I/O and Streams

Japanese character not read or written correctly

Kevin Tysen
Ranch Hand

Joined: Oct 12, 2005
Posts: 255
My program reads lines from a text file with the method BufferedReader.readLine() and writes to another text file using BufferedWriter.write().
It works without any problems, usually, but when it encountered a certain Japanese character, I got some unexpected results. There was no problem with any other Japanese characters in the file; only this one character caused a problem.
(By "problem", I mean an unexpected result. What I wanted the program to do was read the text in one file and write it to another file, along with some other textual stuff.)
The Japanese character that caused the problem was the "no" in the word "nojo", which means "farm". Here it is in Japanese: 農場

According to what I know, the .readLine() method reads text, 2 bytes for each character, and when it comes to a carriage return or a linefeed character, it considers that to be the end of the line, and stops reading characters. So what I think is that perhaps one of the two bytes of the Japanese character was considered to be a carriage return or a linefeed, or maybe even a null. I don't know.
I ran this program on my Mac, which is OS10.0 and running Java 3.1. Ancient, right? The characters at the end of lines are different in Windows, I know, so on Windows there might be different results.
Anyone have any ideas about what is going on here?
Alan Moore
Ranch Hand

Joined: May 06, 2004
Posts: 262
First, some corrections:

The number of bytes that make up a character is not fixed; it depends on the encoding that's being used to convert between bytes and characters. (Read this.) You're probably thinking of the way Java strings are stored; they use the UTF-16 encoding, which uses two bytes per character (usually--see the article), but that has nothing to do with how the text is stored on disk.
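
To make the point about encodings concrete, here is a minimal sketch (the class and method names are made up for illustration) that counts how many bytes the same character occupies in two standard encodings:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BytesPerChar {
    // Number of bytes the given text occupies once encoded in the given charset.
    static int byteCount(String text, Charset cs) {
        return text.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String no = "\u8FB2"; // the character 農
        System.out.println("UTF-8:    " + byteCount(no, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE: " + byteCount(no, StandardCharsets.UTF_16BE));
        System.out.println("'a' in UTF-8: " + byteCount("a", StandardCharsets.UTF_8));
    }
}
```

The character takes three bytes in UTF-8 and two in UTF-16BE, while a plain ASCII letter takes one byte in UTF-8, so "2 bytes for each character" only holds for some encodings.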

It used to be the case that Macs favored the carriage-return ('\r' or '\u000D') for line separators, but as of version 10 (OSX), Mac OS is based on Unix, which prefers the linefeed ('\n' or '\u000A').

However, it doesn't matter that much what the operating system thinks a line separator should be, because it's the application (in this case, your Java program) that has to read and write the files. Virtually every modern application will accept any of the three major styles of line separator ("\n", "\r", or "\r\n"). BufferedReader is no exception; you can use a different separator at the end of every line, and BufferedReader will handle them correctly.
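
A small sketch of that behavior, with a hypothetical helper class, feeding BufferedReader a string that mixes all three separator styles:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class MixedSeparators {
    // Splits text into lines using BufferedReader.readLine().
    static List<String> readLines(String text) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // "\n", "\r" and "\r\n" all end a line; "\r\n" counts as one separator.
        System.out.println(readLines("farm\nfan\rfail\r\nfield"));
    }
}
```

The four words come back as four separate lines, with no empty line produced by the "\r\n" pair.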

There is, unfortunately, one very important exception: Windows Notepad. It refuses to recognize anything except the DOS/Windows-style carriage-return+linefeed ("\r\n") line separator. If it encounters a linefeed or carriage-return by itself, Notepad renders it as a rectangle instead of a line break. That's probably not the cause of your problem, since you're using a Mac, but it's useful to know about (not to mention infuriating).

Now that all that's out of the way, we'll need some more info before we can help you. Like, how exactly are you reading and writing the files? What's the exact code you use to construct the Reader and Writer? How do you write the line separators? Do you use BufferedWriter#newLine(), or do you explicitly write a "\r"? And how do you view the contents of the files?
Kevin Tysen
Ranch Hand

Joined: Oct 12, 2005
Posts: 255
Thank you for the reference to the explanation of text encoding. It was very helpful.
My program reads in text from a text file.
This is what part of the text looks like:

農場
farm
fan
fail
field

望む
hope
hold
hour
hop

Then, my program parses the text and makes a few more text files using the text.
This is how the program reads in the text:

BufferedReader br = new BufferedReader(new FileReader(block + ".txt"));
String line = "not null";
while (line != null){
    line = br.readLine();
    if (line != null){
        // Puts the line in a String[] and does some other processing
    }
}

This is how the program writes to one of the output files:

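for (int i = 0; i < osTotal; i++){
    bw.write("<TR BGCOLOR=");
    bw.write(34); // 34 is the double-quote character
    // Assumption: the condition was swallowed when the post was rendered as HTML;
    // the output shows rows alternating between the two colors.
    if (i % 2 == 0){
        bw.write("#FFFFFF");
    }else{
        bw.write("#EEEEEE");
    }
    bw.write(34);
    bw.write("><TD>");
    bw.newLine();
    bw.write(osAra[i].befUnd);
    // osAra[i].befUnd is the Japanese words 農場 and 望む
    bw.newLine();
    bw.write("<FONT COLOR=");
    bw.write(34);
    bw.write("#00FF00");
    bw.write(34);
    bw.write(">");
    bw.write(osAra[i].rightAns);
    // osAra[i].rightAns is the English translation of the Japanese,
    // specifically, farm and hope
    bw.write("</FONT>");
    bw.newLine();
    bw.newLine();
    bw.write("</TD></TR>");
    bw.newLine();
}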

This is what part of the output file looks like. The first four lines are what happened when the program processed the words 農場 and farm. The last five lines are how the program processed 望む and hope, which is the same way the rest of the text was processed, and which is the way I expected the program to work.

<TR ><TD>
・<FONT COLOR="#00FF00">farm</FONT>

</TD></TR>
<TR ><TD>
望む
<FONT COLOR="#00FF00">hope</FONT>

</TD></TR>

As you can see, the 農 (no) of 農場 (nojo) is rendered unreadable, and the elements are switched around.
Instead of
nojo [linebreak] <FONT COLOR="#00FF00">farm</FONT> [linebreak] [linebreak]
I have
[unreadable] <FONT COLOR="#00FF00">farm</FONT> [linebreak] jo [linebreak]
Kevin Tysen
Ranch Hand

Joined: Oct 12, 2005
Posts: 255
Sorry, the HTML in the message that I sent seems to have been interpreted literally by the web browser, so I will send the last part of the message again.

This is how the program writes to one of the output files:

This is what part of the output file looks like. The first four lines are what happened when the program processed the words 農場 and farm. The last five lines are how the program processed 望む and hope, which is the same way the rest of the text was processed, and which is the way I expected the program to work.

<TR ><TD>
・<FONT COLOR="#00FF00">farm</FONT>

</TD></TR>
<TR ><TD>
望む
<FONT COLOR="#00FF00">hope</FONT>

</TD></TR>

As you can see, the 農 (no) of 農場 (nojo) is rendered unreadable, and the elements are switched around.
Instead of
nojo [linebreak] <FONT COLOR="#00FF00">farm</FONT> [linebreak] [linebreak]
I have
[unreadable] <FONT COLOR="#00FF00">farm</FONT> [linebreak] jo [linebreak]
Kevin Tysen
Ranch Hand

Joined: Oct 12, 2005
Posts: 255
OK, here is the last part of the message, one more time. I'll change the less-than and greater-than signs to brackets so that the computer won't interpret them as HTML.

[TR BGCOLOR="#EEEEEE"][TD]
・[FONT COLOR="#00FF00"]farm[/FONT]

[/TD][/TR]
[TR BGCOLOR="#FFFFFF"][TD]
望む
[FONT COLOR="#00FF00"]hope[/FONT]

[/TD][/TR]

As you can see, the 農 (no) of 農場 (nojo) is rendered unreadable, and the elements are switched around.
Instead of
nojo [linebreak] [FONT COLOR="#00FF00"]farm[/FONT] [linebreak] [linebreak]
I have
[unreadable] [FONT COLOR="#00FF00"]farm[/FONT] [linebreak] jo [linebreak]
Ernest Friedman-Hill
author and iconoclast
Marshal

Joined: Jul 08, 2003
Posts: 24183
    

There's a checkbox below the text box where you enter your posts called "Disable HTML in the message." Very handy if you want to show HTML code in your message!


Alan Moore
Ranch Hand

Joined: May 06, 2004
Posts: 262
FileReader decodes the contents of the file using the system default encoding. What encoding that is depends on the operating system and the locale of whatever computer the code is running on. That means it will be different on different machines, so you shouldn't use the default if you're dealing with anything other than pure ASCII.
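
To see which default a particular machine would hand to FileReader, a quick check along these lines may help (the class name is hypothetical; on JDK 18 and later the default is normally UTF-8 regardless of platform, while older JDKs follow the OS locale):

```java
import java.nio.charset.Charset;

class ShowDefaultCharset {
    // Name of the charset that the JVM uses when none is specified.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + defaultCharsetName());
    }
}
```

Running this on each machine involved shows quickly whether they agree on the default.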

Your file contains both ASCII characters and Japanese ideograms, so the encoding has to be one that supports both character sets: the most likely candidates are Shift_JIS and UTF-8. I would try UTF-8 first: create the BufferedReader from an InputStreamReader and specify "UTF-8". And when you create the BufferedWriter, use an OutputStreamWriter and specify "UTF-8" again. If that doesn't work, try "Shift_JIS" for the Reader (but leave the Writer set to UTF-8).
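
A minimal sketch of that suggestion, assuming the input really is UTF-8. The class name, method name and file arguments are placeholders for illustration, not the code from the original program:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;

class CopyWithEncoding {
    // Copies a text file line by line, decoding with inCharset and always writing UTF-8.
    static void copy(String inName, String inCharset, String outName) {
        try (BufferedReader br = new BufferedReader(
                 new InputStreamReader(new FileInputStream(inName), inCharset));
             BufferedWriter bw = new BufferedWriter(
                 new OutputStreamWriter(new FileOutputStream(outName), "UTF-8"))) {
            String line;
            while ((line = br.readLine()) != null) {
                bw.write(line);
                bw.newLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: java CopyWithEncoding <input> <output>");
            return;
        }
        // Swap "UTF-8" for "Shift_JIS" here if the input turns out to be Shift_JIS.
        copy(args[0], "UTF-8", args[1]);
    }
}
```

Only the Reader's charset name changes between attempts; the Writer stays on UTF-8 so the output is the same whichever guess turns out to be right.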

This is just my best guess, based on experience; I can't get enough out of your posts to be more definite. If you still have problems, remember to check "Disable HTML" and "Disable smilies" when you post again (in fact, do what I did and set them to be disabled by default in your "My Profile" page).
Carey Evans
Ranch Hand

Joined: May 27, 2008
Posts: 225

If you can produce very short examples of the correct and incorrect text when run through hexdump -C from the terminal, it should give us (and you) a better idea of what is wrong with the encoding of the generated data. Can you give this a try?
Kevin Tysen
Ranch Hand

Joined: Oct 12, 2005
Posts: 255
Thanks for the advice. Actually, I went to the library yesterday, and I found a book that says that the default character set for UNIX based (MacOSX is UNIX, I believe) computers is EUC-JP, so I will try that, too.

About hexdump, is this how I should use it? For example, if the text I want to display is 農場 then on the command line I should type in

hexdump -C -e "農場"

or rather, not type it all in, but use copy and paste for the 農場 text?
Carey Evans
Ranch Hand

Joined: May 27, 2008
Posts: 225

You can do:

But that just shows the text in the encoding Terminal is using (UTF-8 in this case), which should be the same as what ‘locale’ prints.

If you create a text file containing the expected text, and one containing the output from Java, then you can compare the output of hexdump -C on each file and work out what encoding Java and your editor are using.
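
As a reference point for that comparison, a sketch like the following prints the bytes of 農場 in a few candidate encodings, so they can be matched against the hexdump output. The class name is made up, and Shift_JIS and EUC-JP are only printed if the JVM at hand provides them:

```java
import java.nio.charset.Charset;

class EncodingHex {
    // Encodes text in the named charset and returns the bytes as space-separated hex.
    static String hex(String text, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nojo = "\u8FB2\u5834"; // the word 農場
        for (String name : new String[] {"UTF-8", "Shift_JIS", "EUC-JP"}) {
            if (Charset.isSupported(name)) {
                System.out.println(name + ": " + hex(nojo, name));
            } else {
                System.out.println(name + ": not available in this JVM");
            }
        }
    }
}
```

In UTF-8 the word comes out as e8 be b2 e5 a0 b4; if the input file shows a different byte sequence for it, the file is in some other encoding.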
Kevin Tysen
Ranch Hand

Joined: Oct 12, 2005
Posts: 255
I tried to copy and paste 農場 in the command line window, but I think the command line does not accept anything more than 8 bits. When I pasted 農場 and the other character, I got _ and ? respectively.
I think I'll try making a text file and do hexdump -C on it. Should I do it like this?

% echo myfile.txt hexdump -C
Carey Evans
Ranch Hand

Joined: May 27, 2008
Posts: 225

It's just hexdump -C myfile.txt. You can read the manual page by typing man hexdump, or on the web.
Gabriel Vince
Greenhorn

Joined: Feb 05, 2009
Posts: 24
I believe Alan is right. If you use only FileReader, you may encounter problems with characters other than standard ASCII (e.g. we have the same problem with Central European files). I use OutputStreamWriter or InputStreamReader to specify the encoding.

There is another approach: you may use NIO and java.nio.charset.CharsetDecoder / CharsetEncoder, which convert between ByteBuffer and CharBuffer in any supported charset.
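
A rough sketch of the NIO route, with hypothetical class and method names. The decoder is configured to report malformed input instead of silently substituting a replacement character, which makes a wrong encoding guess fail loudly:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class NioTranscode {
    // Decodes bytes strictly: malformed or unmappable input raises an exception.
    static String decodeStrict(byte[] bytes, Charset cs) {
        try {
            CharBuffer chars = cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return chars.toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("input is not valid " + cs.name(), e);
        }
    }

    // Encodes text with a CharsetEncoder and copies the result out of the ByteBuffer.
    static byte[] encode(String text, Charset cs) {
        try {
            ByteBuffer buf = cs.newEncoder().encode(CharBuffer.wrap(text));
            byte[] out = new byte[buf.remaining()];
            buf.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("text cannot be encoded as " + cs.name(), e);
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = encode("\u8FB2\u5834", StandardCharsets.UTF_8);
        System.out.println(utf8.length + " bytes, decoded back: "
                + decodeStrict(utf8, StandardCharsets.UTF_8));
    }
}
```

For simply reading and writing files, the InputStreamReader and OutputStreamWriter approach is shorter; the explicit decoder is mainly useful when a bad guess should be an error rather than garbled text.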
Kevin Tysen
Ranch Hand

Joined: Oct 12, 2005
Posts: 255
By the way, is there a way to look at the bytes of a file in Windows, too? Do you do
hexdump -C myfile.txt
in Windows, too, or is there some other command?
Carey Evans
Ranch Hand

Joined: May 27, 2008
Posts: 225

There doesn’t seem to be anything that comes with Windows. The GnuWin32 project provides a package for Windows based on GNU CoreUtils, which includes od, and you can use GNU od like hexdump: od -t x1z filename
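
Since Java is already installed on both machines, a few lines of Java can also stand in for hexdump on any platform. This is only an illustrative sketch with a made-up class name, and its output format is simpler than what hexdump -C prints:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;

class HexDump {
    // Formats bytes as rows of up to 16 hex values, each row prefixed with its offset.
    static String dump(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < data.length; i++) {
            if (i % 16 == 0) {
                if (i > 0) {
                    sb.append('\n');
                }
                sb.append(String.format("%08x ", i));
            }
            sb.append(String.format(" %02x", data[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java HexDump <file>");
            return;
        }
        try {
            System.out.println(dump(Files.readAllBytes(Paths.get(args[0]))));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Run it as java HexDump myfile.txt after compiling; it reads the whole file into memory, so it is only suitable for small test files.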
Carey Evans
Ranch Hand

Joined: May 27, 2008
Posts: 225

It wasn't that hard, so:
 