1 Character seems to be written as one byte

Sev Zaslavsky
Greenhorn

Joined: Nov 19, 2008
Posts: 7
All along it has been hammered into my head that in Java, characters are Unicode and occupy 2 bytes, but FileWriter does not seem to fully agree.

So I tried something basic: I wrote the little program below to write a character and read it back. Based on the output of the "dir" command in Vista, it seems to be writing one byte, not the two I expected. I even tried PrintWriter instead and got the same result. Also, any character beyond \u007F seems to be written and read back as ASCII 63.

Can anyone explain what's going on here?

import java.io.*;

class Writer2 {
    public static void main(String[] args) {
        char[] in = new char[50]; // to store input
        int size = 0;
        try {
            File file = new File("fileWrite2.txt");
            FileWriter fw = new FileWriter(file);
            fw.write('\u0100');
            fw.flush();
            fw.close();
            FileReader fr = new FileReader(file);
            size = fr.read(in);
            System.out.print(size + " "); // how many chars were read
            for (char c : in) { // print the whole array, including unused slots
                System.out.println(c + "<->" + Integer.toString(c));
            }
            fr.close(); // again, always close
        } catch (IOException e) {
            e.printStackTrace(); // report I/O errors instead of swallowing them
        }
    }
}
Satish Chilukuri
Ranch Hand

Joined: Jun 23, 2005
Posts: 266
It seems FileWriter doesn't use a Unicode encoding by default. You can check which encoding is in use by calling getEncoding() on the FileWriter instance (for example, fw.getEncoding()). Try using OutputStreamWriter and specifying the encoding explicitly:

OutputStreamWriter fw = new OutputStreamWriter(new FileOutputStream(file),"UTF-8");
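For completeness, here is a minimal sketch of the full round trip with an explicit encoding on both the write and the read side. The class name Utf8RoundTrip and its two helper methods are made up for illustration; the file name is the one from the original program. With UTF-8, the character \u0100 should occupy two bytes on disk and come back unchanged.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative, not from the original post.
class Utf8RoundTrip {
    // Writes one char as UTF-8 and returns the resulting file size in bytes.
    static long writeUtf8(File file, char c) throws IOException {
        try (Writer w = new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8)) {
            w.write(c);
        } // try-with-resources flushes and closes the writer
        return file.length();
    }

    // Reads the first char back, decoding the file as UTF-8.
    static int readUtf8(File file) throws IOException {
        try (Reader r = new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8)) {
            return r.read();
        }
    }

    public static void main(String[] args) throws IOException {
        File file = new File("fileWrite2.txt");
        long bytes = writeUtf8(file, '\u0100');
        int c = readUtf8(file);
        System.out.println(bytes + " byte(s) on disk, read back U+" + Integer.toHexString(c));
    }
}
```

The reader has to be given the same encoding as the writer; a plain FileReader would decode the bytes with the platform default again.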
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42264
To elaborate on what Satish said: if you don't specify the encoding during I/O, the platform default encoding is used. That's CP-1252 (I think) on Windows, MacRoman on OS X, and something else again on other variants of Unix/Linux. Rarely will it be some form of Unicode. A character that the default encoding cannot represent gets replaced by '?' on output, which is the ASCII 63 you are seeing.
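To see the effect without touching a file, the following sketch prints the platform default and encodes \u0100 with a few charsets. The class name EncodingDemo and its encode helper are hypothetical. ISO-8859-1 is used here as a stand-in for a single-byte default such as CP-1252, on the assumption that neither can represent \u0100.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; ISO-8859-1 stands in for a single-byte platform default.
class EncodingDemo {
    // Encodes a single char with the given charset and returns the raw bytes.
    static byte[] encode(char c, Charset cs) {
        return String.valueOf(c).getBytes(cs);
    }

    public static void main(String[] args) {
        char c = '\u0100';
        // What FileWriter and FileReader use when no encoding is given
        System.out.println("Platform default: " + Charset.defaultCharset());
        // Unmappable in a single-byte charset, so it is replaced by '?' (63)
        System.out.println("ISO-8859-1: " + Arrays.toString(encode(c, StandardCharsets.ISO_8859_1)));
        // Two bytes in UTF-8: 0xC4 0x80, shown as signed Java bytes
        System.out.println("UTF-8:      " + Arrays.toString(encode(c, StandardCharsets.UTF_8)));
        // Two bytes in UTF-16, matching the in-memory size of a char
        System.out.println("UTF-16BE:   " + Arrays.toString(encode(c, StandardCharsets.UTF_16BE)));
    }
}
```

So the "2 bytes per character" rule describes a char in memory; how many bytes end up in the file depends entirely on the encoding the writer uses.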

