Byte vs Character streams

nk kumar
Greenhorn

Joined: Sep 23, 2010
Posts: 6
I am trying to understand the difference between byte and character streams, and was reading:

CopyBytes @ http://download.oracle.com/javase/tutorial/essential/io/bytestreams.html
CopyCharacters @ http://download.oracle.com/javase/tutorial/essential/io/charstreams.html

The difference mentioned between the two was: "in CopyCharacters, the int variable holds a character value in its last 16 bits; in CopyBytes, the int variable holds a byte value in its last 8 bits."

Would I be right in assuming that a character in the input file (xanadu.txt) whose integer value is greater than 255 (needing more than 8 bits) will not be written to the output file (outagain.txt) if the FileInputStream-based CopyBytes class is used?
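
A minimal sketch that may help with this question. It assumes the text is stored as UTF-8 and uses an in-memory byte array in place of xanadu.txt; the class and method names are hypothetical. The loop has the same shape as the tutorial's CopyBytes. A character above 255 is simply stored as several bytes, and a byte-for-byte copy carries all of them across.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ByteCopySketch {

    // Copies the source one byte per read() call, like the CopyBytes loop.
    // The in-memory streams are stand-ins for FileInputStream/FileOutputStream.
    static byte[] copyBytes(byte[] source) {
        try (InputStream in = new ByteArrayInputStream(source);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            int b;
            while ((b = in.read()) != -1) {
                out.write(b);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 'A' followed by the euro sign U+20AC, whose value (8364) is above 255.
        String text = "A\u20AC";
        byte[] original = text.getBytes(StandardCharsets.UTF_8); // assumed encoding
        byte[] copy = copyBytes(original);

        System.out.println("bytes copied: " + copy.length);
        System.out.println("round trip ok: "
                + text.equals(new String(copy, StandardCharsets.UTF_8)));
    }
}
```

The byte stream never interprets what it copies, so nothing is dropped; the loop just runs once per byte rather than once per character.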
Stephan van Hulst
Bartender

Joined: Sep 20, 2010
Posts: 3598
    

It will, the only difference is that the loop in CopyBytes will run twice as much, because instead of reading a character at a time, it reads two bytes.

A character that can be represented with an integer value lower than 256 will still take 16 bits. 8 bits will just be zeroed out.

An InputStreamReader has methods that specialize in "reading" streams, translating bytes into characters we can easily read.
nk kumar
Greenhorn

Joined: Sep 23, 2010
Posts: 6
Stephan van Hulst wrote: It will, the only difference is that the loop in CopyBytes will run twice as much, because instead of reading a character at a time, it reads two bytes.



Thanks Stephan, but I am unable to see why the while loop in CopyBytes would run twice as many times. For example, if the input file contains 3 characters, the while loops in both CopyBytes and CopyCharacters would run 4 times (3 times for the 3 characters and once for end of file).
Stephan van Hulst
Bartender

Joined: Sep 20, 2010
Posts: 3598
    

If the input file contains three characters, it contains six bytes. Remember that every character consists of two bytes, regardless of whether the character value is lower than 256.

So for CopyBytes, the loop would run six times, not three.
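
Whether three characters occupy six bytes on disk depends on the encoding the file was saved in. A Java char is 16 bits in memory, but a file may use one, two, or more bytes per character. A small sketch (class and method names are hypothetical) that counts the encoded bytes of "ABC" under a few standard charsets:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedLengthSketch {

    // Number of bytes the text occupies once encoded with the given charset.
    static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "ABC";
        System.out.println("US-ASCII: " + byteCount(text, StandardCharsets.US_ASCII)); // 3
        System.out.println("UTF-8:    " + byteCount(text, StandardCharsets.UTF_8));    // 3
        System.out.println("UTF-16LE: " + byteCount(text, StandardCharsets.UTF_16LE)); // 6
        System.out.println("UTF-16BE: " + byteCount(text, StandardCharsets.UTF_16BE)); // 6
        // Java's "UTF-16" encoder prepends a 2-byte byte order mark.
        System.out.println("UTF-16:   " + byteCount(text, StandardCharsets.UTF_16));   // 8
    }
}
```

So a CopyBytes-style loop would run six times for "ABC" only if the file was saved in a two-byte encoding without a byte order mark.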
nk kumar
Greenhorn

Joined: Sep 23, 2010
Posts: 6
Stephan, it looks like you are partially correct, or maybe I am missing something here. To validate your point, I placed a print statement in the while loop to see how many times the loop executed and, to my surprise, found some interesting behavior:

With the byte stream:

1. Placed only 3 characters (ABC) in the input file and saved it with UTF-8 encoding. Console output was 239, 187, 191, 65, 66, 67 (I was expecting your answer: 0, 65, 0, 66, 0, 67).

2. Saved the same input file with Unicode encoding. Output was 255, 254, 65, 0, 66, 0, 67, 0.

3. With Unicode Big Endian encoding, output was 254, 255, 0, 65, 0, 66, 0, 67.

4. With ANSI encoding, output was 65, 66, 67.


With the character stream:

1. UTF-8: output was 239, 187, 191, 65, 66, 67.

2. Unicode: output was 255, 254, 65, 0, 66, 0, 67, 0.

3. Unicode Big Endian: output was 254, 255, 0, 65, 0, 66, 0, 67.

4. ANSI: output was 65, 66, 67.
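
The leading 239, 187, 191 is the UTF-8 byte order mark (BOM), and 255, 254 and 254, 255 are the little-endian and big-endian UTF-16 BOMs that some editors write at the start of a file. One plausible reason the character stream printed the same numbers is that FileReader decodes with the platform default charset, and a single-byte default maps every byte to exactly one char. The sketch below assumes ISO-8859-1 as a stand-in for such a default (the poster's actual default is not stated) and compares it with decoding the same bytes as UTF-8. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class BomDecodeSketch {

    // Bytes reported in the post for "ABC" saved as UTF-8 (BOM included).
    static final byte[] UTF8_WITH_BOM =
            {(byte) 239, (byte) 187, (byte) 191, 65, 66, 67};

    // Byte stream: every read() returns one raw byte as 0..255.
    static List<Integer> readBytes(byte[] data) {
        List<Integer> values = new ArrayList<>();
        try (InputStream in = new ByteArrayInputStream(data)) {
            int b;
            while ((b = in.read()) != -1) {
                values.add(b);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return values;
    }

    // Character stream: every read() returns one decoded char as 0..65535.
    static List<Integer> readChars(byte[] data, Charset charset) {
        List<Integer> values = new ArrayList<>();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                values.add(c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println("bytes:      " + readBytes(UTF8_WITH_BOM));
        // Assumed single-byte charset: one char per byte, same numbers as above.
        System.out.println("ISO-8859-1: " + readChars(UTF8_WITH_BOM, StandardCharsets.ISO_8859_1));
        // UTF-8 decoding folds the three BOM bytes into one char, U+FEFF (65279).
        System.out.println("UTF-8:      " + readChars(UTF8_WITH_BOM, StandardCharsets.UTF_8));
    }
}
```

With a single-byte charset the Reader loop runs once per byte, so the two experiments look identical; the difference shows only when the Reader is given the charset the file was actually saved in.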
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38007
    
I think this question is too difficult for "beginning" so I shall move it.
nk kumar
Greenhorn

Joined: Sep 23, 2010
Posts: 6
Gurus, please weigh in...
Stephan van Hulst
Bartender

Joined: Sep 20, 2010
Posts: 3598
    

Could you post the code you're using?
nk kumar
Greenhorn

Joined: Sep 23, 2010
Posts: 6
Make sure input.txt contains only the text ABC.




Stephan van Hulst
Bartender

Joined: Sep 20, 2010
Posts: 3598
    

Okay, I took a closer look at it, and with simple text files both will read the same number of bytes per read() call, namely one.

Take a look at files with strange characters though, for instance Chinese or Japanese text. Then the results you get from printing the integer that was read will differ between the two.

The short of it is that you should use a Reader when you want to read text files, and an InputStream when you want to read binary data. The reader will translate the bytes to the correct characters, sometimes reading more than one byte at a time, given the proper character set. It should only make a real difference for strange characters though.

I will see if I can come up with some code and an input file where you can actually see the difference. It's hard to determine from the console, so I'll make an example with a text field, because you need a proper font to be able to see the characters.
Stephan van Hulst
Bartender

Joined: Sep 20, 2010
Posts: 3598
    

Okay, take a good look. Let me know whether you understand the difference.
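
A minimal sketch of such a comparison, assuming UTF-8 encoded Japanese text held in memory. The class and method names are hypothetical, and it counts read() calls on the console rather than displaying characters in a text field.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MultiByteSketch {

    // How many times read() returns data when reading raw bytes.
    static int countByteReads(byte[] data) {
        int count = 0;
        try (InputStream in = new ByteArrayInputStream(data)) {
            while (in.read() != -1) {
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    // How many times read() returns data when decoding with the given charset.
    static int countCharReads(byte[] data, Charset charset) {
        int count = 0;
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            while (reader.read() != -1) {
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        // Two kanji, U+65E5 and U+672C; each needs three bytes in UTF-8.
        String text = "\u65E5\u672C";
        byte[] data = text.getBytes(StandardCharsets.UTF_8); // assumed file encoding

        System.out.println("InputStream reads: " + countByteReads(data));                         // 6
        System.out.println("Reader reads:      " + countCharReads(data, StandardCharsets.UTF_8)); // 2
    }
}
```

The Reader consumes three bytes per read() call here and hands back one character, while the InputStream hands back the raw bytes one at a time.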


Kurt Van Etten
Ranch Hand

Joined: Sep 07, 2010
Posts: 98
I see that Stephan has already posted an example while I was off playing around with my code, but let me go ahead and post another example here as well.

My understanding of this is that an InputStream reads data byte by byte. A Reader adds a layer on top of that which interprets the bytes as characters, and the result you get will depend on the character set being used. On my own computer, the default character set uses one byte per character, so the number of characters read from a file is going to equal the number of bytes. However, if a 16-bit Unicode character set is being used, each character will require two bytes, so the number of characters will be half the number of bytes.

Here's a little program which reads a one-line file consisting of just the word "Test". First it gets read using a FileInputStream, then using an InputStreamReader with the default character set, and finally using an InputStreamReader with the character set forced to UTF-16.
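
A minimal sketch of a program along these lines (not the original listing). It assumes the four bytes of "Test" are held in memory instead of being read from a file through FileInputStream, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ReadThreeWaysSketch {

    // Stand-in for the file contents: the word "Test", one byte per letter.
    static final byte[] FILE_BYTES = "Test".getBytes(StandardCharsets.US_ASCII);

    // Reads raw bytes, as a FileInputStream would.
    static List<Integer> readWithStream(byte[] data) {
        List<Integer> values = new ArrayList<>();
        try (InputStream in = new ByteArrayInputStream(data)) {
            int b;
            while ((b = in.read()) != -1) {
                values.add(b);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return values;
    }

    // Reads decoded characters through an InputStreamReader.
    static List<Integer> readWithReader(byte[] data, Charset charset) {
        List<Integer> values = new ArrayList<>();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                values.add(c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return values;
    }

    static void print(List<Integer> values) {
        for (int v : values) {
            System.out.println("int = " + v + " char = " + (char) v);
        }
        System.out.println();
    }

    public static void main(String[] args) {
        System.out.println("Reading using an InputStream");
        print(readWithStream(FILE_BYTES));

        Charset defaultCharset = Charset.defaultCharset(); // platform dependent
        System.out.println("Reading using a Reader with default Charset");
        System.out.println("Encoding = " + defaultCharset.name());
        print(readWithReader(FILE_BYTES, defaultCharset));

        // Without a BOM, Java's UTF-16 decoder treats the bytes as big-endian pairs.
        System.out.println("Reading using a Reader with UTF-16 Charset");
        System.out.println("Encoding = " + StandardCharsets.UTF_16.name());
        print(readWithReader(FILE_BYTES, StandardCharsets.UTF_16));
    }
}
```

The UTF-16 values follow from pairing the bytes: 84 * 256 + 101 = 21605 and 115 * 256 + 116 = 29556.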



Here is the output:

Reading file using a FileInputStream

int = 84 char = T
int = 101 char = e
int = 115 char = s
int = 116 char = t

Reading file using a Reader with default Charset
Encoding = Cp1252

int = 84 char = T
int = 101 char = e
int = 115 char = s
int = 116 char = t

Reading file using a Reader with UTF-16 Charset
Encoding = UTF-16

int = 21605 char = ?
int = 29556 char = ?


The final loop only lists half as many characters because it is interpreting pairs of bytes as 16-bit Unicode characters.

nk kumar
Greenhorn

Joined: Sep 23, 2010
Posts: 6
Thanks, Stephan and Kurt, for your amazing code samples. Now I am able to see a clear difference between the two.

O. Ziggy
Ranch Hand

Joined: Oct 02, 2005
Posts: 430

Excellent examples!
 