
Byte vs Character streams

 
nk kumar
Greenhorn
Posts: 6
I am trying to understand the difference between byte and character streams, and was reading:

CopyBytes @ http://download.oracle.com/javase/tutorial/essential/io/bytestreams.html
CopyCharacters @ http://download.oracle.com/javase/tutorial/essential/io/charstreams.html

The difference between the two, as stated there, is: "in CopyCharacters, the int variable holds a character value in its last 16 bits; in CopyBytes, the int variable holds a byte value in its last 8 bits."

Would I be right in assuming that a character in the input file (xanadu.txt) that is represented by an integer value greater than 255 (more than 8 bits) will not be written to the output file (outagain.txt) if the FileInputStream class is used, as in CopyBytes?
 
Stephan van Hulst
Bartender
Posts: 5432
It will. The only difference is that the loop in CopyBytes will run twice as many times, because instead of reading a character at a time, it reads two bytes.

A character that can be represented with an integer value lower than 256 will still take 16 bits. The upper 8 bits will simply be zero.

An InputStreamReader has methods that specialize in "reading" streams, translating bytes into characters we can easily read.
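
To make the answer to the original question concrete, here is a minimal sketch of a byte-by-byte copy applied to text that contains a character above 255. It is not the tutorial's CopyBytes class: the class name ByteCopySketch is made up for illustration, an in-memory byte array stands in for xanadu.txt, and UTF-8 is assumed as the file encoding. The point it illustrates is that a byte stream copies every byte unchanged, so a character that needs more than 8 bits still arrives intact; it is simply spread over several loop iterations.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteCopySketch {

    // Copies one byte per iteration, in the style of the tutorial's CopyBytes loop,
    // and returns how many bytes were copied.
    static int copyBytes(InputStream in, OutputStream out) throws IOException {
        int count = 0;
        int c;
        while ((c = in.read()) != -1) {
            out.write(c);
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the "file" is UTF-8 text. 'A' is U+0041; the euro sign is U+20AC (> 255).
        byte[] original = "A\u20AC".getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int copied = copyBytes(new ByteArrayInputStream(original), out);

        // UTF-8 uses 1 byte for 'A' and 3 bytes for the euro sign.
        System.out.println("bytes copied = " + copied);   // bytes copied = 4
        System.out.println("identical = " + Arrays.equals(original, out.toByteArray())); // identical = true
    }
}
```

With a different file encoding the byte count would change, but the copy would still be byte-for-byte identical.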
 
nk kumar
Greenhorn
Posts: 6
Stephan van Hulst wrote: It will. The only difference is that the loop in CopyBytes will run twice as many times, because instead of reading a character at a time, it reads two bytes.



Thanks Stephan, but I am unable to see why the while loop in CopyBytes would run twice as many times. For example, if the input file contains 3 characters, the while loops in both CopyBytes and CopyCharacters would run 4 times (3 times for the 3 characters and once for end of file).
 
Stephan van Hulst
Bartender
Posts: 5432
If the input file contains three characters, it contains six bytes. Remember that every character consists of two bytes, regardless of whether the character value is lower than 256.

So for CopyBytes, the loop would run six times, not three.
 
nk kumar
Greenhorn
Posts: 6
Stephan, it looks like you are partially correct, or maybe I am missing something here. To validate your point, I placed a print statement in the while loop to see how many times the loop executed and, to my surprise, found some interesting behavior:

With the byte stream:

1. Placed only 3 characters (ABC) in the input file and saved it with UTF-8 encoding. Console output was 239, 187, 191, 65, 66, 67 (based on your answer, I was expecting 0, 65, 0, 66, 0, 67).

2. Saved the same input file with "Unicode" encoding. Output was 255, 254, 65, 0, 66, 0, 67, 0.

3. With "Unicode Big Endian" encoding, output was 254, 255, 0, 65, 0, 66, 0, 67.

4. With ANSI encoding, output was 65, 66, 67.


With the character stream:

1. UTF-8: output was 239, 187, 191, 65, 66, 67.

2. Unicode: output was 255, 254, 65, 0, 66, 0, 67, 0.

3. Big Endian: output was 254, 255, 0, 65, 0, 66, 0, 67.

4. ANSI: output was 65, 66, 67.
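
A note on the numbers above: the leading 239, 187, 191 and 255, 254 / 254, 255 values are byte order marks (BOMs) that the editor wrote at the start of the file. The character stream most likely printed the same values because its default charset was a one-byte encoding, so each byte became exactly one char. The sketch below is only an illustration of that reading, not the code used in the experiment: the class name BomSketch is invented, byte arrays stand in for input.txt, and ISO-8859-1 stands in for the platform default charset (it agrees with Cp1252 for the byte values involved here).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class BomSketch {

    // Returns the int value of every char the Reader yields, i.e. what a print
    // statement inside a CopyCharacters-style loop would show.
    static List<Integer> readChars(byte[] data, Charset charset) throws IOException {
        List<Integer> values = new ArrayList<>();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                values.add(c);
            }
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        // Bytes reported for "ABC" saved as UTF-8: a 3-byte BOM, then one byte per letter.
        byte[] utf8WithBom = {(byte) 239, (byte) 187, (byte) 191, 65, 66, 67};
        // Bytes reported for "ABC" saved as "Unicode": a little-endian BOM, then two bytes per letter.
        byte[] utf16LeWithBom = {(byte) 255, (byte) 254, 65, 0, 66, 0, 67, 0};

        // One-byte charset (assumed stand-in for the default): every byte becomes one char.
        System.out.println(readChars(utf8WithBom, StandardCharsets.ISO_8859_1)); // [239, 187, 191, 65, 66, 67]

        // Java's UTF-8 decoder does not strip the BOM; it decodes it to the single char U+FEFF.
        System.out.println(readChars(utf8WithBom, StandardCharsets.UTF_8));      // [65279, 65, 66, 67]

        // Java's UTF-16 decoder uses the BOM to pick the byte order and consumes it.
        System.out.println(readChars(utf16LeWithBom, StandardCharsets.UTF_16));  // [65, 66, 67]
    }
}
```

So the Reader only produces different numbers from the byte stream once it is told (or defaults to) a charset in which one character spans more than one byte.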
 
Campbell Ritchie
Sheriff
Posts: 48453
I think this question is too difficult for "beginning" so I shall move it.
 
nk kumar
Greenhorn
Posts: 6
Gurus, please weigh in...
 
Stephan van Hulst
Bartender
Posts: 5432
Could you post the code you're using?
 
nk kumar
Greenhorn
Posts: 6
Make sure input.txt contains only the text ABC.




 
Stephan van Hulst
Bartender
Posts: 5432
Okay, I took a closer look at it, and with simple text files both will read the same number of bytes per character, namely one.

Take a look at files with strange characters though, for instance Chinese or Japanese text. Then the results you get from printing the read integer will vary.

The short of it is that you should use a Reader when you want to read text files, and an InputStream when you want to read binary data. The reader will translate the bytes to the correct characters, sometimes reading more than one byte at a time, given the proper character set. It should only make a real difference for strange characters though.
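
As a rough illustration of the Reader "sometimes reading more than one byte at a time", the sketch below reads the same data twice: once as raw bytes and once through an InputStreamReader. The class name ReaderVsStreamSketch, the two sample Japanese characters, and the choice of UTF-8 are all assumptions made for the example, and a byte array stands in for a real file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ReaderVsStreamSketch {

    // One loop iteration per byte, as with an InputStream.
    static List<Integer> readBytes(byte[] data) throws IOException {
        List<Integer> values = new ArrayList<>();
        try (InputStream in = new ByteArrayInputStream(data)) {
            int b;
            while ((b = in.read()) != -1) {
                values.add(b);
            }
        }
        return values;
    }

    // One loop iteration per decoded char, as with a Reader.
    static List<Integer> readChars(byte[] data, Charset charset) throws IOException {
        List<Integer> values = new ArrayList<>();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                values.add(c);
            }
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        // Two Japanese characters (U+65E5, U+672C), written as escapes so the
        // source file's own encoding does not matter. Each takes 3 bytes in UTF-8.
        byte[] data = "\u65E5\u672C".getBytes(StandardCharsets.UTF_8);

        System.out.println(readBytes(data));                          // [230, 151, 165, 230, 156, 172]
        System.out.println(readChars(data, StandardCharsets.UTF_8));  // [26085, 26412]
    }
}
```

The byte loop runs six times and the character loop only twice, and only the second gives values that correspond to the actual characters.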

I will see if I can come up with some code and an input file where you can actually see the difference. It's hard to determine from the console, so I'll make an example with a text field, because you need a proper font to be able to see the characters.
 
Stephan van Hulst
Bartender
Posts: 5432
Okay, take a good look. Let me know whether you understand the difference.


 
Kurt Van Etten
Ranch Hand
Posts: 98
I see that Stephan has already posted an example while I was off playing around with my code, but let me go ahead and post another example here as well.

My understanding of this is that an InputStream reads data byte by byte. A Reader adds a layer on top of that which interprets the bytes as characters, and the result you get will depend on the character set being used. On my own computer, the default character set uses one byte per character, so the number of characters read from a file is going to equal the number of bytes. However, if a 16-bit Unicode character set is being used, each character will require two bytes, so the number of characters will be half the number of bytes.

Here's a little program which reads a one-line file consisting of just the word "Test". First it gets read using a FileInputStream, then using an InputStreamReader with the default character set, and finally using an InputStreamReader with the character set forced to UTF-16.



Here is the output:

Reading file using a FileInputStream

int = 84 char = T
int = 101 char = e
int = 115 char = s
int = 116 char = t

Reading file using a Reader with default Charset
Encoding = Cp1252

int = 84 char = T
int = 101 char = e
int = 115 char = s
int = 116 char = t

Reading file using a Reader with UTF-16 Charset
Encoding = UTF-16

int = 21605 char = ?
int = 29556 char = ?


The final loop only lists half as many characters because it is interpreting pairs of bytes as 16-bit Unicode characters.
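
For readers who want to reproduce the last result, here is a minimal sketch along the same lines. It is not the program that produced the output above: the class name Utf16MisreadSketch is invented, the four bytes of "Test" are held in memory instead of a file, and only the forced UTF-16 case is shown. It also shows where 21605 and 29556 come from: with no byte order mark, Java's UTF-16 decoder assumes big-endian, so 'T' (0x54) and 'e' (0x65) combine into 0x5465 = 21605.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class Utf16MisreadSketch {

    // Decodes the bytes with the given charset and returns each char's int value.
    static List<Integer> readChars(byte[] data, Charset charset) throws IOException {
        List<Integer> values = new ArrayList<>();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                values.add(c);
            }
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        // The file contents: four single-byte characters, no BOM.
        byte[] data = "Test".getBytes(StandardCharsets.US_ASCII); // 84, 101, 115, 116

        // Forcing UTF-16 pairs the bytes up: (84, 101) and (115, 116).
        List<Integer> chars = readChars(data, StandardCharsets.UTF_16);
        System.out.println(chars); // [21605, 29556]

        // Same arithmetic by hand: high byte * 256 + low byte.
        System.out.println(84 * 256 + 101);  // 21605
        System.out.println(115 * 256 + 116); // 29556
    }
}
```

The "?" in the console output above is just the console failing to display those two characters; the int values are the reliable part.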

 
nk kumar
Greenhorn
Posts: 6
Thanks, Stephan and Kurt, for your amazing code samples. Now I can see a clear difference between the two.

 
O. Ziggy
Ranch Hand
Posts: 430
Excellent examples!
 