The difference mentioned between the two was: "in CopyCharacters, the int variable holds a character value in its last 16 bits; in CopyBytes, the int variable holds a byte value in its last 8 bits."
Would I be right in assuming that a character in the input file (xanadu.txt) that is represented by an integer value greater than 255 (more than 8 bits) will not be written to the output file (outagain.txt) if the FileInputStream class is used, as in CopyBytes?
It will, the only difference is that the loop in CopyBytes will run twice as much, because instead of reading a character at a time, it reads two bytes.
A character that can be represented with an integer value lower than 256 will still take 16 bits. The upper 8 bits will simply be zero.
An InputStreamReader has methods that specialize in "reading" streams, translating bytes into characters we can easily read.
Stephan van Hulst wrote: It will, the only difference is that the loop in CopyBytes will run twice as much, because instead of reading a character at a time, it reads two bytes.
Thanks Stephan, but I am unable to see why the while loop in CopyBytes would run twice as much. For example, if the input file contains 3 characters, the while loop in both CopyBytes and CopyCharacters would run 4 times (3 times for the 3 characters and once for end of file).
If the input file contains three characters, it contains six bytes. Remember that every character consists of two bytes, regardless of whether the character value is lower than 256.
So for CopyBytes, the loop would run six times, not three.
Stephan, it looks like you are partially correct, or maybe I am missing something here. To validate your point, I placed a print statement in the while loop to see how many times the loop executed, and to my surprise found some interesting behavior:
With the byte stream:
1. Placed only 3 characters (ABC) in the input file and saved the file with UTF-8 encoding. The output to the console was 239, 187, 191, 65, 66, 67 (going by your answer, I was expecting 0, 65, 0, 66, 0, 67).
2. Saved the same input file with the "Unicode" encoding. The output was 255, 254, 65, 0, 66, 0, 67, 0.
3. With the "Unicode Big Endian" encoding, the output was 254, 255, 0, 65, 0, 66, 0, 67.
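The extra leading values in those three outputs match the byte order mark (BOM, the character U+FEFF) as encoded in UTF-8, UTF-16 little endian, and UTF-16 big endian. The sketch below reproduces the same numbers in memory, assuming the editor wrote a BOM at the start of the file and that "Unicode" meant UTF-16 little endian; the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingBytes {
    // Encodes the text and returns each byte as an unsigned value (0-255),
    // which is the same range InputStream.read() reports.
    static int[] unsignedBytes(String text, Charset charset) {
        byte[] raw = text.getBytes(charset);
        int[] result = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            result[i] = raw[i] & 0xFF; // strip the sign so -17 becomes 239
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumption: the editor prepended a byte order mark (U+FEFF) to "ABC".
        String text = "\uFEFFABC";

        // prints UTF-8:    [239, 187, 191, 65, 66, 67]
        System.out.println("UTF-8:    " + Arrays.toString(unsignedBytes(text, StandardCharsets.UTF_8)));
        // prints UTF-16LE: [255, 254, 65, 0, 66, 0, 67, 0]
        System.out.println("UTF-16LE: " + Arrays.toString(unsignedBytes(text, StandardCharsets.UTF_16LE)));
        // prints UTF-16BE: [254, 255, 0, 65, 0, 66, 0, 67]
        System.out.println("UTF-16BE: " + Arrays.toString(unsignedBytes(text, StandardCharsets.UTF_16BE)));
    }
}
```

So the number of bytes per character is a property of the encoding the file was saved with, not a fixed two bytes: "ABC" takes one byte per character in UTF-8 and two in either UTF-16 variant.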
Okay, I took a closer look at it, and with simple text files both versions will read the same number of bytes per character, namely one.
Take a look at files with strange characters though, for instance Chinese or Japanese text. Then the results you get from printing the read integer will vary.
The short of it is that you should use a Reader when you want to read text files, and an InputStream when you want to read binary data. The reader will translate the bytes to the correct characters, sometimes reading more than one byte at a time, given the proper character set. It should only make a real difference for strange characters though.
I will see if I can come up with some code and an input file where you can actually see the difference. It's hard to determine from the console, so I'll make an example with a text field, because you need a proper font to be able to see the characters.
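The difference described above can also be shown without relying on a font, by counting how many times each read loop runs instead of printing the characters. This is a minimal sketch, assuming the text is encoded as UTF-8 and using an in-memory byte array in place of a file; the class name, method names, and the Japanese sample string are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ReadCounts {
    // Counts successful read() calls on a raw byte stream (one per byte).
    static int countByteReads(byte[] data) throws IOException {
        int count = 0;
        try (InputStream in = new ByteArrayInputStream(data)) {
            while (in.read() != -1) {
                count++;
            }
        }
        return count;
    }

    // Counts successful read() calls on a Reader that decodes the bytes as UTF-8
    // (one per 16-bit char, however many bytes that char needed).
    static int countCharReads(byte[] data) throws IOException {
        int count = 0;
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            while (reader.read() != -1) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        byte[] simple = "ABC".getBytes(StandardCharsets.UTF_8);
        // Three Japanese characters, written as escapes so the source file encoding does not matter.
        byte[] japanese = "\u65E5\u672C\u8A9E".getBytes(StandardCharsets.UTF_8);

        // prints ABC: 3 byte reads, 3 char reads
        System.out.println("ABC: " + countByteReads(simple) + " byte reads, " + countCharReads(simple) + " char reads");
        // prints Japanese: 9 byte reads, 3 char reads
        System.out.println("Japanese: " + countByteReads(japanese) + " byte reads, " + countCharReads(japanese) + " char reads");
    }
}
```

For plain ASCII text the two loops run the same number of times; for the Japanese sample the byte loop runs three times per character, because each of those characters takes three bytes in UTF-8.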
I see that Stephan has already posted an example while I was off playing around with my code, but let me go ahead and post another example here as well.
My understanding of this is that an InputStream reads data byte by byte. A Reader adds a layer on top of that which interprets the bytes as characters, and the result you get will depend on the character set being used. On my own computer, the default character set uses one byte per character, so the number of characters read from a file is going to equal the number of bytes. However, if a 16-bit Unicode character set is being used, each character will require two bytes, so the number of characters will be half the number of bytes.
Here's a little program which reads a one-line file consisting of just the word "Test". First it gets read using a FileInputStream, then using an InputStreamReader with the default character set, and finally using an InputStreamReader with the character set forced to UTF-16.
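For readers who want to try this, the following is a minimal sketch of a program along those lines. It is not the poster's original code: the class name, the helper methods, and the use of a temporary file written as ASCII are assumptions made so the example is self-contained.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class ThreeWaysToRead {
    // Reads the file one byte at a time and collects the unsigned byte values.
    static List<Integer> readBytes(Path file) throws IOException {
        List<Integer> values = new ArrayList<>();
        try (InputStream in = new FileInputStream(file.toFile())) {
            int b;
            while ((b = in.read()) != -1) {
                values.add(b);
            }
        }
        return values;
    }

    // Reads the file one character at a time, decoding with the given charset.
    static String readChars(Path file, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new FileInputStream(file.toFile()), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        }
        return text.toString();
    }

    // Assumption: the sample file holds the four ASCII bytes of "Test" and nothing else.
    static Path writeSampleFile() throws IOException {
        Path file = Files.createTempFile("test", ".txt");
        Files.write(file, "Test".getBytes(StandardCharsets.US_ASCII));
        return file;
    }

    public static void main(String[] args) throws IOException {
        Path file = writeSampleFile();
        try {
            // prints FileInputStream: [84, 101, 115, 116]
            System.out.println("FileInputStream: " + readBytes(file));

            // The result here depends on the platform's default character set.
            String byDefault = readChars(file, Charset.defaultCharset());
            System.out.println("Default charset: " + byDefault.length() + " chars");

            // With UTF-16 forced, each pair of bytes is decoded as one character.
            String asUtf16 = readChars(file, StandardCharsets.UTF_16);
            System.out.println("UTF-16: " + asUtf16.length() + " chars"); // prints UTF-16: 2 chars
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

With UTF-16 forced and no byte order mark in the file, Java's decoder falls back to big endian, so the four bytes come back as two characters (U+5465 and U+7374) rather than the word "Test".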