Originally posted by Jaz Chana:
From the beginning, all data has to eventually be transformed into 1s and 0s so that the computer can understand the data. This is where character encoding comes in. Now a bit is either a 1 or a 0. A byte is an octet/8 bit register, so a byte could look like 00001111. However this is useless for a human reader so it has to be converted into character data using some sort of character set and encoding. The most basic as I understand is ASCII which is a representation of 1 byte to 1 character (correct?).
That is correct, yes: ASCII is a 7-bit character set, and each character is stored in one byte.
A byte is an '8-bit signed two's complement integer' and a char is a '16-bit Unicode character'. To me that means that a char is two bytes, with unicode character encoding. Or a byte is half a char without any encoding.
Is this true? It doesn't sound correct.
The first part is correct: a Java char is 16 bits, i.e. two bytes. How many bytes a character takes up in a file depends on the encoding, though. The Unicode character set includes characters that can be represented with a single byte (think about how many unique characters a single byte can represent), and in UTF-8 the ASCII characters are stored that way.
Now, using a FileInputStream to read characters is _not_ recommended for this very reason. If you are reading Unicode characters that take 2 or more bytes to represent, then reading just one byte is only going to give you part of a character. It's best to use a FileReader instead.
If that is true then why is it, for example, that the following application outputs the numbers: "104 101 108 108 111 32 119 111 114 108 100" (hex data I'm assuming), when I would have expected to see 0s and 1s?
by the way, the text in the file is "Hello World"
The read() method is pulling back bytes from the stream as an int, so you're getting a decimal (not hex) representation of the byte you are reading in. If you compare the numbers you are getting to an ASCII decimal chart you'll see how they match up to actual characters.
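As a rough illustration of the point above (not the original poster's program, which is not shown here), the sketch below reads a stream byte by byte and prints each value first as a decimal number and then cast to a char. The class name ByteDump and its helper methods are made up for this example, and a ByteArrayInputStream stands in for the FileInputStream over the text file. Note that the numbers quoted in the question correspond to lowercase "hello world"; a capital "H" would show up as 72.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ByteDump {
    // Reads the data one byte at a time and returns the values as decimal numbers.
    static String toDecimal(byte[] fileContents) {
        StringBuilder sb = new StringBuilder();
        try (InputStream in = new ByteArrayInputStream(fileContents)) {
            int b;
            while ((b = in.read()) != -1) {      // read() returns 0..255, or -1 at end of stream
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(b);                    // appends the number, e.g. "104"
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Same loop, but each value is cast to char before it is appended.
    static String toText(byte[] fileContents) {
        StringBuilder sb = new StringBuilder();
        try (InputStream in = new ByteArrayInputStream(fileContents)) {
            int b;
            while ((b = in.read()) != -1) {
                sb.append((char) b);             // only safe for single-byte data such as ASCII
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = "hello world".getBytes(StandardCharsets.US_ASCII);
        System.out.println(toDecimal(data));
        System.out.println(toText(data));
    }
}
```

The only difference between the two methods is the cast: the bytes in the file are the same, and it is the program that decides whether to show them as numbers or as characters.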
also why is it that a character stream in the next program pulling data from the same file outputs the same set of numbers?
I would have expected the string to output as it was to be represented, or at least be different from the byte data. It is after all using a different encoding.
Can someone please explain why?
Basically all you are getting is a string representation of the number that is the int representation of the byte. Your code isn't actually doing the character set conversion (104 to "h", for example). I would take a look at the FileReader class to get the desired behaviour.
Hope that all helps!
Originally posted by Jaz Chana:
Thank you very much for your reply. It has cleared things up a little. I have many more questions, but firstly I should clear something up. The code for the character data read was wrong. The code I was really referring to was this:
The other code actually does print the text output. I'll come back to the above in a sec, but firstly I have some other questions.
OK, you're using a method there that's not quite going to give you what you want. If you follow the Javadoc for the FileReader API you'll see that the read() method you are using is still returning the characters as an int. You want to use an alternative read() method in which you pass in a character array that automatically gets filled.
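A minimal sketch of the read(char[]) approach described above. The class name CharRead and the method readAll are hypothetical, and an InputStreamReader over an in-memory byte array stands in for the FileReader so the example runs without a file. One design choice to be aware of: the charset is passed explicitly here, whereas the single-argument FileReader constructors fall back to the platform default encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharRead {
    // Decodes the bytes with the given charset and returns the resulting text.
    static String readAll(byte[] fileContents, Charset charset) {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[16];
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(fileContents), charset)) {
            int n;
            // read(char[]) fills the buffer and returns how many chars it stored, or -1 at the end.
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        byte[] data = "Hello World".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(data, StandardCharsets.UTF_8));
    }
}
```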
Okay, now I understand that the hex data relates to alphabet chars converted by the ascii table. But where does it state to use hex encoding?
I thought that the encoding is determined by the stream and not the data type used to store it. I want it to use ASCII or unicode encoding.
Hex (and its decimal equivalent) is not an encoding; it is simply a way of writing down the numeric value stored in each byte. Specifying an ASCII/Unicode encoding tells the reader how many bytes to use for a character and what the mapping from those bytes to the actual character should be.
Why does java use an int to read the data in? Surely it would be better off using a byte or a char value. The justification in the documentation (http://java.sun.com/docs/books/tutorial/essential/io/bytestreams.html) states that 'Using a int as a return type allows read() to use -1 to indicate that it has reached the end of the stream.' But what is wrong with using a byte or char and having null indicate the end of file, or even a String? In fact, how is it that char/byte data can at all be represented by
It just doesn't make sense to me. :~
Didn't make sense to a lot of us when the first I/O APIs came out.
Basically the Stream classes are designed to work at the 'lowest' level just above bits, which allows programmers or higher level API calls maximum flexibility. Your case of wanting to read the characters in a human readable form is only one of many use cases for those Classes/methods (I can give examples of other cases if that helps).
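On the question of why read() returns an int: a Java byte is signed, so the byte 0xFF has the value -1, and a primitive byte or char cannot be null. Returning an int lets read() report data bytes as 0 to 255 and keep -1 free for end of stream. The sketch below shows this; the class name EofDemo and the method readValues are made up for the example.

```java
import java.io.ByteArrayInputStream;

class EofDemo {
    // Returns what read() reports for each byte, followed by the end-of-stream marker.
    static int[] readValues(byte[] data) {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        int[] values = new int[data.length + 1];
        for (int i = 0; i < values.length; i++) {
            values[i] = in.read();   // 0..255 for data, -1 once the stream is exhausted
        }
        return values;
    }

    public static void main(String[] args) {
        byte[] data = { (byte) 0xFF, 0x41 };
        for (int value : readValues(data)) {
            System.out.println(value);
        }
        // As a signed byte, 0xFF is -1, so a byte return type could not tell data from end of stream.
        System.out.println((byte) 0xFF == -1);
    }
}
```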
Leaving that aside for a moment, I am going to take this discussion up a level and talk about Blob (Binary Large Object) and Clob (Character Large Object) data. This whole problem came about as a result of my attempt to store some large XML and string data in a MySQL database. I couldn't decide whether to use Clob or Blob in a MySQL database (Clob is the same as longtext). If I understand correctly, Blob would use ASCII encoding and Clob would use Unicode. Since Unicode is a superset of ASCII, you can store all ASCII characters and more in a Clob.
Hmm, I don't know much about MySQL, but you're correct about Unicode being a superset of ASCII. A Blob, though, is just raw bytes with no character encoding attached; only character types such as Clob carry an encoding.
Given what I have learned, it makes more sense to store them as Clob. However, since there are more bytes per Clob (since Unicode can use up to 4 bytes per char), it uses more memory. Hence one advantage of a Blob over a Clob is use of memory.
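How much space Unicode text takes depends on the encoding rather than on Unicode itself. The sketch below counts the encoded bytes for a few sample characters; the class name EncodedSize and the method sizeIn are hypothetical, and it only illustrates Java's encoders, not how any particular database stores a Clob.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedSize {
    // Number of bytes the text occupies once encoded with the given charset.
    static int sizeIn(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",                                    // plain ASCII letter
            "\u00e9",                               // e with acute accent
            "\u20ac",                               // euro sign
            new String(Character.toChars(0x13000))  // an Egyptian hieroglyph, outside the 16-bit range
        };
        for (String s : samples) {
            System.out.println("UTF-8: " + sizeIn(s, StandardCharsets.UTF_8)
                    + " bytes, UTF-16BE: " + sizeIn(s, StandardCharsets.UTF_16BE) + " bytes");
        }
    }
}
```

In UTF-8, text that is pure ASCII takes one byte per character, so the extra cost only appears for characters outside the ASCII range.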
To go back to the original problem, how large is your XML? You may find that BLOB or CLOB storage is not required.
Is this the same for char and byte data in java? Is byte data represented as ASCII? I've heard of byte data being referred to as binary. In reality that's not true is it? byte data is as close to binary as char data. Both byte and char data are encoded, the only difference is that they use different encodings. Is this correct? Is this the same for all data types? The only difference between them is the encoding? Is this the reason why an int can represent a byte?
I'll answer it this way. Bytes are simply the lowest-level building blocks for data; they can represent integers, characters etc. char data is at a slightly higher level, as several bytes can make up a char. You'll find that the byte and char classes/methods seem almost the same, which adds to the confusion, but they _are_ different.
The usage of an int to represent a byte is just a common low-level way to represent the 8-bit register. It has a small memory footprint and is easy to manipulate (for example, if you're going to blindly copy the contents of a file you wouldn't want to convert the bytes into actual characters and then copy, you'd just want to do a raw copy).
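A raw copy of the kind mentioned above might look like the following sketch. The class name RawCopy and the method copy are made up for illustration; the point is that the loop moves bytes from one stream to another and never decodes them, so it works the same for text, images or anything else.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class RawCopy {
    // Copies every byte from in to out without interpreting it; returns the byte count.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {   // n is the number of bytes placed in the buffer
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] source = { 0, (byte) 0xFF, 0x41, (byte) 0x80 };   // arbitrary bytes, not text
        ByteArrayOutputStream target = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(source), target);
        System.out.println(copied + " bytes copied");
    }
}
```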
Okay, one last area is how the information is stored. So far we've been talking about taking data and representing it in different ways, but we haven't really delved into the initial state of the data. At the beginning we stated that everything is represented on a computer by 1s and 0s. If this is true, then to a computer there is no difference between character and byte data. The difference only becomes apparent when the information has to be displayed.
At the lowest level that is correct yes, computers only understand 1's and 0's, it's the programming data constructs and languages on top of that that convert the bytes into meaningful things.
When people talk about storing data I sometimes hear that they want to store that information as binary/byte data or as an array of bytes. For example, taking a string, converting it to an array of bytes and storing it seems commonplace on the net. But why would you want to do that? Surely you would lose data if you did that? And considering that a char could potentially be split across 4 bytes (assuming that the original data was Unicode), is it true to say that the data would be corrupted and not possible to convert back?
Now you're getting to the heart of the matter! Putting Strings back into a byte array is common, as you know bytes are the lowest level so again it's a matter of efficiency etc to deal with that low level when you are performing low level operations (like copy). More often than not this low level behaviour is hidden by a higher level API call. You definitely don't lose data although you can 'corrupt' the data by reading it back in using the wrong encoding.
For example, you encode (Unicode) a complex character from ancient Egypt and that gets stored as 4 bytes [100, 230, 5, 4]. You then read it back in as ASCII and you get 4 separate characters 100, 230, 5 and 4 (because ASCII encoding says 1 byte == 1 character). However, if you used the right reader/encoding (Unicode), it knows to retrieve the full 4 bytes [100, 230, 5, 4] as one character.
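The same idea with real byte values, as a sketch: the hieroglyph U+13000 encodes to four bytes in UTF-8, and decoding those bytes with a one-byte-per-character charset gives four unrelated characters instead of one. The class name WrongCharset and the method countCharacters are hypothetical. ISO-8859-1 is used as the "wrong" charset here rather than ASCII, because it defines a character for every byte value, which makes the result easy to count.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WrongCharset {
    // Decodes the bytes with the given charset and counts the resulting characters (code points).
    static int countCharacters(byte[] bytes, Charset charset) {
        String decoded = new String(bytes, charset);
        return decoded.codePointCount(0, decoded.length());
    }

    public static void main(String[] args) {
        String hieroglyph = new String(Character.toChars(0x13000));
        byte[] stored = hieroglyph.getBytes(StandardCharsets.UTF_8);

        System.out.println("bytes stored: " + stored.length);
        System.out.println("read back as UTF-8: "
                + countCharacters(stored, StandardCharsets.UTF_8) + " character(s)");
        System.out.println("read back as ISO-8859-1: "
                + countCharacters(stored, StandardCharsets.ISO_8859_1) + " character(s)");

        // Nothing is lost in the byte array itself: the right charset restores the original.
        System.out.println(new String(stored, StandardCharsets.UTF_8).equals(hieroglyph));
    }
}
```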
I'd also recommend looking at the Sun tutorial and the Javaranch FAQ on this.
As you can see I am extremely confused on the subject. I feel like I am on the verge of understanding, but there are some fundamental concepts that elude me. The biggest of which are around storage, retrieval and display.
Actually I think you're doing extremely well, you're looking at this at a much lower level than most people would! For example, most people just use a high level API such as hibernate to save XML to a database, you'd literally go and it's done.
If you haven't thought about it already I'd recommend taking a few basic Computer Science papers, they deal with subjects like this and I suspect you'd be very good at it!
Originally posted by Jaz Chana:
Thank you very much, you've been instrumental in helping me understand. This will definitely be my first port of call the next time I have an issue.
I've only just started getting involved in this community recently, but I've found it to be an excellent place to ask questions no matter what your level of experience. I've already learned some new things and been humbled on more than one occasion.
I think I need to go over it a few times before I have a solid understanding. I am now realizing that actually I am not asking one question on one area, but several at once. Therefore I need to take time in digesting everything and asking those questions again.
However, I now have enough information to solve the issue I have. I think I should store the data as Clob. The XML and the String data are extremely large. We are talking about possibly several megabytes per string or XML document. I don't know if the data will contain more than just ASCII data, so to be on the safe side I think I should store it as Clob.
What do you think?
It really depends on what you are storing those docs for. Do you want to be able to search on the contents of those docs?
* BLOB and CLOB are not easily human readable forms (and therefore are not searchable either), are you really gaining anything over storing the XML on a file system? Or do you have other meta data that you are storing in that same table which will assist you with searches etc?
* There are a number of XML extensions that some database vendors offer (mysql might be one of them) where you can actually store your XML in an 'XML datastore' and you can use XPath etc to directly search on that data.
* PS I'm assuming you're in academic research.
* You might find this article handy as well
I will read those articles you sent. The one on the Sun forums I have read before, but I think I need to go over it again. The JavaRanch article is new to me and I will be going through it with a fine-tooth comb. As far as the computer science papers go, are there any free online resources you can recommend?
You can find a site of links and resources Here. I'd Google for others as well and see what suits your personal preference.
I was thinking about purchasing this book:
do you have any opinions on it?
Yes actually, it's very good. The author is pretty well known in Java and XML circles; if you Google him you'll find his Java and XML sites, which also contain many resources. In general you can almost always trust an O'Reilly title (when it comes to Java anyhow).