I have been confused about character encoding from the beginning.
ASCII defines 0-127 as standard characters, and 128-255 are used by different countries to encode the characters of their own languages.
But how is such an encoding used by a computer? Encodings are set in the computer at the time of manufacturing. Now, if I create a new encoding, how can it be integrated into the computer? How would the computer understand this new encoding?
Does this new encoding need to be installed on the computer?
Is it just some kind of executable file that runs and installs the new encoding? If I want to make this encoding widely available, is there an organization that handles all the encodings in the world?
Well, no, encodings are not manufactured into the computer. They are software translations from one way of representing a set of characters to another. In Java, the encodings translate between Unicode characters and bytes in some external representation. Let's say you're reading a document that you know is encoded in UTF-8. When you specify that encoding to your Java Reader, it knows that it will have to read from one to four bytes from the source for each character you request. It knows how many to read and how to map those bytes into a 16-bit Java char (or a pair of chars, for the rarer characters outside the first 65,536). If you were an expert at UTF-8 and Unicode, you could easily code the same thing and wrap it around an InputStream. With Readers and encodings, though, that work is already done for you.
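To make that concrete, here is a minimal sketch of a Reader doing the UTF-8 decoding for you. The class and method names (Utf8ReadDemo, decode) are made up for this example, and the input is a hard-coded byte array rather than a real file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class Utf8ReadDemo {
    // Reads UTF-8 bytes through a Reader and returns the decoded characters.
    static String decode(byte[] utf8) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            int c;
            // Each read() returns one 16-bit char, however many bytes it took.
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'A' is one byte (0x41); U+0A97 GUJARATI LETTER GA is three bytes (E0 AA 97).
        byte[] bytes = {0x41, (byte) 0xE0, (byte) 0xAA, (byte) 0x97};
        String text = decode(bytes);
        System.out.println(text.length()); // prints 2: four bytes became two chars
    }
}
```

The caller never counts bytes; naming the charset when constructing the InputStreamReader is all that is needed.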
ETA: ah, yes, Peter is probably right that you are thinking of code pages. According to his link, they were embedded directly in hardware at some point, but I don't think that's true anymore.
Joined: Apr 23, 2009
Thanks for the replies.
I understand they were called code pages in MS-DOS times, and that this does not apply any more.
I am trying to understand this from the beginning. Let's say that in the time of code pages, they had to be embedded in hardware. If a code page for Gujarati characters was needed, then the OEM (Original Equipment Manufacturer) had to embed it in the hardware or software. Otherwise, once these computers were in India, there was no way to add a Gujarati code page if the OEM had not included it. Is that right? So in MS-DOS times, was everything in the hands of the OEM?
Say there is a language for which no encoding has been created yet. If a new encoding (not a code page) is created for this language, how can it be embedded in software? How can it be distributed for use?
Please don't provide Wikipedia links; they are more confusing.
nirjari patel wrote: ASCII defines 0-127 as standard characters, and 128-255 are used by different countries to encode the characters of their own languages.
But this is completely wrong. Java uses Unicode for its character set, and has done so since it was created over 15 years ago. Even the basic subset of Unicode supports a possible 65,536 characters, and the full version allows over a million (1,114,112 code points). So the idea of there only being 256 possible characters is obsolete and has been for a long time.
I expect that's why you are finding the Wikipedia article about character sets hard to understand. You have started with some preconceptions which are wrong, and so naturally you find the article hard to square with what you thought you knew. So may I suggest you re-read the Wikipedia article? It may be more complicated than you expected, but where you started from is far too simple.
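To put numbers on the Unicode ranges mentioned above, here is a small sketch that asks Java itself. The class and method names are invented for this example; the Character constants and methods are standard.

```java
class UnicodeRangeDemo {
    // Number of code points Unicode can address: U+0000 through U+10FFFF.
    static int codePointCount() {
        return Character.MAX_CODE_POINT + 1;
    }

    // Number of 16-bit Java chars needed to hold one code point (1 or 2).
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(codePointCount());     // prints 1114112
        System.out.println(charsNeeded(0x0A97));  // prints 1 (Gujarati letter, in the basic subset)
        System.out.println(charsNeeded(0x1F600)); // prints 2 (outside the first 65,536)
    }
}
```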
Please forget about the 0-255 characters. That was just an example I used to make things simpler.
Let's say I am creating an encoding using Unicode. How do I embed it in the computer?
You would use a CharsetProvider. (Follow that link to the API documentation where it explains how to do it.)
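As a rough sketch of what that involves, assuming a made-up single-byte encoding: the charset name "X-GURJAR", the class names, and the byte mapping below (bytes 0x00-0x7F are ASCII, bytes 0x80-0xFF are laid directly over the Unicode Gujarati block U+0A80-U+0AFF) are all invented for illustration and are not any real standard. Only Charset, CharsetDecoder, CharsetEncoder, CoderResult and CharsetProvider are real JDK classes.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.spi.CharsetProvider;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;

class GurjarDemo {
    public static void main(String[] args) {
        // Ask the provider directly, so the demo runs without any registration.
        Charset cs = new GurjarCharsetProvider().charsetForName("X-GURJAR");
        String text = new String(new byte[] {0x41, (byte) 0x97}, cs);
        System.out.println(text.equals("A\u0A97"));
        System.out.println(Arrays.toString(text.getBytes(cs)));
    }
}

// Hypothetical charset: 0x00-0x7F is ASCII, 0x80-0xFF maps to U+0A80-U+0AFF.
class GurjarCharset extends Charset {
    static final String NAME = "X-GURJAR"; // invented name, no aliases

    GurjarCharset() { super(NAME, null); }

    @Override public boolean contains(Charset cs) {
        return cs.name().equals(NAME) || cs.name().equals("US-ASCII");
    }

    @Override public CharsetDecoder newDecoder() {
        return new CharsetDecoder(this, 1f, 1f) { // one char per byte
            @Override protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) return CoderResult.OVERFLOW;
                    int b = in.get() & 0xFF;
                    out.put((char) (b < 0x80 ? b : 0x0A80 + (b - 0x80)));
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }

    @Override public CharsetEncoder newEncoder() {
        return new CharsetEncoder(this, 1f, 1f) { // one byte per char
            @Override protected CoderResult encodeLoop(CharBuffer in, ByteBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) return CoderResult.OVERFLOW;
                    char c = in.get(in.position()); // peek without consuming
                    if (c < 0x80) {
                        out.put((byte) c);
                    } else if (c >= 0x0A80 && c <= 0x0AFF) {
                        out.put((byte) (0x80 + (c - 0x0A80)));
                    } else {
                        return CoderResult.unmappableForLength(1);
                    }
                    in.get(); // consume the char we just encoded
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }
}

// The provider is what lets Charset.forName find the charset once registered.
class GurjarCharsetProvider extends CharsetProvider {
    private final Charset gurjar = new GurjarCharset();

    @Override public Iterator<Charset> charsets() {
        return Collections.singletonList(gurjar).iterator();
    }

    @Override public Charset charsetForName(String name) {
        return GurjarCharset.NAME.equalsIgnoreCase(name) ? gurjar : null;
    }
}
```

To make Charset.forName("X-GURJAR") work everywhere, you would make the provider class public, package it in a jar, and list its fully qualified class name in a file named META-INF/services/java.nio.charset.spi.CharsetProvider inside that jar. Distributing the encoding then just means distributing the jar; nothing has to be built into the machine.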
Thanks for the reply.
I have another question, which is about using an encoding. Say a new encoding "gurjar" has been created. How can a developer use it?
By default, a developer is using UTF-8. If he wants to use the new encoding, how does he do that? Does he just need to specify the encoding in his program, or does he need to write code specific to that encoding? By that I mean, if I have an English keyboard and need to display Gujarati characters using the gurjar encoding, how can I do that? Do I need a special keyboard? If not, how can I associate the letters of the new encoding with the keyboard?
As for how the developer would use the encoding, they would use it exactly as they use any other encoding.
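For instance, something along these lines, where the only thing that changes from one encoding to another is the name passed to Charset.forName. This sketch uses two charsets every JVM ships with (UTF-8 and UTF-16BE); a custom one such as the hypothetical "gurjar" would be named the same way once its provider is installed. The class and method names are invented for the example.

```java
import java.nio.charset.Charset;

class CharsetUseDemo {
    // Converts bytes from one named encoding to another, going through a String.
    static byte[] transcode(byte[] input, String fromName, String toName) {
        Charset from = Charset.forName(fromName); // throws if the name is unknown
        Charset to = Charset.forName(toName);
        String text = new String(input, from);    // bytes -> Unicode chars
        return text.getBytes(to);                 // Unicode chars -> bytes
    }

    public static void main(String[] args) {
        // U+0A97 GUJARATI LETTER GA encoded as UTF-8.
        byte[] utf8 = {(byte) 0xE0, (byte) 0xAA, (byte) 0x97};
        byte[] utf16 = transcode(utf8, "UTF-8", "UTF-16BE");
        System.out.println(utf16.length); // prints 2

        // A charset that is not installed is simply reported as unsupported.
        System.out.println(Charset.isSupported("no-such-charset")); // prints false
    }
}
```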
I don't understand why you are asking about keyboards -- keyboards have nothing to do with encodings at all. All of the APIs related to keyboards work with characters, not bytes, so what you get from a keyboard is already Unicode. No charset is necessary.