JavaRanch » Java Forums » Engineering » General Computing

character Encoding issues

nirjari patel
Ranch Hand

Joined: Apr 23, 2009
Posts: 357

I have been confused about character encoding from the beginning.

ASCII has 0-127 as standard characters and 128-255 is used by different countries for character encoding in their languages.

But how would that encoding be used in a computer? Encodings are set in the computer at the time of manufacturing. Now, if I create a new encoding, how can it be integrated into the computer? How would the computer understand this new encoding?

Does this new encoding need to be installed on the computer?

Is it just some kind of executable file that runs and installs a new encoding? If I want to make this encoding widely available, is there any organization that handles all the encodings in the world?

Please answer in detail.

Thanks
Peter Johnson
author
Bartender

Joined: May 14, 2008
Posts: 5772
    

I think you are confusing character encodings and code pages (I'm not all that clear on the difference myself). Perhaps Wikipedia will help:
http://en.wikipedia.org/wiki/Character_encoding


JBoss In Action
Greg Charles
Sheriff

Joined: Oct 01, 2001
Posts: 2771
    

Well, no, encodings are not manufactured into the computer. They are software translations from one way of representing a set of characters to another. In Java, the encodings translate between Java's internal Unicode characters and external byte representations. Let's say you're reading a document that you know is encoded in UTF-8. When you specify that encoding to your Java Reader, it knows that it will have to read from one to four bytes from the source for each character you request. It knows how many to read and how to map those bytes into Java's 16-bit Unicode chars (one char for most characters, two for the rarer ones beyond the first 65,536). If you were an expert at UTF-8 and Unicode, you could easily code the same thing and wrap it around an InputStream. With Readers and encodings, though, that work is already done for you.

ETA: ah, yes, Peter is probably right that you are thinking of code pages. According to his link, they were embedded directly in hardware at some point, but I don't think that's true anymore.
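
To make the Reader description above concrete, here is a minimal sketch (not from the original posts). It feeds six UTF-8 bytes through an InputStreamReader and collects the chars that come out. The class name Utf8ReaderDemo and the sample bytes are assumptions chosen purely for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class Utf8ReaderDemo {

    // Decodes UTF-8 bytes through a Reader, one char at a time,
    // and returns the decoded chars as a String.
    static String decode(byte[] utf8Bytes) {
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                out.append((char) c);
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 'A' takes 1 byte, U+00E9 takes 2 bytes, U+0A97 (a Gujarati letter) takes 3.
        byte[] bytes = {0x41, (byte) 0xC3, (byte) 0xA9,
                        (byte) 0xE0, (byte) 0xAA, (byte) 0x97};
        String text = decode(bytes);
        System.out.println(bytes.length + " bytes -> " + text.length() + " chars");
    }
}
```

The Reader consumes a different number of bytes for each of the three characters, yet the caller only ever sees chars.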
nirjari patel
Ranch Hand

Joined: Apr 23, 2009
Posts: 357
Thanks for the replies.

I understand they were called code pages in MS-DOS times. That does not apply any more.

I am trying to understand this from the beginning. So let's say that in the times of code pages, they had to be embedded in hardware. So if a code page for Gujarati characters was needed, the OEM (Original Equipment Manufacturer) had to embed it in hardware or software. Otherwise, once these computers were in India, there would be no way to add the Gujarati code page if the OEM had not included it. Is that right? So in MS-DOS times everything was in the hands of the OEM only?

Say there is a language for which no encoding has been created yet. Now if a new encoding (not a code page) is created for this language, how can it be embedded in software? How can it be distributed for use?

Please don't provide Wikipedia links; they are more confusing.

Thanks
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18120
    

nirjari patel wrote:ASCII has 0-127 as standard characters and 128-255 is used by different countries for character encoding in their languages.


But this is completely wrong. Java uses Unicode for its character set, and has done so since it was created over 15 years ago. Even the basic subset of Unicode supports a possible 65,536 characters, and the full version allows over a million. So the idea of there only being 256 possible characters is obsolete and has been for a long time.

I expect that's why you are finding the Wikipedia article about character sets hard to understand. You have started with some preconceptions which are wrong, and so naturally you find the article hard to square with what you thought you knew. So may I suggest you re-read the Wikipedia article? It may be more complicated than you expected, but where you started from is far too simple.
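
As a small illustration of the point about Unicode's size, here is a sketch (not from the original posts) showing how Java handles characters inside and outside the first 65,536 code points. The class name CodePointDemo and the two sample code points are assumptions picked for illustration.

```java
final class CodePointDemo {

    // Number of 16-bit Java chars needed to hold one Unicode code point.
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int gujaratiKa = 0x0A95;  // inside the first 65,536 code points
        int emoji = 0x1F600;      // outside them, so it needs a surrogate pair

        System.out.println(charsNeeded(gujaratiKa)); // 1
        System.out.println(charsNeeded(emoji));      // 2

        // The same distinction shows up in String: length() counts chars,
        // codePointCount() counts whole characters.
        String s = new String(Character.toChars(emoji));
        System.out.println(s.length() + " chars, "
                + s.codePointCount(0, s.length()) + " code point");

        // Java exposes the highest valid code point as a constant.
        System.out.println(Integer.toHexString(Character.MAX_CODE_POINT));
    }
}
```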
nirjari patel
Ranch Hand

Joined: Apr 23, 2009
Posts: 357
Please forget about the 0-255 characters. That is just an example I am using to make things simpler.

Let's say I am creating an encoding using Unicode. Now how do I embed it in a computer?

Thanks
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18120
    

You would use a CharsetProvider (java.nio.charset.spi.CharsetProvider). The API documentation for that class explains how to do it.
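
For a rough idea of what that involves, here is a minimal sketch, not a definitive implementation. Everything specific in it is an assumption made up for illustration: the charset name "X-Gurjar-Demo", the class names, and the byte mapping itself (bytes 0x00-0x7F as ASCII, bytes 0x80-0xFF mapped onto the Unicode Gujarati block U+0A80-U+0AFF). Per the CharsetProvider documentation, a real provider is made discoverable by listing its class name in a file named META-INF/services/java.nio.charset.spi.CharsetProvider inside the jar.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.spi.CharsetProvider;
import java.util.Collections;
import java.util.Iterator;

// Hypothetical single-byte charset, invented for this sketch.
final class GurjarCharset extends Charset {
    static final String NAME = "X-Gurjar-Demo"; // made-up name, not a registered charset

    GurjarCharset() {
        super(NAME, null); // no aliases
    }

    @Override
    public boolean contains(Charset cs) {
        return cs instanceof GurjarCharset;
    }

    @Override
    public CharsetDecoder newDecoder() {
        // One byte always becomes exactly one char.
        return new CharsetDecoder(this, 1.0f, 1.0f) {
            @Override
            protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) {
                        return CoderResult.OVERFLOW;
                    }
                    int b = in.get() & 0xFF;
                    out.put((char) (b < 0x80 ? b : 0x0A80 + (b - 0x80)));
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }

    @Override
    public CharsetEncoder newEncoder() {
        return new CharsetEncoder(this, 1.0f, 1.0f) {
            @Override
            protected CoderResult encodeLoop(CharBuffer in, ByteBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) {
                        return CoderResult.OVERFLOW;
                    }
                    char c = in.get(in.position()); // peek without consuming
                    if (c < 0x80) {
                        out.put((byte) c);
                    } else if (c >= 0x0A80 && c <= 0x0AFF) {
                        out.put((byte) (0x80 + (c - 0x0A80)));
                    } else {
                        // Leave the char unconsumed and report it as unmappable.
                        return CoderResult.unmappableForLength(1);
                    }
                    in.position(in.position() + 1);
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }
}

// For real discovery through the services file, this class would need to be
// public, in its own source file, with a public no-argument constructor.
final class GurjarCharsetProvider extends CharsetProvider {
    private final Charset gurjar = new GurjarCharset();

    @Override
    public Charset charsetForName(String name) {
        return GurjarCharset.NAME.equalsIgnoreCase(name) ? gurjar : null;
    }

    @Override
    public Iterator<Charset> charsets() {
        return Collections.singletonList(gurjar).iterator();
    }
}

final class GurjarDemo {
    public static void main(String[] args) {
        Charset cs = new GurjarCharsetProvider().charsetForName("x-gurjar-demo");
        byte[] bytes = {0x41, (byte) 0x95};   // 'A', then byte 0x95
        String text = new String(bytes, cs);  // byte 0x95 decodes to U+0A95
        System.out.println(String.format("U+%04X", (int) text.charAt(1)));
    }
}
```

The demo asks the provider for the charset directly, so it runs without the services file. Once the provider is registered, Charset.forName with the same name should find it like any built-in charset.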
nirjari patel
Ranch Hand

Joined: Apr 23, 2009
Posts: 357
Thanks for the reply.

I have another question, about using an encoding. Say a new encoding, "gurjar", is created. How can a developer use this gurjar encoding?

By default, a developer is using UTF-8. Now if he wants to use the new encoding, how can he use it? Does he just need to specify the encoding in his program, or does he need to write code specific to the encoding? By that I mean, if I have an English keyboard and need to display Gujarati characters using the gurjar encoding, how can I do that? Do I need a special keyboard for that? If not, how can I associate the letters of the new encoding with the keyboard?

Thanks
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18120
    

As for how the developer would use the encoding, they would use it exactly as they use any other encoding.

I don't understand why you are asking about keyboards -- keyboards have nothing to do with encodings at all. All of the APIs related to keyboards work with characters, not bytes, so what you get from a keyboard is already Unicode. No charset is necessary.
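
In practice, using an encoding "exactly as any other" usually comes down to passing its name (or a Charset object) to the standard APIs. Here is a rough sketch under stated assumptions: it uses UTF-8 and UTF-16BE, which every Java platform is required to support, because a custom charset would have to be installed first. The class name UseEncodingDemo and the sample text are made up for illustration.

```java
import java.nio.charset.Charset;

final class UseEncodingDemo {

    // Turns characters into bytes using the charset registered under this name.
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    // Turns bytes back into characters using the same charset name.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Three Gujarati code points (U+0A97, U+0AC1, U+0A9C), written as escapes
        // so the source file's own encoding does not matter.
        String text = "\u0A97\u0AC1\u0A9C";

        for (String name : new String[] {"UTF-8", "UTF-16BE"}) {
            byte[] bytes = encode(text, name);
            boolean roundTrips = decode(bytes, name).equals(text);
            System.out.println(name + ": " + bytes.length
                    + " bytes, round trip ok = " + roundTrips);
        }
    }
}
```

The String holds the same characters throughout; only the byte representation changes with the charset name. A newly installed charset would be selected the same way, by its name.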
 