any hints for creating &/or using existing UNICODE convertor/processor?

Guennadiy VANIN
Ranch Hand

Joined: Aug 30, 2001
Posts: 898
For ex., I would like to convert UNICODE codes to ASCII and other encodings
Aleks V. Pascoal
Ranch Hand

Joined: Apr 21, 2002
Posts: 73
My Friend,
you can easily convert a String to a different charset using:
String p = "Any string";
p = new String(p.getBytes("UTF8"));
I would like to know how to check what is the current CharSet used by the JVM.
sheril she
Greenhorn

Joined: Oct 08, 2002
Posts: 12
just check out if this works

String charset = response.getCharacterEncoding();
Guennadiy VANIN
Ranch Hand

Joined: Aug 30, 2001
Posts: 898
Alex and Sheril,
I know how to program. What I was really asking about is a ready-made processor with such functionality.
My name is not Friend, it, the name, usually appears at the left sidebar and/or under the post.

String p = "Any string";
p = new String(p.getBytes("UTF8"));

This is not correct for just any string. Try ANY symbol in Cyrillic and see. The JVM uses Unicode, i.e. 2 bytes per symbol. You may write your program directly in Unicode escapes and it is all the same to javac!
[ October 29, 2002: Message edited by: G Vanin ]
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Come on, Guennadii - "my friend" is a simple friendly greeting. It doesn't imply your name is "friend" any more than it implies your name is "my". Will you object to "you" next? Please, lighten up. No offense was intended.
Java does have built-in functionality for this, and getBytes() is part of it. Unfortunately the example shown is incorrect - a better one would be:
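(The code block from this post did not survive in this copy of the thread. What follows is only a minimal sketch of the kind of fix described below, naming the encoding on both sides of the conversion. The class and method names are made up for illustration, and StandardCharsets assumes Java 7 or later; in 2002 one would have passed the name "UTF-8" as a String instead.)

```java
import java.nio.charset.StandardCharsets;

class RoundTripSketch {
    // Encode to UTF-8 bytes, then decode with the SAME encoding, named explicitly.
    static String roundTrip(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Decoding the UTF-8 bytes with a different charset shows how the text gets mangled
    // when the decoder does not match the encoder.
    static String decodeAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String cyrillic = "\u041F\u0440\u0438\u0432\u0435\u0442"; // a six-letter Cyrillic word
        System.out.println(cyrillic.equals(roundTrip(cyrillic)));      // true
        System.out.println(cyrillic.equals(decodeAsLatin1(cyrillic))); // false
    }
}
```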

The problem in the original code is that while getBytes() converted Unicode to UTF-8, the new String(byte[]) constructor (probably) did not use UTF-8. Instead it used the default encoding on your system - whatever that may be. On Windows systems in the Americas and Western Europe it's usually Cp-1252 (Windows Latin-1).
You can also use an OutputStreamWriter to convert Unicode to other encodings, and an InputStreamReader to convert other encodings to Unicode. See the constructors which accept a String encoding argument, or a Charset (in 1.4).
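As a rough sketch of the stream approach (helper names are hypothetical; try-with-resources and UncheckedIOException assume Java 8 or later), using the constructors that take an encoding name:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

class StreamBridgeSketch {
    // Writer side: Java chars (Unicode) go in, bytes in the named encoding come out.
    static byte[] encode(String text, String encodingName) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(bytes, encodingName)) {
            out.write(text);
        } catch (IOException e) { // also covers UnsupportedEncodingException
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reader side: bytes in the named encoding go in, Java chars come out.
    static String decode(byte[] data, String encodingName) {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new InputStreamReader(new ByteArrayInputStream(data), encodingName)) {
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "caf\u00E9";
        System.out.println(encode(text, "ISO-8859-1").length); // 4
        System.out.println(encode(text, "UTF-8").length);      // 5
        System.out.println(text.equals(decode(encode(text, "UTF-16BE"), "UTF-16BE"))); // true
    }
}
```

In real code the streams would wrap a file or socket rather than byte arrays; the arrays are only there to keep the sketch self-contained.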
I would like to know how to check what is the current CharSet used by the JVM.
Annoyingly Java doesn't seem to directly provide this info. Sheril's response tells you how a server is configured to respond to HTTP requests, which isn't necessarily the same thing. (And what if you're not even running a server?) The best workaround I have to find the system default encoding is:
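(The workaround code is also missing from this copy of the thread. One plausible sketch, not necessarily what was originally posted: ask a writer that was built without an encoding argument which encoding it picked. The class name is made up, and Charset.defaultCharset() is included only for comparison, since it assumes Java 5 or later and did not exist when this was written.)

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

class DefaultEncodingSketch {
    // A writer constructed without an encoding uses the platform default,
    // and getEncoding() reports its (historical) name, e.g. "UTF8" or "Cp1252".
    static String viaWriter() {
        return new OutputStreamWriter(new ByteArrayOutputStream()).getEncoding();
    }

    // Java 5+ exposes the default directly.
    static Charset viaCharset() {
        return Charset.defaultCharset();
    }

    public static void main(String[] args) {
        System.out.println("Writer reports:  " + viaWriter());
        System.out.println("Default charset: " + viaCharset().name());
    }
}
```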

[ October 29, 2002: Message edited by: Jim Yingst ]

"I'm not back." - Bill Harding, Twister
Cindy Glass
"The Hood"
Sheriff

Joined: Sep 29, 2000
Posts: 8521
Originally posted by G Vanin:
the name, usually appears at the left sidebar.

Except yours of course. :roll: He may not have wanted to call you G, and most people do NOT put their name on the bottom of a post.
You COULD change your display name . . . . .


"JavaRanch, where the deer and the Certified play" - David O'Meara
Guennadiy VANIN
Ranch Hand

Joined: Aug 30, 2001
Posts: 898
Jim,
I believe that characters in Java are treated as 16-bit UniCode characters (Reader, Writer). They do not depend on particular encoding format because they are in UNICODE.
I have a notion that the Latin-1 encoding (the one used in the US and Europe) is “8859_1” (ISO 8859-1). The encoding in use may be obtained/verified by calling
System.getProperty("file.encoding")

Anyway, it is not UTF.
It is always possible to enforce another encoding conversion through byte streams (InputStream, OutputStream). There are bridges between bytestreams and character-streams: InputStreamReader, OutputStreamWriter. They are character stream objects (Reader or Writer) that take byte streams (InputStream or OutputStream), as well as, possibly in addition, “encoding”. OK.
response.getCharacterEncoding();
Characters produced outside of Java may certainly be in any encoding. This depends on the OS, the application and/or its configuration, and even on the processor. Who knows the origin of our streams (is it a file created in Taiwan, sorry, in China)? O-o-o-h, I did not intend to discuss any of this (please refer to my original question. Anyway, I repeated it in http://www.coderanch.com/t/113264/HTML-JavaScript/If-anybody-knows-any-text)
What I could not get from all your deviations, sorry, arguments:
1) What are those UTF8 examples all about? Why UTF8?
2) Can you explain Cp-1252 to me? What is its relation to Latin-1?
[ November 06, 2002: Message edited by: G Vanin ]
Guennadiy VANIN
Ranch Hand

Joined: Aug 30, 2001
Posts: 898
Cindy and Jim,
most people do NOT put their name on the bottom of a post.
then somebody ALWAYS put it on the left.
Sounds like my name was pages away from 1-line post. See more above
Except yours of course

That hurts. Where does such a terrible and unfair suspicion about forging my names come from?
You COULD change your display name . . . .

NEVER. I already tried once, and after that JavaRanch added a “greenhorn” to my names. I’d rather prefer My Friend.
My names are already automatically augmented by a “member” in each post. Ask Russians what the translation means literally.
Then, how would my fans among the bartenders find me (to call me “jerks” without capitalization and “idiots as always”, also without any proper capitalization), if I start changing my names?
He may not have wanted to call you G

Very nice and clever of him, but within just 4-5 lines there were 2 complete names to choose from. And any intelligent person would have understood that “G” is just a letter/abbreviation for the first name Guennadii and not, in any possible way, the name itself. The last name is also found easily after some investigation; it is just the last one of several (Vanin).
The reason that I abbreviated to “G” is my experience that Latins (peoples of the south of Europe, and south of North America) have enormous difficulties pronouncing and remembering something that should be pronounced as [g] as in goose before “e”, since it never happens in their languages (e.g. in Portuguese). Then I started adding “Guennadii” underneath, since some bartenders here do not understand the difference between an abbreviation, i.e. a letter, and a name, i.e. something with more than one letter.

"my friend" is a simple friendly greeting

Even if to forget about capitalization of the words, never before did it happen to me in a friendly context or with friendly intentions.
In any language there are more appropriate terms even in situations when the name is unknown.
Usually it is used in sarcastic approaches
[ October 31, 2002: Message edited by: G Vanin ]
Junaid Bhatra
Ranch Hand

Joined: Jun 27, 2000
Posts: 213
To answer some questions:
1) The platform default encoding can be obtained by System.getProperty("file.encoding")
2) Cp1252 is the Windows character set (code page 1252). It's very similar to ISO-8859-1 (Latin 1), but is not identical. For example, in Latin 1, characters in the range 128-159 are control characters (non-printable). However, Cp1252 assigns some printable characters (such as the TM symbol) to codes within this range. So Cp1252 is kind of like a superset of Latin 1.
3) I think there is some confusion here regarding "character sets" and encodings. UTF-8 is simply an encoding scheme (i.e. it represents characters as a sequence of octets), and is used to encode the Unicode "character set". A character set, on the other hand, is a mapping between "characters" and "character codes" (integers). Additionally, a character set may specify one or more encoding schemes. For example, Unicode has UTF-8 and UTF-16.
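Point 2 above can be checked with a few lines of Java. This is only a sketch: the class name is made up, and it assumes the JVM ships the "windows-1252" charset, which is common but not among the charsets every Java platform is required to support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Cp1252VersusLatin1 {
    // Decode a single byte value (0-255) using the given charset.
    static String decode(int byteValue, Charset cs) {
        return new String(new byte[] { (byte) byteValue }, cs);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252"); // assumed to be available
        Charset latin1 = StandardCharsets.ISO_8859_1;

        // 0x99 (153) lies in the 128-159 range where the two differ.
        System.out.println(decode(0x99, cp1252).equals("\u2122")); // true: the TM symbol
        System.out.println(decode(0x99, latin1).equals("\u0099")); // true: a control character

        // 0xE9 (233) lies outside that range and decodes identically.
        System.out.println(decode(0xE9, cp1252).equals(decode(0xE9, latin1))); // true
    }
}
```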
Also it's important to remember that beyond US-ASCII (i.e. character codes > 127), all other encoding schemes (like Latin 1) are incompatible with UTF-8. This is because UTF-8 is a multi-byte encoding scheme which encodes characters like this:
1 byte: 0xxxxxxx - a leading 0 indicates a single byte is used for encoding the character
2 bytes: 110xxxxx 10xxxxxx - 110 in the first byte indicates 2 bytes are used for encoding the character
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx - 1110 in the first byte indicates 3 bytes are used for encoding the character
On the other hand, Latin 1 encoding is pretty straight-forward....simply write out the character code value (1 byte). Thus for character codes 128 - 255, Latin-1 uses 1 byte for encoding while UTF-8 uses 2 bytes.
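To see these bit patterns for yourself, something like the following sketch prints the UTF-8 bytes of a string in binary (the class and method names are made up for illustration; StandardCharsets assumes Java 7 or later):

```java
import java.nio.charset.StandardCharsets;

class Utf8Patterns {
    // Render the UTF-8 encoding of the text as space-separated 8-bit groups.
    static String bits(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            String s = Integer.toBinaryString(b & 0xFF);
            sb.append("00000000".substring(s.length())).append(s); // left-pad to 8 bits
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(bits("A"));      // 01000001
        System.out.println(bits("\u00E9")); // 11000011 10101001 (e with acute accent)
        System.out.println(bits("\u20AC")); // 11100010 10000010 10101100 (euro sign)
        // The same accented letter is a single byte in Latin 1:
        System.out.println("\u00E9".getBytes(StandardCharsets.ISO_8859_1).length); // 1
    }
}
```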
To add to the confusion, the "charset" header that is used in HTTP is really the encoding that the web-page employs & not the character set. :roll:
[ October 31, 2002: Message edited by: Junaid Bhatra ]
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Even if to forget about capitalization of the words, never before did it happen to me in a friendly context or with friendly intentions.
Well, it has happened to me. Many times. Perhaps that's because you insist on interpreting friendly gestures as unfriendly, until no one bothers trying to be friendly to you? Sure, some people may be being sarcastic when they say "my friend" - but many people are not. Why assume the worst?
That hurts. Where does such a terrible and unfair suspicion about forging my names come from?
No one is suspicious that you forged your name. Cindy simply observed that the name you prefer to be called, is not actually what's listed to the left of your post. It's a mystery to us why you continue to confuse people this way and then complain about it later.
I already tried once, and after that Javaranch adds a “greenhorn” to my names. I’d rather prefer My Friend.
You don't need to register a new account. Just go to "my profile" -> "View/Edit Profile" and change the "Publicly Displayed Name".
Now note that even after you change your name (if you do) it's still possible that someone might call you "my friend" without sarcasm, and it would probably be a good idea if you tried not to be offended for no good reason. You might even realize that people here have in fact been trying to help you, and saying "thank you" occasionally would be nice. Even if your question is not yet answered to your satisfaction, people have been trying. For some reason. :roll:
what are those UTF8 examples all about? Why UTF8?
Aleks chose UTF-8 for his example, and I simply continued it. It's a very common encoding, but you can use most any other encoding you want in much the same way. (Assuming it's one that Java understands.)
Michael Matola
whippersnapper
Ranch Hand

Joined: Mar 25, 2001
Posts: 1751
    
Not to turn this into yet another Russian vs. English thread...
My names are already automatically augmented by a “member” in each post. Ask Russians what the translation means literally.
"Member" has the exact same dual meanings in English as it has in Russian. Somehow we all survive.
Guennadiy VANIN
Ranch Hand

Joined: Aug 30, 2001
Posts: 898
Junaid and Jim,
thanks for your effort. It is useful for CP-1252. Though I certainly need some link to more exhaustive text to close the subject.
I always imagined any data as bits “0” and “1”, that are eventually integers, in my head (your “octets”?). Everything is integer to me, there, inside PC. There are also such terms as “format”, “template”, “representation”. Both digits and characters are, after all just glyphs, graphical representations according to encodings/formats.
UTF is just a way to save bandwidth, since most symbols fall inside ANSI/ASCII and therefore avoid wasting the second byte. Java uses Unicode, I believe. And UTF... Honestly speaking, it is in this discussion that I first encountered the need to know about it.
0) I would like to know how it comes to practical use, to be chosen.
1)I had not been aware abt existence of UTF-8 and UTF-16. Any further comments or links?
2) As a matter of fact, I also did not know that the second/third bytes in 2/3-byte sequences start with “10”. What is the function of the “10” (why may they not be arbitrary, if they are to be ignored)? I.e., why is “110”/“1110” not sufficient?
3)
To add to the confusion, the "charset" header that is used in HTTP is really the encoding that the web-page employs & not the character set.

This is a great pain to me. I use access through the library and save-as some pages. And you know, it is strange but at home PC (English versions of WindowsXP) I visualize OK pages in Russian but not in German. But access to Internet, in library, is through MS IE5 in Portuguese. The Windows 2000 are sometimes in English, sometimes in Portuguese. They use CP-1252
For ex., I cannot visualize (german) pages from
http://www.bild.t-online.de
Pages have hundred of KBs but do not show anything. I tried changing encodings in View - no effect.
[ November 06, 2002: Message edited by: G Vanin ]
Guennadiy VANIN
Ranch Hand

Joined: Aug 30, 2001
Posts: 898
Michael,
honestly speaking, I thought I am in Colorado.... communicating with Mexicans. Strange that it is happened to be in England. Can you give me the links to the mentioned threads?
This is the end of discussion, not the beginning. You are late
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Something I overlooked earlier...
1) The platform default encoding can be obtained by System.getProperty("file.encoding")
I would note that while this does work on many systems, it's not actually guaranteed by the API. See the list of properties under System.getProperties() in the API documentation.
And more recently...
0) I would like to know how it comes to practical use, to be chosen.
1)I had not been aware abt existence of UTF-8 and UTF-16. Any further comments or links?

Well, "UTF" by itself is ambiguous - it may refer to either UTF-8 or UTF-16. (Or other more obscure variants.) Basically UTF-8 is designed as a reasonably simple encoding which is efficient for Western European languages - typically requiring only 1 byte for most characters. The downsides are that since it's variable-length it's more complex to parse, and most Asian languages end up requiring 3 bytes per char. UTF-16 on the other hand is simpler, and all chars require two bytes. (Unless you want to use Unicode values above 0xFFFF, which are almost never needed by most of us and require more complex handling in Java, which I don't understand very well.) So generally UTF-8 is preferred in the West, and UTF-16 in Asian countries. (Or some other encoding more specific to a given language, like Shift-JIS for Japanese.)
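A small sketch of that size trade-off, comparing byte counts per character in both encodings (names are hypothetical; UTF-16BE is used so that no byte-order mark is added to the count, and StandardCharsets assumes Java 7 or later):

```java
import java.nio.charset.StandardCharsets;

class Utf8VersusUtf16 {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String latin = "A";                                   // U+0041
        String kanji = "\u65E5";                              // U+65E5
        String clef = new String(Character.toChars(0x1D11E)); // a code point above 0xFFFF

        System.out.println(utf8Length(latin) + " vs " + utf16Length(latin)); // 1 vs 2
        System.out.println(utf8Length(kanji) + " vs " + utf16Length(kanji)); // 3 vs 2
        System.out.println(utf8Length(clef) + " vs " + utf16Length(clef));   // 4 vs 4

        // Above 0xFFFF a single character occupies two Java chars (a surrogate pair).
        System.out.println(clef.length());                          // 2
        System.out.println(clef.codePointCount(0, clef.length()));  // 1
    }
}
```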
On reflection, maybe I should have just provided these two links:
http://www.wikipedia.org/wiki/UTF-8
http://www.wikipedia.org/wiki/UTF-16
I'm not sure if that's what you were asking, but you can always Google for more links.
2) As a matter of fact, I also did not know that the second/third bytes in 2/3-byte sequences start with “10”. What is the function of the “10” (why may they not be arbitrary, if they are to be ignored)? I.e., why is “110”/“1110” not sufficient?
It's an easy way to tell if a particular byte is the start of a multibyte sequence or not. If you're writing an encoder or decoder and you commit some sort of off-by-one-byte error, it's easier to detect the error this way.
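For instance, a decoder that lands in the middle of a byte array can find the start of the current character just by skipping backwards over bytes that begin with "10". A minimal sketch (the class and method names are made up for illustration):

```java
import java.nio.charset.StandardCharsets;

class Utf8Resync {
    // Continuation bytes have the bit pattern 10xxxxxx.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Step back from an arbitrary index to the first byte of the character containing it.
    static int startOfCharacter(byte[] utf8, int index) {
        int i = index;
        while (i > 0 && isContinuation(utf8[i])) {
            i--;
        }
        return i;
    }

    public static void main(String[] args) {
        // 'a', the euro sign (3 bytes), 'b' -> bytes 61 E2 82 AC 62
        byte[] data = "a\u20ACb".getBytes(StandardCharsets.UTF_8);
        System.out.println(startOfCharacter(data, 3)); // 1: index 3 is inside the euro sign
        System.out.println(startOfCharacter(data, 4)); // 4: 'b' starts its own character
    }
}
```

If the trailing bytes were arbitrary, there would be no way to tell from a byte alone whether it starts a character or sits in the middle of one.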
To add to the confusion, the "charset" header that is used in HTTP is really the encoding that the web-page employs & not the character set.
Similarly java.nio.charset.Charset in JDK 1.4 really represents an encoding, not a character set.
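As a sketch of what that class offers for the original question (converting Unicode to ASCII and other encodings), here is one way to encode a string and to check beforehand whether the target encoding can represent it. The helper names are hypothetical; the Charset calls themselves are from java.nio.charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

class CharsetApiSketch {
    // Encode text in the named charset. Characters the charset cannot represent
    // are replaced by the charset's replacement byte (typically '?').
    static byte[] encode(String text, String charsetName) {
        ByteBuffer buf = Charset.forName(charsetName).encode(text);
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    // True if every character of the text can be encoded in the named charset.
    static boolean representable(String text, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String cyrillic = "\u0416"; // one Cyrillic letter
        System.out.println(representable("abc", "US-ASCII"));    // true
        System.out.println(representable(cyrillic, "US-ASCII")); // false
        System.out.println(encode(cyrillic, "US-ASCII")[0]);     // 63, i.e. '?'
        System.out.println(encode(cyrillic, "UTF-8").length);    // 2
    }
}
```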
http://www.bild.t-online.de
Pages have hundred of KBs but do not show anything. I tried changing encodings in View - no effect.

The site comes up fine for me - encoding is ISO-8859-1. I suspect this has nothing to do with encoding issues, but rather with the fact that you're accessing from a library in the US. The website is a little racy by US library standards - it has some nudity (even if that's not the primary focus), and so the library probably has it blocked. You may be interested in David O'Meara's suggestion in this thread.
[ November 06, 2002: Message edited by: Jim Yingst ]
Guennadiy VANIN
Ranch Hand

Joined: Aug 30, 2001
Posts: 898
Jim,
thanks for all that stuff (I saved-as and shall study later).
Meanwhile, that problem of visualizing after save-as was reproduced by others; see in "HTML and Javascript"
http://www.coderanch.com/t/113280/HTML-JavaScript/Cannot-open-pages-After-saving
so, it is weird
paul wheaton
Trailboss

Joined: Dec 14, 1998
Posts: 20660

Some people seem to find the best possible interpretation of a message and are happy to get any response to a question.
Some people seem to find insult and injury in almost any message.
Some people seem to have fun and have a good time wherever they go.
Some people seem to be cranky all day long, every day.
There are over six billion people in the world. There's no reason to spend any time talking to cranky people.


Guennadiy VANIN
Ranch Hand

Joined: Aug 30, 2001
Posts: 898
Paul,
that's a challenge: to talk to 6 billion
 