
Convert UTF-8 into ASCII format

Harathi Rao
Ranch Hand

Joined: Oct 31, 2004
Posts: 42
Hi All,

I have a string which is in UTF-8 format; the requirement is to convert this string to ASCII format before passing it to a database.

Any help is welcome.

Thanks
Peter Chase
Ranch Hand

Joined: Oct 30, 2001
Posts: 1970
Java Strings are always in UTF-16, so I guess you do not have a String in UTF-8, but instead have some bytes in UTF-8. If you have a String and you think it's in UTF-8, then something has gone wrong somewhere in the design, I reckon.

You can make a String from your bytes, using the constructor String(byte[] bytes, String charSetName). Pass "UTF-8" as the charSetName.

You can then write your string into bytes using a different encoding, via the method of String called getBytes(String charSetName).

If you really want true 7-bit ASCII, be aware that many Unicode characters simply cannot be represented. Also, you might be able to shortcut the above procedure, because UTF-8 has 7-bit ASCII as a subset.
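A minimal sketch of the two steps described above, assuming Java 7 or later so that the StandardCharsets constants can stand in for the "UTF-8" and "US-ASCII" charset names. The class and method names are made up for illustration. Note that getBytes substitutes '?' for any character that has no ASCII equivalent.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class Utf8ToAscii {
    // Step 1: decode the UTF-8 bytes into a String.
    // Step 2: encode that String as 7-bit ASCII.
    // Characters that ASCII cannot represent come out as '?'.
    static byte[] utf8ToAscii(byte[] utf8Bytes) {
        String text = new String(utf8Bytes, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    // The shortcut: if no byte has its high bit set, the UTF-8 data
    // is already plain ASCII and needs no conversion.
    static boolean isAlreadyAscii(byte[] utf8Bytes) {
        for (byte b : utf8Bytes) {
            if (b < 0) {
                return false; // high bit set: part of a multi-byte sequence
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "caf" followed by U+00E9 (e with acute accent), encoded as UTF-8
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] ascii = utf8ToAscii(utf8);
        System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // caf?
        System.out.println(isAlreadyAscii(utf8)); // false
    }
}
```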


Betty Rubble? Well, I would go with Betty... but I'd be thinking of Wilma.
Edwin Dalorzo
Ranch Hand

Joined: Dec 31, 2004
Posts: 961
One way to do it would be:



You could also use the CharsetEncoder and CharsetDecoder classes.
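As a rough illustration of the CharsetEncoder route, the sketch below encodes a String as US-ASCII and reports unmappable characters with an exception instead of silently substituting '?'. The class and method names are hypothetical, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class StrictAsciiEncoder {
    // Encode text as US-ASCII; throw if any character cannot be represented.
    static byte[] encodeStrict(String text) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buffer = encoder.encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes); // copy out only the encoded bytes
        return bytes;
    }

    public static void main(String[] args) {
        try {
            System.out.println(encodeStrict("hello").length); // 5
            encodeStrict("caf\u00e9"); // U+00E9 has no ASCII mapping
        } catch (CharacterCodingException e) {
            System.out.println("not representable in ASCII: " + e);
        }
    }
}
```

Whether to fail or to substitute a replacement character depends on what the database column is expected to hold.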
Peter Chase
Ranch Hand

Joined: Oct 30, 2001
Posts: 1970
Originally posted by Edwin Dalorzo:
One way to do it would be:




Er, what? Your string "utf" is UTF-16 encoded, like all Java strings. You can't have a Java String that is UTF-8 encoded. A stream of bytes can be UTF-8 encoded, but not a String.
Scott Johnson
Ranch Hand

Joined: Aug 24, 2005
Posts: 518
You didn't mention which database and JDBC driver you are using.

The drivers that I'm familiar with will do any conversions necessary to store Java String objects into the database in the correct character encoding.

If your database is 8 bit ASCII, the driver should handle all conversions from UTF-16 to ASCII.
Edwin Dalorzo
Ranch Hand

Joined: Dec 31, 2004
Posts: 961
Hi, Peter.

You keep saying that Java Strings are always in UTF-16. Why do you say that?

The encoding of the Java Strings is determined by the default encoding used by the JVM, declared in the file.encoding property.

A very different matter is the encoding of the Java source files (*.java), which might be UTF. But that does not have anything to do with your application.

Strings are encoded according to each particular environment, and you can just as easily convert a string from one encoding to another. The String class provides methods for such purposes, as does the java.nio.charset package.

So, Peter, how come you say all Strings in Java are UTF-16?

The example that I wrote is a way to convert a String from whatever format it is in into ASCII format. I assumed the format is UTF-X, without implying that this is always the case.

Another option for converting a String from one encoding to another is the java.nio.charset package, by means of the Encoder and Decoder classes.
pascal betz
Ranch Hand

Joined: Jun 19, 2001
Posts: 547
Edwin,


The String(byte[]) constructor creates a String object assuming the bytes are in the default platform encoding! So depending on your encoding, the string might get messed up. And if you then call the getBytes() method, you will not get back ASCII bytes but bytes in the platform default encoding.
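A small sketch of why the charset should be named explicitly. Since the platform default cannot be controlled from inside an example, the mismatch is simulated by decoding UTF-8 bytes as ISO-8859-1; that choice, like the class and method names, is an assumption made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
final class ExplicitCharsetDemo {
    // Decode bytes known to be UTF-8, independent of the platform default.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Stands in for a platform whose default encoding happens to be ISO-8859-1.
    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xC3, (byte) 0xA9}; // U+00E9 encoded as UTF-8
        System.out.println("platform default: " + Charset.defaultCharset());
        System.out.println(decodeUtf8(utf8).length());     // 1: the intended character
        System.out.println(decodeAsLatin1(utf8).length()); // 2: two wrong characters
    }
}
```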

UTF-16:
I think what Peter is referring to is the "internal encoding", the encoding which is used by the JVM (Peter?).

String API (1.5):
A String represents a string in the UTF-16 format ...


pascal
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541
    

Originally posted by Edwin Dalorzo:
Hi, Peter.

You keep saying that Java Strings are always in UTF-16. Why do you say that?

The encoding of the Java Strings is determined by the default encoding used by the JVM, declared in the file.encoding property.

A very different matter is the encoding of the Java source files (*.java), which might be UTF. But that does not have anything to do with your application.

Strings are encoded according to each particular environment, and you can just as easily convert a string from one encoding to another. The String class provides methods for such purposes, as does the java.nio.charset package.

So, Peter, how come you say all Strings in Java are UTF-16?

The example that I wrote is a way to convert a String from whatever format it is in into ASCII format. I assumed the format is UTF-X, without implying that this is always the case.

Another option for converting a String from one encoding to another is the java.nio.charset package, by means of the Encoder and Decoder classes.
Sorry, but this is all totally incorrect. Peter is right: all Java Strings are sequences of chars, and all Java chars are UTF-16 code units. (Before Java adopted Unicode 4.0 it was simpler: a char was just a Unicode character.)
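To make the UTF-16 point concrete, here is a short sketch using a character outside the Basic Multilingual Plane. The class and method names are invented for the example; the String and Character methods have been available since Java 5.

```java
// Hypothetical demo class, not part of any library.
final class Utf16CodeUnits {
    // Build a String holding a single Unicode code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF needs two chars (a surrogate pair).
        String clef = fromCodePoint(0x1D11E);
        System.out.println(clef.length());                             // 2 UTF-16 code units
        System.out.println(clef.codePointCount(0, clef.length()));     // 1 code point
        System.out.println(Character.isHighSurrogate(clef.charAt(0))); // true
    }
}
```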

An encoding is a method of converting between a Java String (which consists of chars) and an array of bytes. The String.getBytes(encoding) method maps from chars to bytes, and the new String(bytes, encoding) constructor maps from bytes to chars.

Sometimes people are sloppy and start talking about "UTF-8 strings" when they really have an array of bytes that was encoded using UTF-8, or perhaps a String that was decoded from an array of bytes using UTF-8. But that's misleading and incorrect.

You may not have realized that a file is also an array of bytes. So a Reader converts those bytes into chars, and a Writer converts chars into bytes. You're correct that the default encoding used by Readers and Writers comes from the file.encoding property; you can use different encodings by using an InputStreamReader or an OutputStreamWriter and specifying the encoding.
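A sketch of the Reader and Writer approach, assuming the input really is UTF-8 and the output should be US-ASCII. The class and method names are hypothetical; in-memory streams are used in main only so the example is self-contained, and a FileInputStream or FileOutputStream would be wrapped the same way.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class StreamTranscoder {
    // Read UTF-8 bytes from 'in' and write the same text to 'out' as US-ASCII.
    // Both streams are closed when the method returns.
    static void transcode(InputStream in, OutputStream out) throws IOException {
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
             Writer writer = new OutputStreamWriter(out, StandardCharsets.US_ASCII)) {
            char[] buffer = new char[1024];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, count); // bytes -> chars -> bytes
            }
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = "na\u00efve".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transcode(new ByteArrayInputStream(utf8), out);
        System.out.println(out.toString("US-ASCII")); // na?ve
    }
}
```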

Likewise the data you get from a socket connection is a stream of bytes; if it is text then it can be converted into String data using some encoding.

If you go to the documentation for that java.nio.charset package that you referred to, you will see it says what I just said. For example a Charset is "A named mapping between sequences of sixteen-bit Unicode characters and sequences of bytes." Its encode() method is "Convenience method that encodes Unicode characters into bytes in this charset." Its decode() method is "Convenience method that decodes bytes in this charset into Unicode characters." There's no string-to-string conversion going on at all. Because Strings don't have encodings. Only sequences of bytes do.
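The quoted convenience methods can be tried directly. This sketch, with made-up class and method names, encodes the same String with three charsets to show that only the bytes differ, while the String itself stays the same.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class CharsetRoundTrip {
    // chars -> bytes, using Charset.encode
    static byte[] encode(String text, Charset charset) {
        ByteBuffer buffer = charset.encode(text);
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    // bytes -> chars, using Charset.decode
    static String decode(byte[] bytes, Charset charset) {
        return charset.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        String text = "A\u00e9"; // 'A' followed by U+00E9
        System.out.println(encode(text, StandardCharsets.UTF_8).length);      // 3
        System.out.println(encode(text, StandardCharsets.ISO_8859_1).length); // 2
        System.out.println(encode(text, StandardCharsets.UTF_16BE).length);   // 4
        // Decoding with the same charset gives back an equal String.
        byte[] utf8 = encode(text, StandardCharsets.UTF_8);
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```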
Edwin Dalorzo
Ranch Hand

Joined: Dec 31, 2004
Posts: 961
I did not know that, Peter and Paul.

Thanks for the clarification. I guess I will have to do some research about it.

Thanks!
 