UTF-8 string length

Dan Drillich
Ranch Hand

Joined: Jul 09, 2001
Posts: 1183
The following code returns 1.
How can I find out how many bytes are actually used to store the UTF-8 string?
public class UTF8StrLen {
    public static void main(String[] args) {
        String tst = "\u6394";
        System.out.println(tst.length()); // prints 1
    }
}

William Butler Yeats: All life is a preparation for something that probably will never happen. Unless you make it happen.
James Hobson
Ranch Hand

Joined: Aug 28, 2001
Posts: 140
The length() method on String returns the number of characters, not the number of bytes.
Luckily, Java treats all Strings as 16-bit Unicode, so the memory space occupied inside the JVM is trivial to calculate (multiply the number of characters by 2).
The UTF-8 size is only really relevant when using a UTF IO stream.
If you really want to, you can use
public byte[] getBytes(String enc)
with a UTF-8 encoding name and take the length of the returned array. That is the number of bytes the String would occupy if it were converted to UTF-8.
Why would you want to know?
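A minimal sketch of the getBytes approach, for illustration only. The class name Utf8ByteLength and the helper utf8Length are made up for this example, and it assumes Java 7 or later so that the getBytes(Charset) overload with StandardCharsets.UTF_8 can be used in place of getBytes(String enc), which avoids the checked UnsupportedEncodingException.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical example class, not part of the original post.
class Utf8ByteLength {
    // Returns the number of bytes the String occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String tst = "\u6394";
        System.out.println(tst.length());    // 1: one UTF-16 char
        System.out.println(utf8Length(tst)); // 3: U+6394 needs three UTF-8 bytes
    }
}
```

Note that this allocates a full byte array just to read its length, which is fine for short Strings.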
Michael Fitzmaurice
Ranch Hand

Joined: Aug 22, 2001
Posts: 168
Hi Dan
A couple of things. Firstly, character encoding inside the JVM is always done using 16-bit Unicode chars, not UTF-8. UTF-8 is used by Java outside the JVM. Do you really want to know how many bytes are used to encode that String in UTF-8, or do you want to know how many bytes it requires in Unicode inside the JVM? If it's the former, the code below gives an idea of how this could be done.
Secondly, the String you are creating ('tst') uses the same Unicode escape syntax as a literal char, e.g. char tst = '\u6394'; Are you really trying to find out the number of bytes required for this String? Did you in fact just want a char? If you did want a char, it would be stored inside the JVM using 2 bytes, like all Java Unicode characters.
The code you are using returns 1 because the String.length() method counts the characters in a String, not the number of bytes. The escape you are using is interpreted as a single Unicode char, so you have a String made up of one character.
Code for the UTF-8 question (if this is not what you were trying to do, get back to me):

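import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;

public class UTF8StrLen {
    public static void main(String[] args) throws IOException {
        String tst = "\u6394";
        ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
        OutputStreamWriter out = new OutputStreamWriter(bytesOut, "UTF8");
        out.write(tst);  // encode the String into the byte stream
        out.close();     // flush the encoder so all bytes reach bytesOut
        byte[] tstBytes = bytesOut.toByteArray();
        int size = tstBytes.length;
        System.out.println(size + " bytes used to store the String in UTF8");
    }
}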
"One good thing about music - when it hits, you feel no pain"
Bob Marley

Jim Yingst

Joined: Jan 30, 2000
Posts: 18671
I had this problem recently. Our database is configured for UTF-8 - apparently this means that if a varchar is declared with length 2000, then 2000 bytes are reserved for it as a UTF-8 byte sequence. So before you try to insert a String into this field, you may want to know how many bytes it will occupy. James' solution is easiest to code. For very large strings, though, it can be needlessly inefficient to create and populate an entire new byte array when all you want is the length. So I used an OutputStreamWriter wrapped around a custom OutputStream which does nothing except increment a counter each time the write(int) method is called, indicating that one byte would be written. Afterwards I get the counter's value - that's the length in bytes for the string.
Of course, for very large strings, it may also be needlessly inefficient to create the whole String in memory at once. Using Readers and Writers to pass the characters around without instantiating a String is a useful approach, and the above OutputStreamWriter fits in well with it.
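The counting approach described above might look roughly like the sketch below. This is not the actual code from that project; the names CountingOutputStream, Utf8Counter and utf8Length are hypothetical, and the bulk write override is an assumption added so that a whole buffer is counted in one step instead of byte by byte.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: discards every byte and only counts how many were written.
class CountingOutputStream extends OutputStream {
    private long count = 0;

    @Override
    public void write(int b) {
        count++; // one byte would have been written
    }

    // Assumed optimization: count a whole buffer at once.
    @Override
    public void write(byte[] b, int off, int len) {
        count += len;
    }

    long getCount() {
        return count;
    }
}

class Utf8Counter {
    // Returns the UTF-8 length of s without building a byte array of that size.
    static long utf8Length(String s) {
        CountingOutputStream counter = new CountingOutputStream();
        try (Writer out = new OutputStreamWriter(counter, StandardCharsets.UTF_8)) {
            out.write(s);
        } catch (IOException e) {
            // The counting stream never fails, but the Writer API declares it.
            throw new UncheckedIOException(e);
        }
        // The writer is closed here, so the encoder has flushed every byte.
        return counter.getCount();
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("\u6394")); // 3
    }
}
```

The same Writer could be fed from a Reader in chunks, so the full String never has to exist in memory.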

"I'm not back." - Bill Harding, Twister
Jim Yingst

Joined: Jan 30, 2000
Posts: 18671
I wrote the above before seeing Michael's reply - an interesting hybrid of my method and James'. But if you're going to end up creating a full byte array anyway, then James' str.getBytes(enc).length is the easiest way to do this.