
UTF-8 string length

 
Dan Drillich
Ranch Hand
Posts: 1183
The following code returns 1.
How can I find out how many bytes are actually used to store the UTF-8 string?
Thanks,
Dan
public class UTF8StrLen {
    public static void main(String[] args) {
        String tst = "\u6394";
        System.out.println(tst.length());
    }
}
 
James Hobson
Ranch Hand
Posts: 140
The length() method on String returns the number of characters (16-bit chars), not the number of bytes.
Java holds all Strings as 16-bit Unicode (UTF-16), so the memory occupied by the character data is trivial to calculate: multiply the number of characters by 2.
The UTF-8 size is only really relevant when you write the String out through a UTF-8 I/O stream.
If you really want to know, you can call
public byte[] getBytes(String enc)
with "UTF-8" as the encoding name. The length of the returned array is the number of bytes the String occupies once converted to UTF-8.
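A minimal sketch of that approach, for what it's worth. It assumes Java 7 or later and uses the getBytes(Charset) overload with StandardCharsets.UTF_8 rather than the String-named variant, only because that avoids the checked UnsupportedEncodingException. The class name Utf8ByteLength and the helper utf8Length are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class Utf8ByteLength {

    // Hypothetical helper: number of bytes the String needs when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String tst = "\u6394";
        System.out.println(tst.length());     // prints 1 (one char)
        System.out.println(utf8Length(tst));  // prints 3 (U+6394 takes three bytes in UTF-8)
    }
}
```

Note that this builds the whole byte array just to read its length, which is fine for short strings.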
Why would you want to know?
 
Michael Fitzmaurice
Ranch Hand
Posts: 168
Hi Dan
A couple of things. Firstly, character data inside the JVM is always held as 16-bit Unicode (UTF-16), not UTF-8. UTF-8 only comes into play when text leaves the JVM, for example when it is written to a stream or a file. Do you really want to know how many bytes are needed to encode that String in UTF-8, or do you want to know how many bytes it takes in the JVM's internal Unicode form? If it's the former, the code below gives an idea of how this could be done.
Secondly, the String you are creating ('tst') uses the same Unicode escape syntax as a literal char, e.g. <code>char tst = '\u6394';</code>. Are you really trying to find out the number of bytes required for this String, or did you in fact just want a char? If you did want a char, it would be stored inside the JVM using 2 bytes, like every Java char.
The code you are using returns 1 because the String.length() method counts the chars in a String, not the number of bytes. The escape \u6394 denotes a single Unicode character, so you have a String made up of one character.
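To make the char-versus-byte distinction concrete, here is a small sketch. It assumes Java 7 or later for StandardCharsets, and it uses UTF-16BE (which writes no byte order mark) as a stand-in for the 2-bytes-per-char internal form. The class name CharsVersusBytes and the helper utf16Bytes are invented for the example.

```java
import java.nio.charset.StandardCharsets;

public class CharsVersusBytes {

    // Hypothetical helper: bytes needed at two bytes per char (UTF-16BE, no byte order mark).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String tst = "\u6394";   // a String holding the single character U+6394
        char c = '\u6394';       // the same character as a char literal

        System.out.println(tst.length());        // prints 1
        System.out.println(tst.charAt(0) == c);  // prints true
        System.out.println(utf16Bytes(tst));     // prints 2
    }
}
```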
Code for the UTF-8 question - if this is not what you were trying to do, get back to me:
<code>
<pre>
import java.io.*;

public class UTF8StrLen
{
    public static void main(String[] args) throws IOException
    {
        String tst = "\u6394";

        // Encode the String as UTF-8 into an in-memory byte buffer.
        ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
        OutputStreamWriter out = new OutputStreamWriter(bytesOut, "UTF8");
        out.write(tst);
        out.flush(); // push any buffered characters through the encoder

        // The number of bytes written is the size of the String in UTF-8.
        byte[] tstBytes = bytesOut.toByteArray();
        int size = tstBytes.length;
        System.out.println(size + " bytes used to store the String in UTF8");

        out.close();
        bytesOut.close();
    }
}
</pre>
</code>
------------------
"One good thing about music - when it hits, you feel no pain"
Bob Marley
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
I had this problem recently. Our database is configured for UTF-8 - apparently this means that if a varchar is declared with length 2000, then 2000 bytes are reserved for it as a UTF-8 byte sequence. So before you try to insert a String into this field, you may want to know how many bytes it will occupy. James' solution is easiest to code. For very large strings, though, it can be needlessly inefficient to create and populate an entire new byte array when all you want is the length. So I used an OutputStreamWriter wrapped around a custom OutputStream which does nothing except increment a counter each time the write(int) method is called, indicating that one byte would be written. Afterwards I read the counter's value - that's the length in bytes for the string.
Of course, for very large strings, it may also be needlessly inefficient to create the whole String in memory at once. Using Readers and Writers to pass the characters around without instantiating a String is a useful approach in that case, and the above OutputStreamWriter fits in well with it.
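Roughly what I mean, as a sketch rather than the exact code I used. The names Utf8ByteCounter, CountingOutputStream and utf8Length are made up for this post. It assumes Java 8 or later (try-with-resources, StandardCharsets, UncheckedIOException), and it also overrides the bulk write method so the counter is bumped once per chunk instead of once per byte.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8ByteCounter {

    // Hypothetical helper: discards everything written to it and only counts the bytes.
    static final class CountingOutputStream extends OutputStream {
        private long count;

        @Override
        public void write(int b) {
            count++;            // one byte would have been written
        }

        @Override
        public void write(byte[] b, int off, int len) {
            count += len;       // a chunk of len bytes would have been written
        }

        long getCount() {
            return count;
        }
    }

    // Counts the UTF-8 bytes of s without building a byte array of the full result.
    static long utf8Length(String s) {
        CountingOutputStream counter = new CountingOutputStream();
        try (Writer out = new OutputStreamWriter(counter, StandardCharsets.UTF_8)) {
            out.write(s);
        } catch (IOException e) {
            // The counting stream never fails, but the Writer API declares IOException.
            throw new UncheckedIOException(e);
        }
        // The writer is closed (and therefore flushed) before the count is read.
        return counter.getCount();
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("\u6394")); // prints 3
    }
}
```

The same Writer can be fed from a Reader in chunks instead of from a String, which is the point of the approach when the text is too large to hold in memory at once.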
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
I wrote the above before seeing Michael's reply - an interesting hybrid of my method and James'. But if you're going to end up creating a full byte array anyway, then James' str.getBytes(enc).length is the easiest way to do this.
 