JavaRanch » Java Forums » Certification » Developer Certification (SCJD/OCMJD)
size in bytes vs. string length

Gytis Jakutonis
Ranch Hand

Joined: Feb 02, 2004
Posts: 76
Hello,
Is it possible to calculate the maximum string length from the maximum string size in bytes? Since the byte count depends on the encoding, I'm not sure how the client side can validate input values (length, in this case) without knowledge of the db file encoding (which breaks the whole 2- or 3-tier separation). It seems many developers assume US-ASCII encoding (i.e. that it cannot change) and use the size in bytes to validate string length. Any comments? Thanks
Philippe Maquet
Bartender

Joined: Jun 02, 2003
Posts: 1872
Hi Gytis,
With the encoding scheme used in our assignment (US-ASCII), there is a fixed 1/1 ratio between characters and bytes. Hence you may safely use String.length() to validate input values as far as field sizes are concerned.
Now what would happen if they changed the file's encoding scheme to one with a fixed ratio of 2 bytes per character? The file would have to be converted anyway (its size being doubled), meaning that your previous use of String.length() would still be valid.
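To make the 1/1 ratio concrete in code: with US-ASCII every character encodes to exactly one byte, so character count and byte count always agree. A quick sketch (the sample value is made up):

```java
import java.nio.charset.StandardCharsets;

public class AsciiRatioDemo {
    public static void main(String[] args) {
        String name = "Palace Hotel";
        byte[] encoded = name.getBytes(StandardCharsets.US_ASCII);
        // With US-ASCII there is a fixed 1/1 character-to-byte ratio,
        // so String.length() is a safe proxy for the encoded size.
        System.out.println(name.length() == encoded.length); // prints "true"
    }
}
```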
Regards,
Phil.
Baruch Sadogursky
Ranch Hand

Joined: Apr 09, 2002
Posts: 62
The really interesting thing happens when using UTF-8 encoding. In UTF-8 a character can take anywhere from one to four bytes, so multiplying by 2 won't work.
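This is easy to demonstrate: the same character count can map to different byte counts under UTF-8, so no fixed multiplier works. A small sketch (sample strings are my own):

```java
import java.nio.charset.StandardCharsets;

public class Utf8WidthDemo {
    public static void main(String[] args) {
        String ascii = "cafe";
        String accented = "caf\u00e9"; // "café": the é needs two bytes in UTF-8
        System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);    // 4
        System.out.println(accented.length());                                // 4
        System.out.println(accented.getBytes(StandardCharsets.UTF_8).length); // 5
    }
}
```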


Regards,
Baruch.
Philippe Maquet
Bartender

Joined: Jun 02, 2003
Posts: 1872
That's why I talked about a fixed ratio. UTF-8 is one of the encoding schemes which use a variable ratio between characters and bytes, and hence cannot be used to encode files made of fixed-length records. UTF-8 is just an example; there are many other encodings which couldn't be supported for the same reason.
Regards,
Phil.
Gytis Jakutonis
Ranch Hand

Joined: Feb 02, 2004
Posts: 76
Hi Philippe,
thanks for your reply. It seems like you are assuming that the field size given in the db file represents the actual string field length in characters, not the size in bytes (as stated in the assignment document). In that case changing the charset encoding shouldn't harm the system, i.e. a validation like this one will work:
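The code block in the original post was lost from the page; a character-based check along these lines is presumably what was meant (class and variable names are my own):

```java
public class FieldValidator {
    // Hypothetical reconstruction of the lost snippet: fieldSize is
    // taken as a count of characters, matching the assumption above.
    public static boolean isValidLength(String value, int fieldSize) {
        return value.length() <= fieldSize;
    }

    public static void main(String[] args) {
        System.out.println(isValidLength("Smallville", 64)); // prints "true"
    }
}
```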

But what if the field size with a UTF encoding is doubled too? Then that validation will definitely fail. This is my main concern.
[Phil:corrected the "code" tags]
[ April 20, 2004: Message edited by: Philippe Maquet ]
Philippe Maquet
Bartender

Joined: Jun 02, 2003
Posts: 1872
Hi Gytis,
It seems like you are assuming that the field size given in the db file represents the actual string field length, not the size in bytes (as stated in the assignment document).

As our assignment also states that the character encoding is "US-ASCII", field sizes expressed in characters equal field sizes expressed in bytes.
But what if the field size with a UTF encoding is doubled too? Then that validation will definitely fail. This is my main concern.

I believe that supporting charsets where characters are encoded on two bytes is out of scope for this assignment. If *you* believe that too (and justify it), the test you posted is OK.

Now if you want to abstract the conversion between characters and bytes a bit, as far as field lengths are concerned, you *may* code it as:
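Phil's snippet was also lost from the page; based on his description it presumably looked something like this (names are guesses):

```java
public class FieldLengthCheck {
    // Currently 1 for US-ASCII; would become 2 for a hypothetical
    // fixed two-byte encoding, without touching the check itself.
    private static final int BYTES_PER_CHAR = 1;

    // fieldSize is expressed in bytes, as stored in the db file header.
    public static boolean isValidLength(String value, int fieldSize) {
        return value.length() * BYTES_PER_CHAR <= fieldSize;
    }

    public static void main(String[] args) {
        System.out.println(isValidLength("Fred", 4));   // prints "true"
        System.out.println(isValidLength("Freddy", 4)); // prints "false"
    }
}
```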

where BYTES_PER_CHAR is currently 1 but could be 2 in the future without breaking your test.
BTW, I implicitly suggested a constant, but it could be a variable as well, with the ratio computed at runtime.
Regards,
Phil.