Mark: default encoding for deprecated methods

 
Michael Morris
Ranch Hand
Posts: 3451
Mark,
I believe I read somewhere that you went against the grain and used the default encoding instead of UTF-8. That's what I want to do, but I'm not sure how to defend it. As someone else pointed out, UTF-8 can get sticky in fixed-length fields if some unusual character requires extra bytes. Of course, that raises the question of why such characters would be in the db for this assignment in the first place.
As always, your help is greatly appreciated.
Michael Morris
 
ranger
Posts: 17347
I used the default encoding. Since the default is UTF-8 and everyone was using UTF-8 anyway, I saw no reason to type it in explicitly.
Mark
 
Michael Morris
Ranch Hand
Posts: 3451
Thanks Mark. Good point.
Michael Morris
 
Adam Caldwell
Greenhorn
Posts: 17
I used US-ASCII and defended it as a design decision. The key points to consider are:
1. What is the existing data in the DB?
2. What happens if an international character is entered (especially near the edge of the field length limit)?
3. How would you fix it to handle international characters properly (i.e. the field length is used as a byte count, which is a Bad Idea(tm))?
 
Michael Morris
Ranch Hand
Posts: 3451
Hi Adam,
I'm certainly not trying to dispute what you've done, nor say that it's wrong, since I am no expert on Unicode. But considering that the Data class constructors use readUTF() and writeUTF() to read and write the schema, why not use the same encoding when reading and writing the records?
I appreciate your help and opinion.
Michael Morris
[ March 27, 2002: Message edited by: Michael Morris ]
 
Adam Caldwell
Greenhorn
Posts: 17
The field length is a fixed number of bytes based on FieldInfo.getLength(). A Java char is 2 bytes (one UTF-16 code unit). UTF-8 is 1-3 bytes per character for anything a single char can hold.
Now the code to write the data to the file says:
space = description[i].getLength();
size = newData[i].length();
toCopy = (size <= space) ? size : space;
toCopy is used as a number of bytes, not characters.
So when you do the newData[i].getBytes(0, toCopy, buffer, offset); (or whatever you replace this with), it ends up copying that many bytes.
Now, if the code you replace it with uses UTF-8 and toCopy is still computed from newData[i].length() characters, but the last character requires a multi-byte encoding, you'll end up truncating that final character into something that isn't valid UTF-8 when you try to read it back in.
Does that make sense?
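To make the truncation concrete, here is a minimal sketch. It is not the assignment's code: the class and the writeField/readField helpers are made up for illustration, and it assumes a Java version that has java.nio.charset.StandardCharsets (Java 7 or later). It encodes a string as UTF-8, copies only as many bytes as the field holds, and reads the field back.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the assignment's Data class.
final class Utf8TruncationDemo {

    /** Encodes value as UTF-8 and copies at most fieldLength bytes into a space-padded field. */
    static byte[] writeField(String value, int fieldLength) {
        byte[] encoded = value.getBytes(StandardCharsets.UTF_8);
        byte[] field = new byte[fieldLength];
        Arrays.fill(field, (byte) ' ');
        int toCopy = Math.min(encoded.length, fieldLength); // a byte count
        System.arraycopy(encoded, 0, field, 0, toCopy);
        return field;
    }

    /** Decodes the whole field as UTF-8; malformed bytes become U+FFFD. */
    static String readField(byte[] field) {
        return new String(field, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String value = "Caf\u00e9";             // 4 characters, but 5 bytes in UTF-8
        byte[] field = writeField(value, 4);    // the second byte of the last character is cut off
        String back = readField(field);
        System.out.println(value.equals(back));            // prints false
        System.out.println(back.indexOf('\uFFFD') >= 0);   // prints true
    }
}
```

The field is four bytes wide and the string is four characters long, so it looks like it should fit, yet the last character comes back as the replacement character.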
BTW, you should document that there is a bug in that code anyway... It needs to store the length; otherwise, when you read the string back in, it's possible you will lose real information... The standard "fix" people suggest is to add .trim(), but what if the data contains trailing white space that is supposed to be there? What is the correct way of fixing it?
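One way to picture "store the length" is the sketch below. The layout is an assumption made up for this example (one length byte in front of the data bytes, so fields of at most 255 bytes), and it would change the file format, which the assignment may not allow. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical layout: [1 length byte][fieldLength data bytes, space padded].
final class LengthPrefixedField {

    static byte[] write(String value, int fieldLength) {
        byte[] encoded = value.getBytes(StandardCharsets.US_ASCII);
        int used = Math.min(encoded.length, fieldLength);
        byte[] field = new byte[fieldLength + 1];
        Arrays.fill(field, (byte) ' ');
        field[0] = (byte) used;                       // real length, assumes fieldLength <= 255
        System.arraycopy(encoded, 0, field, 1, used);
        return field;
    }

    static String read(byte[] field) {
        int used = field[0] & 0xFF;                   // undo the signed-byte conversion
        return new String(field, 1, used, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String original = "foo  ";                    // two trailing spaces that matter
        byte[] field = write(original, 8);
        String viaTrim = new String(field, 1, 8, StandardCharsets.US_ASCII).trim();
        System.out.println("[" + viaTrim + "]");      // prints [foo]
        System.out.println("[" + read(field) + "]");  // prints [foo  ]
    }
}
```

With the stored length, padding and real trailing spaces can be told apart. With .trim() they cannot.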
-Adam
 
Ranch Hand
Posts: 883
Generally, I hate answering a question with a question, but can you give me a reason why the trailing white space might be important?
If you display a field containing trailing white space to the user and they copy what they see as search criteria, then the search will fail since they can't see the spaces, and they probably can't type in whatever code is necessary to generate the control characters that show up as a box. Even if they can enter a code, how would they know which code-that-is-displayed-as-a-box to enter, since there may be more than one?
I tend to think that trailing whitespace is noise that should be removed. The only time it might be important is in a fixed width field - in which case I'd strip it off when using the data and pad with spaces (if needed) only when updating the field's contents in the database.
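A minimal sketch of that pad-on-write, strip-on-read approach follows. The class and method names are made up for illustration. Stripping only trailing spaces with a hand-written loop, rather than calling trim(), is an assumption of this sketch so that leading whitespace is left alone.

```java
// Hypothetical helper, not part of the assignment's classes.
final class FixedWidthField {

    /** Pads with trailing spaces (or truncates) to exactly width characters for storage. */
    static String pad(String value, int width) {
        StringBuilder sb = new StringBuilder(width);
        sb.append(value, 0, Math.min(value.length(), width));
        while (sb.length() < width) {
            sb.append(' ');
        }
        return sb.toString();
    }

    /** Removes trailing spaces only; leading whitespace is kept. */
    static String stripTrailing(String stored) {
        int end = stored.length();
        while (end > 0 && stored.charAt(end - 1) == ' ') {
            end--;
        }
        return stored.substring(0, end);
    }

    public static void main(String[] args) {
        String stored = pad("foo", 6);
        System.out.println("[" + stored + "]");                 // prints [foo   ]
        System.out.println("[" + stripTrailing(stored) + "]");  // prints [foo]
    }
}
```

Note that a value written as "foo " still comes back as "foo", which is exactly the behavior being debated here.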
Burk

Originally posted by Adam Caldwell:
BTW, you should document that there is a bug in that code anyway... It needs to store the length; otherwise, when you read the string back in, it's possible you will lose real information... The standard "fix" people suggest is to add .trim(), but what if the data contains trailing white space that is supposed to be there? What is the correct way of fixing it?
-Adam

 
Adam Caldwell
Greenhorn
Posts: 17
I personally think that any database that doesn't give me back what I put into it is inherently broken.
That being said, I cheezed out on the assignment and just used .trim()
-Adam
 
Michael Morris
Ranch Hand
Posts: 3451
Hi Adam,
I appreciate the feedback, but coming from a UNIX command-line background I have always trimmed whitespace from parsed data. Why would leading or trailing whitespace be significant in a database or any input? I think I'm going to go with the majority here and stick with UTF-8. Actually, I'm going with the default for the deprecated methods, which as Mark pointed out is UTF-8. I will simply document the remote possibility of data loss from certain characters which, for this assignment, will probably never be used.
Michael Morris
 
Adam Caldwell
Greenhorn
Posts: 17
I come from the Unix command-line world too. I expect that if I put 'foo ' on the command line, the trailing space will be preserved. If I put foo bar unquoted with extra spaces between the words, I expect the shell to get rid of the extra space.
Here's a real world example though. Say you were storing paragraphs of a book in a database, and you put in " This is the first paragraph." You would want those leading spaces preserved wouldn't you? Well .trim() removes spaces from both ends of the string.
And btw, the encoding used by the deprecated methods is NOT UTF-8. On write, the high byte of each char is stripped off; on read, it is replaced with 0. That particular mapping is the same as UTF-8 for any character below 128... which just so happens to be the same as for US-ASCII.
Either answer is correct for reading the stuff back in, the question becomes what happens when you write stuff out. Under the US-ASCII encoding, any non-ASCII characters get converted to ?, under UTF-8, you get the multiple byte encoding problem.
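The three behaviors on write can be compared side by side. This is only a sketch: lowByteOnly is a hypothetical stand-in that mimics what the deprecated String.getBytes(int, int, byte[], int) does (keep the low 8 bits of each char), and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class comparing the three ways of turning chars into bytes.
final class EncodingComparison {

    /** Keeps only the low 8 bits of each char, like the deprecated getBytes overload. */
    static byte[] lowByteOnly(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = (byte) s.charAt(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "\u20ac5";  // euro sign (U+20AC) followed by '5'

        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] low = lowByteOnly(s);

        System.out.println((char) ascii[0]);                   // prints ?
        System.out.println(utf8.length);                       // prints 4 (3 bytes for the euro sign, 1 for '5')
        System.out.println(Integer.toHexString(low[0] & 0xFF)); // prints ac
    }
}
```

For plain ASCII input all three produce identical bytes, so the choice only matters once a character of 128 or above shows up.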
I personally don't like either solution, and I said as much in my design decisions, noting that I could not fix the problem given the requirements.
-Adam
 
Michael Morris
Ranch Hand
Posts: 3451
Hi Adam,
Good point on the ASCII encoding. I agree there are indeed cases where the preservation of leading and trailing whitespace is a necessity. But those instances are generally boundary cases and can be dealt with when required. Anyway I appreciate your help on this and you may yet convince me to use ASCII. I'm not in love with any of the options. I can remember when life was so easy, when the only options were EBCDIC (how many here can remember this one?) and ASCII. The down side was disco was alive and well!
Thanks again Adam,
Michael Morris
 