Mark: default encoding for deprecated methods

Michael Morris
Ranch Hand

Joined: Jan 30, 2002
Posts: 3451
Mark,
I believe I read somewhere that you kinda went against the grain and used the default encoding instead of UTF-8. That's what I want to do, but I'm not sure how to defend it. As someone else pointed out, UTF-8 can get sticky on fixed-width fields if you have some weird character(s) that require(s) extra bytes. Of course, that raises the question: why would you have such characters in the db for this assignment in the first place?
As always your help is greatly appreciated
Michael Morris


Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius - and a lot of courage - to move in the opposite direction. - Ernst F. Schumacher
Mark Spritzler
ranger
Sheriff

Joined: Feb 05, 2001
Posts: 17259

I used the default, since the default is UTF-8, and since everyone was using it, I thought why even have to type it in when that is what it is already.
Mark


Michael Morris
Ranch Hand

Joined: Jan 30, 2002
Posts: 3451
Thanks Mark. Good point.
Michael Morris
Adam Caldwell
Greenhorn

Joined: Mar 27, 2002
Posts: 17
I used US-ASCII and defended it as a design decision. The key points to consider are:
1. What's the existing data in the DB?
2. What if some international character is entered (especially near the edge of the field length limit)?
3. How would you fix it to handle international characters properly (i.e. the field length is used as a byte count, which is a Bad Idea(tm))?
Michael Morris
Ranch Hand

Joined: Jan 30, 2002
Posts: 3451
Hi Adam,
I'm certainly not trying to dispute what you've done nor say that it's wrong, since I am no expert on Unicode, but considering that the Data class constructors use readUTF() and writeUTF() to read/write the schema, why not use the same encoding when reading and writing the records?
I appreciate your help and opinion.
Michael Morris
[ March 27, 2002: Message edited by: Michael Morris ]
Adam Caldwell
Greenhorn

Joined: Mar 27, 2002
Posts: 17
The field length is a fixed number of bytes based on FieldInfo.length(). Unicode, as Java stores it in a char, is 2 bytes per character. UTF-8 is 1-3 bytes per character (for anything that fits in a single char).
Now the code to write the data to the file says:
space = description[i].getLength();
size = newData[i].length();
toCopy = (size <= space) ? size : space;
The toCopy is a number of BYTEs, not characters.
So when you do the newData[i].getBytes(0, toCopy, buffer, offset); (or whatever you replace this with), it ends up copying that many bytes.
Now, if the code you replace it with uses UTF-8 and the string you're trying to copy ends up being newData[i].length() characters, but the last character is a character that requires a multi-byte encoding, you'll end up truncating that final character into something that isn't valid in UTF-8 when you try to read it back in.
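To make the truncation concrete, here is a minimal sketch. It is not the assignment's code: the class name and both helpers are made up for illustration, and it assumes the field is cut at a byte count the way toCopy is used above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8TruncationDemo {

    // Hypothetical helper: keep at most fieldLength bytes of the UTF-8
    // encoding, mimicking a byte-count cut at the edge of a fixed-width field.
    static byte[] truncateToField(String value, int fieldLength) {
        byte[] encoded = value.getBytes(StandardCharsets.UTF_8);
        int toCopy = Math.min(encoded.length, fieldLength);
        return Arrays.copyOf(encoded, toCopy);
    }

    // Strict decode: returns false when the bytes are not well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String value = "Caf\u00e9";               // 4 chars, but 5 bytes in UTF-8
        byte[] field = truncateToField(value, 4); // cuts the 2-byte e-acute in half

        System.out.println(value.length());       // 4
        System.out.println(isValidUtf8(field));   // false
        // The lenient String constructor substitutes U+FFFD for the broken tail.
        System.out.println(new String(field, StandardCharsets.UTF_8).endsWith("\uFFFD")); // true
    }
}
```

The string fits the field when counted in characters, yet the stored bytes no longer decode cleanly, which is the problem described above.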
Does that make sense?
BTW, you should document that there is a bug in that code anyway... It needs to store the length; otherwise, when you read the string back in, it's possible you will lose real information... The standard "fix" people suggest is to add .trim(), but what if the data contains trailing white space that is supposed to be there? What is the correct way of fixing it?
-Adam
Burk Hufnagel
Ranch Hand

Joined: Oct 01, 2001
Posts: 814
Generally, I hate answering a question with a question, but can you give me a reason why the trailing white space might be important?
If you display a field containing trailing white space to the user and they copy what they see as search criteria, then the search will fail since they can't see the spaces - and they probably can't type in whatever code is necessary to generate the control characters that show up as a box. Even if they can enter a code, how would they know which code-that-is-displayed-as-a-box to enter, since there may be more than one?
I tend to think that trailing whitespace is noise that should be removed. The only time it might be important is in a fixed width field - in which case I'd strip it off when using the data and pad with spaces (if needed) only when updating the field's contents in the database.
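A rough sketch of that pad-on-write, strip-on-read idea follows. The class and method names are hypothetical, and it assumes a one-byte-per-character encoding so that character counts and byte counts agree.

```java
public class FixedWidthField {

    // Pad with spaces up to width; cut the value if it is too long.
    static String pad(String value, int width) {
        if (value.length() >= width) {
            return value.substring(0, width);
        }
        StringBuilder sb = new StringBuilder(value);
        while (sb.length() < width) {
            sb.append(' ');
        }
        return sb.toString();
    }

    // Remove trailing spaces only, leaving leading spaces untouched.
    static String stripTrailingSpaces(String stored) {
        int end = stored.length();
        while (end > 0 && stored.charAt(end - 1) == ' ') {
            end--;
        }
        return stored.substring(0, end);
    }

    public static void main(String[] args) {
        String stored = pad("  indented", 16);
        System.out.println("[" + stored + "]");                      // [  indented      ]
        System.out.println("[" + stripTrailingSpaces(stored) + "]"); // [  indented]
        System.out.println("[" + stored.trim() + "]");               // [indented]
    }
}
```

Stripping only the tail keeps leading spaces that trim() would discard, but trailing spaces that were part of the original value still cannot be told apart from padding unless the length is stored somewhere.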
Burk
Originally posted by Adam Caldwell:
BTW, you should document that there is a bug in that code anyway... It needs to store the length; otherwise, when you read the string back in, it's possible you will lose real information... The standard "fix" people suggest is to add .trim(), but what if the data contains trailing white space that is supposed to be there? What is the correct way of fixing it?
-Adam


SCJP, SCJD, SCEA 5 "Any sufficiently analyzed magic is indistinguishable from science!" Agatha Heterodyne (Girl Genius)
Adam Caldwell
Greenhorn

Joined: Mar 27, 2002
Posts: 17
I personally think that any database that doesn't give me back what I put into it is inherently broken.
That being said, I cheezed out on the assignment and just used .trim()
-Adam
Michael Morris
Ranch Hand

Joined: Jan 30, 2002
Posts: 3451
Hi Adam,
I appreciate the feedback, but coming from a UNIX command-line background I have always trimmed whitespace from parsed data. Why would external whitespace be significant in a database or any input? I think I'm going to go with the majority here and stick with UTF-8. Actually, I'm going with the default for the deprecated methods, which as Mark pointed out is UTF-8. I will simply document the remote possibility of data loss from certain characters which, for this assignment, will probably never be used.
Michael Morris
Adam Caldwell
Greenhorn

Joined: Mar 27, 2002
Posts: 17
I come from the unix command line world too. I expect that if I put 'foo ' on the command line, the trailing space will be preserved. If I put foo    bar (unquoted, with extra spaces in between), I expect the shell to get rid of the extra spaces.
Here's a real world example though. Say you were storing paragraphs of a book in a database, and you put in "    This is the first paragraph." You would want those leading spaces preserved, wouldn't you? Well, .trim() removes spaces from both ends of the string.
And btw, the encoding used in the deprecated method is NOT UTF-8. It is stripping the high byte off each character and replacing it with 0. That particular mapping is the same as UTF-8 for any character up to 127... which just so happens to be the same as for US-ASCII.
Either answer is correct for reading the stuff back in, the question becomes what happens when you write stuff out. Under the US-ASCII encoding, any non-ASCII characters get converted to ?, under UTF-8, you get the multiple byte encoding problem.
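The three behaviours can be compared side by side with a small sketch. The class and helper names are invented for illustration, and lowByteOnly only mimics what the deprecated String.getBytes(int, int, byte[], int) is documented to do (copy the 8 low-order bits of each char); it does not call the deprecated method.

```java
import java.nio.charset.StandardCharsets;

public class EncodingComparison {

    // Mimics the deprecated getBytes: keep the low 8 bits of each char,
    // silently dropping the high byte.
    static byte[] lowByteOnly(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = (byte) s.charAt(i);
        }
        return out;
    }

    // Render bytes as space-separated uppercase hex for easy comparison.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "A\u20ac"; // 'A' followed by the euro sign, U+20AC

        System.out.println(hex(lowByteOnly(s)));                        // 41 AC
        System.out.println(hex(s.getBytes(StandardCharsets.US_ASCII))); // 41 3F
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));    // 41 E2 82 AC
    }
}
```

For plain 'A' all three produce the same byte (41). For the euro sign, the low-byte copy keeps an unrelated byte, US-ASCII substitutes '?' (3F), and UTF-8 needs three bytes.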
I personally don't like either solution, and I said as much in my design decisions, noting that I could not fix the problem given the requirements.
-Adam
Michael Morris
Ranch Hand

Joined: Jan 30, 2002
Posts: 3451
Hi Adam,
Good point on the ASCII encoding. I agree there are indeed cases where the preservation of leading and trailing whitespace is a necessity. But those instances are generally boundary cases and can be dealt with when required. Anyway, I appreciate your help on this, and you may yet convince me to use ASCII. I'm not in love with any of the options. I can remember when life was so easy, when the only options were EBCDIC (how many here can remember this one?) and ASCII. The downside was disco was alive and well!
Thanks again Adam,
Michael Morris