NX: URLyBird 1.2.1 - Character encoding

Philippe Maquet
Bartender

Joined: Jun 02, 2003
Posts: 1872
Hi everybody,
We recently discussed character encoding in thread NX URLyBird 1.3.2: Extracting null terminated strings from ByteBuffer. As it was a little off-topic there, I come back here with the character encoding issue.
To sum up :
  • Jim Yingst favored putting the charset name in a constant and using the Charset class to encode() and decode() field values.

  • I favored putting the charset name in a variable and I used the String.getBytes(charsetName) method to encode and the String constructor which accepts a charsetName as argument to decode.


  • In the meantime, I followed Jim's advice and I also now use the Charset class.
    But I still prefer the variable to the constant: the only advantage of a constant is that it gives you one easy place to find if you ever need to change it. But if the case arises, it's highly probable that you'll need to convert an existing database file. In my design, Data offers all the tools you need to convert a data file from one encoding to another, while with a constant you'd need to write a separate conversion application. Here is how Data could be used to make such a conversion:
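    [The snippet that followed was lost when the thread was archived. As a hedged stand-in - it does not use the assignment's Data API, whose shape we don't know - here is a minimal sketch of such a conversion using only standard java.io classes, which works for any fixed single-byte encoding:]

```java
import java.io.*;

public class ConvertDataFile {

    /**
     * Rewrites a text database file from one encoding to another.
     * A real Data-based converter would copy record by record; this
     * sketch simply re-encodes the whole character stream.
     */
    static void convert(File in, String fromCharset,
                        File out, String toCharset) throws IOException {
        Reader reader = new InputStreamReader(new FileInputStream(in), fromCharset);
        Writer writer = new OutputStreamWriter(new FileOutputStream(out), toCharset);
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            writer.write(buf, 0, n);
        }
        reader.close();
        writer.close();
    }

    public static void main(String[] args) throws IOException {
        // Build a small US-ASCII source file, then re-encode it.
        File src = File.createTempFile("db-ascii", ".db");
        File dst = File.createTempFile("db-latin1", ".db");
        Writer w = new OutputStreamWriter(new FileOutputStream(src), "US-ASCII");
        w.write("Palace Hotel");
        w.close();
        convert(src, "US-ASCII", dst, "ISO-8859-1");
        BufferedReader r = new BufferedReader(
                new InputStreamReader(new FileInputStream(dst), "ISO-8859-1"));
        System.out.println(r.readLine()); // prints "Palace Hotel"
        r.close();
    }
}
```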

    But I am still stuck with the character encoding because:
  • I now realize that my instructions seem to be unclear
  • I must still decide which encodings I may support



  • 1) My instructions say:
    The character encoding is 8 bit US ASCII.

    What is "8 bit US ASCII" ? In Charset doc, I read :
    US-ASCII Seven-bit ASCII, a.k.a. ISO646-US, a.k.a. the Basic Latin block of the Unicode character set

    7 bits! And indeed, when I test "US-ASCII", French accented characters are not recognized. How should I interpret their "8 bit US ASCII" instruction then?
    2) Which encodings may we support?
    I think that the tests done in Charset.forName() are not sufficient. As the file format is based on fixed-length records, there are supported encodings we may not use: those with a non-constant bytes-per-character ratio (UTF-8, for example).
    It's possible to filter the available charsets to only retain those which have a fixed 2/1 or 1/1 such ratio, but isn't it going too far in this assignment ?
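    [For what it's worth, the filtering itself is only a few lines. A sketch, assuming we accept a charset only when its encoder reports a fixed one-byte-per-char ratio:]

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.Iterator;
import java.util.SortedMap;

public class FixedWidthCharsets {

    /** True if every character this charset can encode takes exactly one byte. */
    static boolean isSingleByte(Charset cs) {
        if (!cs.canEncode()) {
            return false; // decode-only charsets are useless for writing records
        }
        CharsetEncoder enc = cs.newEncoder();
        return enc.maxBytesPerChar() == 1.0f && enc.averageBytesPerChar() == 1.0f;
    }

    public static void main(String[] args) {
        // Lists the installed charsets usable with fixed-length records.
        SortedMap charsets = Charset.availableCharsets();
        for (Iterator it = charsets.values().iterator(); it.hasNext();) {
            Charset cs = (Charset) it.next();
            if (isSingleByte(cs)) {
                System.out.println(cs.name()); // includes US-ASCII and ISO-8859-1
            }
        }
    }
}
```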
    Thanks in advance for your comments, I am stuck here !
    Regards,
    Phil.
    Jim Yingst
    Wanderer
    Sheriff

    Joined: Jan 30, 2000
    Posts: 18671
    I have no objection to using a variable rather than a constant - I chose the constant as the simplest option since the current application isn't required to support multiple encodings, but if you want to offer that option, go for it. (Especially since you're in Europe and the limitations of US-ASCII are more annoying for you than for me.)
    I think that 7-bit US-ASCII and 8-bit US-ASCII are the same thing, at least as referenced in Charset. US-ASCII only uses bits 0-6; bit 7 is there, but it's always 0. We need only 7 bits, but use 8 anyway. It would be possible to send US-ASCII using only 7 bits per char (or 7 bytes per 8 chars) but that isn't what Charset does.
    Essentially by saying US-ASCII the instructions are saying "the DB file does not support funny European characters". Even though it would be easy to do so. :roll:
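    [Jim's reading is easy to check against the Charset API: the US-ASCII encoder rejects anything above 0x7F, while each encoded character still occupies a full byte:]

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class AsciiLimits {
    public static void main(String[] args) throws Exception {
        CharsetEncoder ascii = Charset.forName("US-ASCII").newEncoder();
        System.out.println(ascii.canEncode('A'));      // prints "true"
        System.out.println(ascii.canEncode('\u00e9')); // prints "false": é is beyond 7 bits
        // "8 bit" in the instructions refers to storage: one full byte per
        // char, with the high bit always 0.
        System.out.println("Hi".getBytes("US-ASCII").length); // prints "2"
    }
}
```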
    As the file format is based on fixed-length records, there are supported encodings we may not use : the ones which have a non constant number_of_bytes/number_of_characters ratio (UTF-8 for example).
    Or any ratio other than 1, I think. E.g. UTF-16 is also out, unless you want to do additional work to figure out how to validate the length of an input to make sure its encoded length is within the available space. It can be done, but it's more conceptual overhead. E.g. users may be confused why they were normally allowed 16 chars for one field, but when they switch to UTF-16 it's 8, and when they use UTF-8 and insert an é they can only have 15. Easiest to just require that the encoding is 1-per-char, I think.
    It's possible to filter the available charsets to only retain those which have a fixed 2/1 or 1/1 such ratio, but isn't it going too far in this assignment ?
    Seems like too much work to me, but if you do want the charset to be user-configurable you kind of need to do this. Or you could provide a preset list of approved charsets to choose from.


    "I'm not back." - Bill Harding, Twister
    Philippe Maquet
    Bartender

    Joined: Jun 02, 2003
    Posts: 1872
    Hi Jim,
    Thanks Jim for your reply, it really helped me.
    Seems like too much work to me, but if you do want the charset to be user-configurable you kind of need to do this.

    I agree: it's typically an "all or nothing" thing.
    The only things in the instructions which may motivate me are :
  • US ASCII encoding is explicitly quoted in the instructions. It means that they are "encoding-aware", and that it's not foolish to imagine some future change.
  • After their first try in Java, they intend to move the application to the web, and I've heard that there are plenty of funny characters out there (Asian ones are even more exotic for us than European ones).
  • Thanks to its Unicode internal representation of strings, Java is the platform of choice to handle foreign characters (I had to handle them in Delphi so I am very conscious of the easiness Java offers).
  • I've spent so much time discovering and understanding the technologies involved that the few lines of code needed to implement them now seem a no-cost effort.


  • E.g. UTF-16 is also out, unless you want to do additional work to figure out how to validate the length of an input to make sure its encoded length is within the available space. It can be done, but it's more conceptual overhead. E.g. users may be confused why they were normally allowed 16 chars for one field, but when they switch to UTF-16 it's 8, and when they use UTF-8 and insert an é they can only have 15. Easiest to just require that the encoding is 1-per-char, I think.

    From the client point of view, field lengths must stay what they are: implicitly expressed in characters. If you use a "2 bytes/1 character" encoding, it's not a big deal to transparently handle conversions between byte lengths and character lengths. The overhead should be negligible IMO.
    Cheers and thanks again,
    Phil.
    Jim Yingst
    Wanderer
    Sheriff

    Joined: Jan 30, 2000
    Posts: 18671
    From the client point of view, field lengths must stay what they are: implicitly expressed in characters. If you use a "2 bytes/1 character" encoding, it's not a big deal to transparently handle conversions between byte lengths and character lengths. The overhead should be negligible IMO.
    Not sure which type of overhead you mean, so I'll clarify my own words. When I referred to "conceptual overhead" I didn't mean anything to do with performance or how much code must be written, but with increased complexity in understanding how the code works (either for users or other programmers).
    So - let's say we have a "name" field which has 16 bytes of storage in the DB file. Using US-ASCII, this means we can tell the user that the field length is 16, right? Now let's say he restarts the program to use UTF-16. Hurm, well for starters any old data that was stored in US-ASCII is now going to look like crap. But let's say he ignores this and only concentrates on creating some new data. If he tries to enter more than 8 chars for the name, we now need to inform the user that the field length is effectively 8, right? More than that, we don't have space in the file for storage. From the user's point of view, this is an additional complication - why did the length change? Seems weird. Perhaps, if we wanted to handle all possible encodings, we should have limited the length to 8 chars in the first place? That would be rather irritating to users who just wanted to use US-ASCII in the first place, and would like 16 chars for the field. Furthermore, 2 bytes per char may not be enough. If we did want to support UTF-8 (which I believe [i]is[/i] possible, but even more work) then many Asian languages end up taking 3 bytes per char. That's the worst-case scenario I'm aware of, but it's possible that some other encoding offers an even worse ratio. Should the name field be limited to 5 chars on the off chance that they'll be Asian-language chars encoded in UTF-8? Ugh.
    I'm thinking that some user configurability would be nice here. But I think anything other than single-byte encodings is too much trouble for now. And changing encoding isn't an option that needs to be offered to every user, as it's not to be undertaken lightly. You don't want to have a DB file in which some older records used one encoding and newer records use a different one. So I'd keep this option out of any GUI config screen that's accessible to the average user. Maybe just let encoding be configured by editing the props file. (I know, the user shouldn't be required to do this - but in the requirements they're not expected to change encoding anyway; I don't see a problem here.) Making this configurable only from the props file limits accessibility to people willing to get their hands dirty - these people are (hopefully) a little more trustworthy and knowledgeable than Joe User, and if not, well, it's their own fault for editing a props file without understanding the consequences. I'd include comments in the file explaining that the encoding must be one char per byte, and warning that existing data may no longer be interpreted correctly if the encoding is changed. (Unfortunately comments are lost by the store() method of Properties, so this should also be documented somewhere else.)
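    [As a sketch of that props-file approach - the db.encoding key is invented for illustration, not something the instructions mandate:]

```java
import java.io.*;
import java.util.Properties;

public class EncodingConfig {
    public static void main(String[] args) throws IOException {
        // Simulate a hand-edited properties file containing the encoding.
        File file = File.createTempFile("suncertify", ".properties");
        FileWriter out = new FileWriter(file);
        out.write("db.encoding=ISO-8859-1\n");
        out.close();

        // Load it, falling back to the documented default of US-ASCII.
        Properties props = new Properties();
        FileInputStream in = new FileInputStream(file);
        props.load(in);
        in.close();
        String encoding = props.getProperty("db.encoding", "US-ASCII");
        System.out.println(encoding); // prints "ISO-8859-1"
    }
}
```

    [Note that Properties.store() would indeed drop any warning comments in the file, which is why the one-byte-per-char restriction also needs to live in the javadoc.]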
    Good discussion - thanks!
    Vlad Rabkin
    Ranch Hand

    Joined: Jul 07, 2003
    Posts: 555
    Hi,
    I agree with Jim, that is what I have done.
    Vlad
    Philippe Maquet
    Bartender

    Joined: Jun 02, 2003
    Posts: 1872
    Hi Jim,
    Not sure which type of overhead you mean, so I'll clarify my own words. When I referred to "conceptual overhead" I didn't mean anything to do with performance or how much code must be written, but with increased complexity in understanding how the code works (either for users or other programmers).

    In a Field class, is
    [getLength() code]
    of so much increased complexity?
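    [The snippet itself was lost in archiving; judging from Jim's reply further down, it was a getLength() method. A hypothetical reconstruction - the field names and the bytesPerChar parameter are guesses, not Philippe's actual code:]

```java
// Hypothetical reconstruction of the lost snippet.
public class Field {
    private int byteLength;    // storage length from the DB file header
    private int bytesPerChar;  // 1 for US-ASCII/ISO-8859-1, 2 for UTF-16, etc.

    public Field(int byteLength, int bytesPerChar) {
        this.byteLength = byteLength;
        this.bytesPerChar = bytesPerChar;
    }

    /** Field length as seen by clients, expressed in characters. */
    public int getLength() {
        return byteLength / bytesPerChar;
    }
}
```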
    So - let's say we have a "name" field which has 16 bytes of storage in the DB file. Using US-ASCII, this means we can tell the user that the field length is 16, right? Now let's say he restarts the program to use UTF-16. Hurm, well for starters any old data that was stored in US-ASCII is now going to look like crap. But let's say he ignores this and only concentrates on creating some new data. If he tries to enter more than 8 chars for the name, we now need to inform the user that the field length is effectively 8, right? More than that, we don't have space in the file for storage. From the user's point of view, this is an additional complication - why did the length change? Seems weird.

    What you seem to miss here is that changing the encoding of a table supposes that you convert the table, of course. And if you move from a 1 byte per char to a 2 bytes per char encoding, you need to multiply all string field lengths by 2 (as they are expressed in bytes in the header). It means that a "name" field which has 16 bytes of storage in the source table will get 32 bytes of storage in the converted table.
    So I'd keep this option out of any GUI config screen that's accessible to the average user. Maybe just let encoding be configured by editing the props file. (I know, the user shouldn't be required to do this - but in the requirements they're not expected to change encoding anyway; I don't see a problem here.)

    I fully agree with you here.
    I'd include comments in the file explaining that the encoding must be one char per byte, and warning that existing data may no longer be interpreted correctly if the encoding is changed. (Unfortunately comments are lost by the store() method of Properties, so this should also be documented somewhere else.)

    I think a better place would be the Data javadoc documentation. I would explain :
  • that a given encoding cannot be changed on a given table without converting it
  • that such a conversion is easy to perform using the Data class itself, giving some code snippet as an example of such a conversion.


  • Cheers,
    Phil.
    Jim Yingst
    Wanderer
    Sheriff

    Joined: Jan 30, 2000
    Posts: 18671
    In a Field class, is
    [getLength() code]
    of so much increased complexity ?

    No. But as you see later, I was talking about a different form of the solution, in which field lengths (in chars) change when the encoding is changed, while you're talking about changing the DB file's field lengths in bytes to keep the char length constant. So for your version, ignore most of my previous comments. For "my version" (which I don't endorse, but it's the version I was discussing) - the code still needn't be that complex, but the fact that the length in chars changes will create confusion, especially for end users but also for other programmers.
    What you seem to miss here is that changing the encoding of a table supposes that you convert the table of course.
    OK, interesting. This seems to solve most of the objections I had, but creates new ones. Let's see - this still wouldn't work well for a variable-length encoding like UTF-8, but you already discounted that possibility, so we're in agreement there. It seems the big problem is the conversion process. A US-ASCII DB file is unreadable if you assume UTF-16, so you'd need a conversion program to make a new DB file. Not that this is too difficult, but it is more work. I'd say that unless you actually provide a working conversion program, the ability to reconfigure the encoding to UTF-16 is useless. Even if you have a conversion program, there's the issue of how the user knows when to run it. I suppose it would best be run from the same GUI screen where you configure the encoding in the first place. The moment you change the encoding, maybe a popup should say "changing encoding will force the DB file to be reformatted - are you sure you want to do this now?" Maybe it should have the option of making a new DB file using the new encoding, rather than replacing the old one. Hmmm...
    Most of this discussion has gone way beyond what I think is useful or necessary to actually implement for the assignment, but I find it interesting to speculate about future enhancements nonetheless. Let me know if you do put something like this in your program - I'd be interested to hear how it goes.
    Backing up a bit though - my original motive for making the encoding somewhat flexible was that there are a couple of possible situations which I consider to be particularly likely:
    (1) B&S decide to switch to ISO-8859-1 for the benefit of international customers.
    (2) B&S determine that their DB files already contain some ISO-8859-1 chars, even though it was claimed that the files are US-ASCII.
    In the first case, there's no need to convert the chars in the existing DB file, because all US-ASCII chars are also valid in ISO-8859-1. In the second case, we wouldn't want any conversion anyway, since the point is that the chars are already in ISO-8859-1. So there's no need for any conversion of files. Just start using the new encoding. Where I say ISO-8859-1 you can replace it with any other 8-bit "extended ASCII" encoding, such as Microsoft's Cp-1252. It's easy to go from US-ASCII to other ASCII-based schemes - but once you've left US-ASCII, it's hard to go from one extended ASCII to another; some characters may not be available at all. So B&S have one opportunity to move to a better encoding; hopefully they'll choose the right one. I'm not interested in trying to fix things if they want to keep changing things after that.
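    [That compatibility claim is easy to demonstrate: bytes written as US-ASCII decode unchanged under any 8-bit ASCII superset, so switching the configured charset in that direction needs no file conversion at all:]

```java
public class AsciiSupersets {
    public static void main(String[] args) throws Exception {
        String text = "URLyBird";
        byte[] ascii = text.getBytes("US-ASCII");
        // The same bytes decode identically under any ASCII superset,
        // because bytes 0x00-0x7F map to the same characters in each.
        System.out.println(new String(ascii, "ISO-8859-1"));   // prints "URLyBird"
        System.out.println(new String(ascii, "windows-1252")); // prints "URLyBird"
    }
}
```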
    S. Ganapathy
    Ranch Hand

    Joined: Mar 26, 2003
    Posts: 194
    Hi All,
    This discussion is really interesting, good and informative. But introducing FileChannel in the assignment adds a bit more work. Of course, there is a lot of thread safety involved. If we use RAF instead, things would be much simpler, I feel. As they clearly mention the encoding is 8 bit US ASCII, it means they use 8 bits to represent a character. This is for text fields only, not for numbers. Numbers like the database cookie value follow the DataInput/DataOutput format. So only the record data (text) follows the 8 bit representation. RAF.readFully(byte[] b) reads each byte, so I feel it will serve our purpose.
    This is my opinion only. I don't dare to say much in this discussion. Jim is a guru anyway.
    I thought to use FileChannel(FC) initially, but I changed my thought, and implemented with RAF only.
    Moreover, I found using FC a bit slower.
    Comments are welcome.
    Regards,
    Ganapathy.
    Jim Yingst
    Wanderer
    Sheriff

    Joined: Jan 30, 2000
    Posts: 18671
    I think RAF is just fine for the assignment, though I prefer my own solution. RAF is a little simpler, but harder to extend if the encoding is ever changed. (And I really believe that US-ASCII is a limitation that someone would want to remove in the future.) It took me a little longer to figure out how to use the NIO classes because I hadn't used them before - but the final code is pretty simple, and has no performance issues I can see.
    I know that in the past I've had horrible performance from using RAF in projects, and replacing it with streams was preferable. Once I had to provide random access to records in a huge file, and I found that I got significantly better performance by throwing away the RAF and instead, for each record to be accessed, opening a new FileInputStream, using skip() to get to the desired offset, reading the record, and closing the stream. Surprising, but true. But this was around JDK 1.2 (on HP-UX if it matters) - it's quite possible RAF performance has improved since then. I wouldn't know; I just avoid it whenever possible. I used RAF to read the header since it was simpler, but switched to FileChannel to access the records. Come to think of it, I suppose that column names must still be in US-ASCII even if another encoding is used for the rest of the file - but that's not a big problem, IMO. I'll make sure it's documented though.
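    [For illustration, a minimal FileChannel-based record reader along those lines. The layout is invented for the example - fixed-length records starting at a known offset, with no header parsing - and the charset.decode() call keeps all decoding in one place, which is what makes the encoding easy to swap later:]

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;

public class RecordReader {
    private final FileChannel channel;
    private final Charset charset;
    private final int recordLength;
    private final long dataOffset; // where records begin, per the file header

    public RecordReader(FileChannel channel, Charset charset,
                        int recordLength, long dataOffset) {
        this.channel = channel;
        this.charset = charset;
        this.recordLength = recordLength;
        this.dataOffset = dataOffset;
    }

    /** Reads one fixed-length record; positional reads need no seek locking. */
    public String readRecord(int recNo) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(recordLength);
        long pos = dataOffset + (long) recNo * recordLength;
        while (buf.hasRemaining()) {
            if (channel.read(buf, pos + buf.position()) == -1) {
                throw new IOException("record " + recNo + " past end of file");
            }
        }
        buf.flip();
        return charset.decode(buf).toString();
    }

    public static void main(String[] args) throws IOException {
        // Two 8-byte records: "Palace  " and "Excelsio".
        java.io.File f = java.io.File.createTempFile("records", ".db");
        java.io.FileOutputStream out = new java.io.FileOutputStream(f);
        out.write("Palace  Excelsio".getBytes("US-ASCII"));
        out.close();
        FileChannel ch = new FileInputStream(f).getChannel();
        RecordReader reader =
                new RecordReader(ch, Charset.forName("US-ASCII"), 8, 0);
        System.out.println(reader.readRecord(1)); // prints "Excelsio"
        ch.close();
    }
}
```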
    S. Ganapathy
    Ranch Hand

    Joined: Mar 26, 2003
    Posts: 194
    Hi Jim,
    I too tried using FC, and it was simpler to use as well. Using FC has advantages too. I decided to use RAF for the assignment only.
    Instructions clearly say:
    All numeric values are stored in the header information use the formats of the DataInputStream and DataOutputStream classes. All text values, and all fields (which are text only), contain only 8 bit characters, null terminated if less than the maximum length for the field. The character encoding is 8 bit US ASCII.

    So, it means column names and the values in the size, rate, and owner fields are text values too, in US ASCII. They are not in the number format.
    If we use "more db.db > out", we can find the text values clearly and not the number formated values.
    I chose RAF to avoid this character encoding issue: as RAF.readFully(byte[] b) reads byte by byte, using a simple java.lang.String constructor I can get rid of this problem. I documented this of course.
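    [A sketch of that RAF approach, reading a null-terminated fixed-length field. One caveat: the bare String(byte[]) constructor uses the platform default charset, so it's safer to name the charset explicitly, as done here:]

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

public class RafFieldReader {

    /** Reads a fixed-length, null-padded text field at the current file position. */
    static String readField(RandomAccessFile raf, int length) throws IOException {
        byte[] raw = new byte[length];
        raf.readFully(raw);
        // With an 8-bit encoding each byte is one character; stop at the
        // first null terminator if the value is shorter than the field.
        int end = 0;
        while (end < length && raw[end] != 0) {
            end++;
        }
        return new String(raw, 0, end, "US-ASCII");
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("field", ".db");
        FileOutputStream out = new FileOutputStream(f);
        out.write(new byte[] {'Y', 'e', 's', 0, 0, 0, 0, 0}); // one 8-byte field
        out.close();
        RandomAccessFile raf = new RandomAccessFile(f, "r");
        System.out.println(readField(raf, 8)); // prints "Yes"
        raf.close();
    }
}
```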
    What is IMO?
    Regards,
    Ganapathy.
    Andrew Monkhouse
    author and jackaroo
    Marshal Commander

    Joined: Mar 28, 2003
    Posts: 11404
        

    Hi Ganapathy
    What is IMO

    IMO: In my opinion
    IMHO: In my humble opinion
    HTH: Hope this helps
    You will often find older people (sorry Jim and Max ) use these abbreviations, while younger people (in terms of how long they have been on the net) sometimes use leetspeak (I refuse to spell it their way) and/or SMS-style contractions (u = you, v = we ....).
    The use of all three has been discussed in more appropriate forums (JavaRanch and MD) from time to time.
    Regards, Andrew
    [ July 15, 2003: Message edited by: Andrew Monkhouse ]

    The Sun Certified Java Developer Exam with J2SE 5: paper version from Amazon, PDF from Apress, Online reference: Books 24x7 Personal blog
    S. Ganapathy
    Ranch Hand

    Joined: Mar 26, 2003
    Posts: 194
    Hahaha.
    Thank you Andrew.
    I was seriously trying to relate IMO to some class in Java.
    Jim Yingst
    Wanderer
    Sheriff

    Joined: Jan 30, 2000
    Posts: 18671
    You will often find older people (sorry Jim and Max ) use these abbreviations
    [sigh] Kids today, I tell you... :roll: Next you'll be asking what a FAQ is. RTFM, I tell you!
    Perhaps IMHO would've been more recognizable, but the H wouldn't really have been sincere. IMAO would often work though...
    Philippe Maquet
    Bartender

    Joined: Jun 02, 2003
    Posts: 1872
    Hi Jim,
    Most of this discussion has gone way beyond what I think is useful or necessary to actually implement for the assignment, but I find it interesting to speculate about future enhancements nonetheless.

    I fully agree with you.
    Even if you have a conversion program, there's the issue of how the user knows when to run it. I suppose it would best be run from the same GUI screen where you configure the encoding in the first place. The moment you change the encoding, maybe a popup should say "changing encoding will force the DB file to be reformatted - are you sure you want to do this now?" Maybe it should have the option of making a new DB file using the new encoding, rather than replacing the old one. Hmmm...
    Let me know if you do put something like this in your program - I'd be interested to hear how it goes.

    No I didn't, because I think it would go too far, against the YAGNI principle. But something which doesn't go against it, IMO, is trying to build a database system which will offer:
  • all the basic functionalities an application writer may expect from any basic db system, even if they are not expressly required in the assignment
  • all small building blocks needed if we want to be able to write future enhancements simply.


  • Just a few examples :
  • Our db system must support multiple tables by design (including the locking issue), even if URLyBird will use only one table in the beginning. Wouldn't it be too late to take it into consideration the day your IT manager asks you to add one table to the system?
  • We're not required to write a createTable() method: same comment.
  • It's highly probable they'll want to take a backup from time to time... And as they intend to move the system to the web, the backup should be able to run while people are still working (taking a "photograph" of the data while people go on writing to it). Another thing which is far easier to achieve if you take it into account from the beginning!
  • (...)


  • Now I agree with you when you say that supporting 8 bit character sets should be enough.
    I used RAF to read the header since it was simpler, but switched to FileChannel to access the records. Come to think of it, I suppose that column names must still be in US-ASCII even if another encoding is used for the rest of the file - but that's not a big problem, IMO. I'll make sure it's documented though.

    That's what I did too.
    Regards,
    Phil.
    [ July 16, 2003: Message edited by: Philippe Maquet ]
     