
NX: Bodgitt and Scarper - data file access caveats???

Timothy Johnson
Greenhorn

Joined: Sep 14, 2003
Posts: 13
Gurus,
I've read most of the posts here relating to reading and writing bytes to/from the data file. This is what I've come up with and I want to make sure that I'm not doing anything blatantly idiotic. First, I'll post the data file format and then my assumptions.
**** Data File Format Start ****
Start of file
4 byte numeric, magic cookie value identifies this as a data file
4 byte numeric, offset to start of record zero
2 byte numeric, number of fields in each record
Schema description section.
Repeated for each field in a record:
2 byte numeric, length in bytes of field name
n bytes (defined by previous entry), field name
2 byte numeric, field length in bytes
end of repeating block
Data section. (offset into file equal to "offset to start of record zero" value)
Repeat to end of file:
2 byte flag. 00 implies valid record, 0x8000 implies deleted record
Record containing fields in order specified in schema section, no separators between fields, each field fixed length at maximum specified in schema information
End of file
All numeric values stored in the header information use the formats of the DataInputStream and DataOutputStream classes. All text values, and all fields (which are text only), contain only 8 bit characters, null terminated if less than the maximum length for the field. The character encoding is 8 bit US ASCII.
**** Data File Format End ****
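For reference, the header layout above could be read roughly like this. It is only a sketch: the class names SchemaReader and Schema are made up for illustration, and it assumes the lengths are best treated as unsigned 16-bit values and that field names are plain ASCII. Because it takes a DataInput, the same method would accept either a RandomAccessFile or a DataInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the assignment: reads the file header and schema.
final class SchemaReader {

    /** Header values plus field name -> field length, kept in file order. */
    static final class Schema {
        int magicCookie;
        int recordZeroOffset;
        final Map<String, Integer> fieldLengths = new LinkedHashMap<>();
    }

    static Schema read(DataInput in) throws IOException {
        Schema schema = new Schema();
        schema.magicCookie = in.readInt();          // 4 byte numeric
        schema.recordZeroOffset = in.readInt();     // 4 byte numeric
        int fieldCount = in.readUnsignedShort();    // 2 byte numeric
        for (int i = 0; i < fieldCount; i++) {
            byte[] name = new byte[in.readUnsignedShort()];
            in.readFully(name);                     // blocks until the whole name is read
            int fieldLength = in.readUnsignedShort();
            schema.fieldLengths.put(new String(name, StandardCharsets.US_ASCII), fieldLength);
        }
        return schema;
    }

    /** Convenience overload for in-memory header bytes (handy in tests). */
    static Schema parse(byte[] header) {
        try {
            return read(new DataInputStream(new ByteArrayInputStream(header)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a RandomAccessFile, a call to seek(schema.recordZeroOffset) after reading the schema would position the file at record zero.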
- for the numeric values, I should be using RandomAccessFile#readInt and #readShort
- the valid record flag should equal a string of "\u0000\u0000" and the delete field flag should equal a string of "\u8000"
- I should be using RandomAccessFile#readFully instead of #read when loading my byte[] objects
- When I convert the bytes I read into a String, I should do a new String(bytes,"US-ASCII") and a strObj.getBytes("US-ASCII") on writes
- "US-ASCII" is really 7 bit and I need 8 bit. Am I missing something here or do I need another encoding?
- I'm not sure of the best way to handle my delete flag writes, RandomAccessFile#writeChars("\u8000")???
- Even though it's been highly debated, I think I'll refrain from trimming the trailing spaces that follow many of the values in the data file when I read them into memory.
- When reading in the field values, I'll have to loop through the chars and find the first null, everything before that will be my field value.

Thanks a lot gang.
-Tim
Andrew Monkhouse
author and jackaroo
Marshal Commander

Joined: Mar 28, 2003
Posts: 11508
    

Hi Tim,
Nice summation of so many discussions.
Quote: "US-ASCII" is really 7 bit and I need 8 bit. Am I missing something here or do I need another encoding?

Welcome to the wonderful world of clueless user specifications. :roll:
You have to make a design decision. Is it likely the user wants US-ASCII or an 8-bit format?
By the way: UTF-8 "uses all bits of an octet, but has the quality of preserving the full US-ASCII range: US-ASCII characters are encoded in one octet having the normal US-ASCII value, and any octet with such a value can only stand for an US-ASCII character, and nothing else." (from the UTF-8 RFC).
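To make the trade-off concrete, here is a small sketch comparing how three standard charsets treat a character outside the 7-bit range. The class and method names are invented for the example, and ISO-8859-1 is included only as one possible reading of "8 bit", which is an assumption rather than anything the specification says.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: compares charsets on a non-ASCII character.
final class CharsetComparison {

    /** Number of bytes the text occupies once encoded with the given charset. */
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    /** Decodes raw file bytes with the given charset. */
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9";              // e with an acute accent, outside 7-bit ASCII
        byte[] highByte = {(byte) 0xE9};       // the same character as a single 8-bit byte

        // Plain ASCII text occupies the same number of bytes in all three charsets.
        System.out.println(encodedLength("abc", StandardCharsets.US_ASCII));   // 3
        System.out.println(encodedLength("abc", StandardCharsets.UTF_8));      // 3

        // UTF-8 needs two bytes for this character, which matters for fixed-length fields.
        System.out.println(encodedLength(eAcute, StandardCharsets.UTF_8));      // 2
        System.out.println(encodedLength(eAcute, StandardCharsets.ISO_8859_1)); // 1

        // US-ASCII cannot represent a byte above 0x7F and substitutes U+FFFD on decode.
        System.out.println(decode(highByte, StandardCharsets.US_ASCII).equals("\uFFFD")); // true
        System.out.println(decode(highByte, StandardCharsets.ISO_8859_1).equals(eAcute)); // true
    }
}
```

If the data file only ever holds 7-bit characters, all three choices read and write identical bytes, so the decision mostly affects how unexpected input is handled.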
Quote: I'm not sure of the best way to handle my delete flag writes

Have you considered converting 0x8000 into the equivalent short value, and reading and writing it that way?
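A minimal sketch of that idea, assuming the flag is handled as a signed 16-bit short (the class and constant names here are invented, not taken from the assignment):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical helper: treats the 2 byte record flag as a short.
final class RecordFlag {
    static final short VALID = 0x0000;
    // 0x8000 does not fit in a signed short literal, so it needs a cast; the value is -32768.
    static final short DELETED = (short) 0x8000;

    /** Writes the two flag bytes at the current position. */
    static void write(DataOutput out, boolean deleted) throws IOException {
        out.writeShort(deleted ? DELETED : VALID);
    }

    /** Reads the two flag bytes at the current position and reports whether the record is deleted. */
    static boolean isDeleted(DataInput in) throws IOException {
        return in.readShort() == DELETED;
    }
}
```

RandomAccessFile implements both DataInput and DataOutput, so it could be passed to either method after seeking to the start of a record.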
Quote: When reading in the field values, I'll have to loop through the chars and find the first null, everything before that will be my field value.

Presumably stopping if you reach the end of a field length without finding a null.
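Something along these lines would cover both cases. It is a sketch only: FieldCodec is a made-up name, the charset is left as a parameter because that decision is still open, and trailing spaces are deliberately kept rather than trimmed.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical helper: converts between fixed-length field bytes and Strings.
final class FieldCodec {

    /** Returns the text before the first null byte, or the whole field if there is none. */
    static String decode(byte[] raw, Charset charset) {
        int end = 0;
        while (end < raw.length && raw[end] != 0) {
            end++;                                  // stops at a null or at the field length
        }
        return new String(raw, 0, end, charset);    // trailing spaces are preserved
    }

    /** Encodes the value and pads it with null bytes up to the fixed field length. */
    static byte[] encode(String value, int fieldLength, Charset charset) {
        byte[] bytes = value.getBytes(charset);
        if (bytes.length > fieldLength) {
            throw new IllegalArgumentException(
                    "value needs " + bytes.length + " bytes but the field holds " + fieldLength);
        }
        return Arrays.copyOf(bytes, fieldLength);   // copyOf zero-fills the remainder
    }
}
```

The raw array passed to decode would normally be filled with readFully, so a short read cannot leave stale bytes at the end of the field.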
***
This all sounds pretty good. It sounds like you have made a few design decisions to get to what you have written. Have you documented them?
Regards, Andrew


Vlad Rabkin
Ranch Hand

Joined: Jul 07, 2003
Posts: 555
Hi Timothy,
I agree with Andrew's statements, especially this one:
Quote: Have you considered converting 0x8000 into the equivalent short value, and reading and writing it that way?


Best,
Vlad
Timothy Johnson
Greenhorn

Joined: Sep 14, 2003
Posts: 13
Gentlemen,
Thanks for your input. Yeah, my choices.txt is growing by leaps and bounds but I'm learning a lot in the process.

I'm still a little stumped on the encoding though... I saw one individual claim to have gotten a 91% using the default encoding, Philippe M. is a proponent of "US-ASCII", and Andrew seems to imply that "UTF-8" is the way to fly. Hmmmm....

As far as the delete flag is concerned... I could use RandomAccessFile.writeShort(Character.getNumericValue('\u8000')) but what's the real advantage over using RandomAccessFile.writeChars(new String("\u8000","UTF-8")) or even RandomAccessFile.writeChar(Character.getNumericValue('\u8000'))?
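One way to compare these options is to look at the bytes each call actually produces. The sketch below uses an invented FlagBytes helper writing to an in-memory stream. Two things stand out: Character.getNumericValue returns the numeric value a character represents (for example 7 for '7'), not its code unit, so it yields -1 for '\u8000'; and String has no constructor taking (String, String), so the writeChars(new String("\u8000","UTF-8")) form would not compile as written.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class: shows the bytes written by each candidate call.
final class FlagBytes {

    /** A write action that may throw IOException, so lambdas can call DataOutputStream methods. */
    interface Writer {
        void writeTo(DataOutputStream out) throws IOException;
    }

    /** Runs the write action against an in-memory stream and returns the bytes as hex. */
    static String hex(Writer writer) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            writer.writeTo(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : buffer.toByteArray()) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(out -> out.writeShort(0x8000)));          // 8000
        System.out.println(hex(out -> out.writeChar('\u8000')));         // 8000
        System.out.println(hex(out -> out.writeChars("\u8000")));        // 8000
        // getNumericValue('\u8000') is -1, so this writes the wrong flag.
        System.out.println(hex(out -> out.writeShort(Character.getNumericValue('\u8000')))); // FFFF
        // writeChars emits two bytes per char, so a two-char string is four bytes.
        System.out.println(hex(out -> out.writeChars("\u0000\u0000")));  // 00000000
    }
}
```

So writeShort, writeChar and a one-character writeChars all put the same two bytes in the file. The argument for the short version is mainly readability and symmetry with readShort on the way back in, with no charset involved.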

Thanks fellas,
Tim
 