NX: Bodgitt and Scarper - data file access caveats???

 
Timothy Johnson
Greenhorn
Posts: 13
Gurus,
I've read most of the posts here relating to reading and writing bytes to/from the data file. This is what I've come up with and I want to make sure that I'm not doing anything blatantly idiotic. First, I'll post the data file format and then my assumptions.
**** Data File Format Start ****
Start of file
4 byte numeric, magic cookie value identifies this as a data file
4 byte numeric, offset to start of record zero
2 byte numeric, number of fields in each record
Schema description section.
Repeated for each field in a record:
2 byte numeric, length in bytes of field name
n bytes (defined by previous entry), field name
2 byte numeric, field length in bytes
end of repeating block
Data section. (offset into file equal to "offset to start of record zero" value)
Repeat to end of file:
2 byte flag. 00 implies valid record, 0x8000 implies deleted record
Record containing fields in order specified in schema section, no separators between fields, each field fixed length at maximum specified in schema information
End of file
All numeric values are stored in the header information use the formats of the DataInputStream and DataOutputStream classes. All text values, and all fields (which are text only), contain only 8 bit characters, null terminated if less than the maximum length for the field. The character encoding is 8 bit US ASCII.
**** Data File Format End ****
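
For reference, the header and schema layout described above can be read through the DataInput interface, which both RandomAccessFile and DataInputStream implement. This is only a minimal sketch: the class and member names (SchemaReader, Header, Field, parse) are hypothetical, and reading the 2-byte lengths with readUnsignedShort rather than readShort is an assumption made so that lengths can never come out negative. StandardCharsets also assumes Java 7 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical names; only the byte layout comes from the format description.
final class SchemaReader {

    static final class Field {
        final String name;
        final int length;
        Field(String name, int length) { this.name = name; this.length = length; }
    }

    static final class Header {
        final int magicCookie;
        final int recordZeroOffset;
        final List<Field> fields = new ArrayList<>();
        Header(int magicCookie, int recordZeroOffset) {
            this.magicCookie = magicCookie;
            this.recordZeroOffset = recordZeroOffset;
        }
    }

    // Works with a RandomAccessFile or a DataInputStream positioned at byte 0.
    static Header readHeader(DataInput in) throws IOException {
        // 4-byte magic cookie, then 4-byte offset to record zero.
        Header header = new Header(in.readInt(), in.readInt());
        int fieldCount = in.readUnsignedShort();          // 2-byte field count
        for (int i = 0; i < fieldCount; i++) {
            byte[] nameBytes = new byte[in.readUnsignedShort()]; // 2-byte name length
            in.readFully(nameBytes);                      // n bytes of field name
            String name = new String(nameBytes, StandardCharsets.US_ASCII);
            header.fields.add(new Field(name, in.readUnsignedShort())); // 2-byte field length
        }
        return header;
    }

    // Convenience wrapper for in-memory data, e.g. in unit tests.
    static Header parse(byte[] data) {
        try {
            return readHeader(new DataInputStream(new ByteArrayInputStream(data)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Summing the field lengths from the parsed schema gives the fixed record length needed to step through the data section.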
- for the numeric values, I should be using RandomAccessFile#readInt and #readShort
- the valid record flag should equal a string of "\u0000\u0000" and the delete field flag should equal a string of "\u8000"
- I should be using RandomAccessFile#readFully instead of #read when loading my byte[] objects
- When I convert the bytes I read into a String, I should do a new String(bytes,"US-ASCII") and a strObj.getBytes("US-ASCII") on writes
- "US-ASCII" is really 7 bit and I need 8 bit. Am I missing something here or do I need another encoding?
- I'm not sure of the best way to handle my delete flag writes, RandomAccessFile#writeChars("\u8000")???
- Even though it's been highly debated, I think I'll keep from trimming the spaces following many of the values in the data file, when I read them into memory.
- When reading in the field values, I'll have to loop through the chars and find the first null, everything before that will be my field value.
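
The readFully and record-flag points above could be sketched roughly as follows. This assumes the record length has already been computed from the schema, and the names RecordReader and readRecord are hypothetical. Returning null for a deleted record is just one possible design choice, not something the assignment prescribes.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper: fetches the raw bytes of one fixed-length record.
final class RecordReader {
    static final int FLAG_BYTES = 2;

    /** Returns the raw bytes of record recNo, or null if it is flagged as deleted. */
    static byte[] readRecord(RandomAccessFile raf, long recordZeroOffset,
                             int recordLength, int recNo) throws IOException {
        // Every record occupies a 2-byte flag plus recordLength bytes of field data.
        long position = recordZeroOffset + (long) recNo * (FLAG_BYTES + recordLength);
        raf.seek(position);
        int flag = raf.readUnsignedShort();
        if (flag == 0x8000) {
            return null; // deleted record
        }
        byte[] data = new byte[recordLength];
        // readFully either fills the whole array or throws EOFException;
        // plain read() may legally return fewer bytes than requested.
        raf.readFully(data);
        return data;
    }
}
```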

Thanks a lot gang.
-Tim
 
Andrew Monkhouse
author and jackaroo
Marshal Commander
Posts: 11833
Hi Tim,
Nice summation of so many discussions.
> "US-ASCII" is really 7 bit and I need 8 bit. Am I missing something here or do I need another encoding?

Welcome to the wonderful world of clueless user specifications. :roll:
You have to make a design decision. Is it likely the user wants US-ASCII or an 8-bit format?
By the way: UTF-8 "uses all bits of an octet, but has the quality of preserving the full US-ASCII range: US-ASCII characters are encoded in one octet having the normal US-ASCII value, and any octet with such a value can only stand for an US-ASCII character, and nothing else." (from the UTF-8 RFC).
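
One way to inform that design decision is to check what each candidate charset does with a byte whose high bit is set. The sketch below is only an illustration: the EncodingCheck class is hypothetical, and ISO-8859-1 is included as an assumption (it is not mentioned in the assignment) because it maps every byte value to exactly one character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for comparing candidate charsets on raw field bytes.
final class EncodingCheck {

    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    // True if decoding and re-encoding reproduces the original bytes exactly.
    static boolean survivesRoundTrip(byte[] raw, Charset charset) {
        return Arrays.equals(raw, new String(raw, charset).getBytes(charset));
    }

    public static void main(String[] args) {
        byte[] sevenBit = {0x41, 0x42};          // "AB", plain 7-bit data
        byte[] eightBit = {0x41, (byte) 0xE9};   // second byte has the high bit set
        Charset[] candidates = {
            StandardCharsets.US_ASCII, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8
        };
        for (Charset cs : candidates) {
            System.out.println(cs + ": 7-bit ok=" + survivesRoundTrip(sevenBit, cs)
                    + ", 8-bit ok=" + survivesRoundTrip(eightBit, cs));
        }
    }
}
```

For data that really is 7-bit, all three charsets read and write identical bytes, so the choice only matters if the file ever contains values above 0x7F.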
> I'm not sure of the best way to handle my delete flag writes

Have you considered converting 0x8000 into the equivalent short value, and reading and writing it that way?
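
A minimal sketch of that suggestion, with hypothetical names (RecordFlags, VALID, DELETED). The one catch is that 0x8000 does not fit in a signed short as a positive number, so the constant needs a cast and ends up as -32768.

```java
import java.io.DataInput;
import java.io.IOException;

// Hypothetical constants for the 2-byte record flag.
final class RecordFlags {
    static final short VALID = 0x0000;
    // Bit pattern 0x8000; as a signed short this is -32768.
    static final short DELETED = (short) 0x8000;

    // Reads the 2-byte flag at the current position (RandomAccessFile implements DataInput).
    static boolean isDeleted(DataInput in) throws IOException {
        // Compare against the short constant. Comparing readShort() with the
        // int literal 0x8000 (32768) would always be false.
        return in.readShort() == DELETED;
    }
}
```

Writing is symmetric: writeShort(RecordFlags.DELETED) emits the bytes 0x80 0x00.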
> When reading in the field values, I'll have to loop through the chars and find the first null, everything before that will be my field value.

Presumably stopping if you reach the end of a field length without finding a null.
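
A rough sketch of that loop, assuming the whole record has already been read into a byte array and that US-ASCII is the chosen encoding. FieldParser and fieldValue are hypothetical names. Trailing spaces are left untouched, in line with Tim's decision not to trim.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: extracts one fixed-length, possibly null-terminated field.
final class FieldParser {

    static String fieldValue(byte[] record, int offset, int fieldLength) {
        int limit = offset + fieldLength;
        int end = offset;
        // Stop at the first null byte, or at the end of the field if there is none.
        while (end < limit && record[end] != 0) {
            end++;
        }
        return new String(record, offset, end - offset, StandardCharsets.US_ASCII);
    }
}
```

For example, an 8-byte field holding "Fred" followed by four null bytes yields "Fred", while a field that is filled to its full length is returned whole.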
***
This all sounds pretty good. It sounds like you have made a few design decisions to get to what you have written. Have you documented them?
Regards, Andrew
 
Vlad Rabkin
Ranch Hand
Posts: 555
Hi Timothy,
I agree with Andrew's points, especially this one:
> Have you considered converting 0x8000 into the equivalent short value, and reading and writing it that way?


Best,
Vlad
 
Timothy Johnson
Greenhorn
Posts: 13
Gentlemen,
Thanks for your input. Yeah, my choices.txt is growing by leaps and bounds but I'm learning a lot in the process.

I'm still a little stumped on the encoding though... I saw one individual claim to have gotten a 91% using the default encoding, Philippe M. is a proponent of "US-ASCII", and Andrew seems to imply that "UTF-8" is the way to fly. Hmmmm....

As far as the delete flag is concerned... I could use RandomAccessFile.writeShort(Character.getNumericValue('\u8000')) but what's the real advantage over using RandomAccessFile.writeChars(new String("\u8000","UTF-8")) or even RandomAccessFile.writeChar(Character.getNumericValue('\u8000'))?
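
To compare these options concretely, the sketch below captures the bytes each call produces, using a DataOutputStream over an in-memory buffer (RandomAccessFile implements the same DataOutput methods). Two details are worth noting about the calls as written above: Character.getNumericValue returns the number a digit character stands for (for example 7 for '7'), not the character's code, so a cast such as (short) 0x8000 or the literal itself is the safer way to get the flag value; and String has no (String, String) constructor, so the sketch passes the literal "\u8000" directly. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical comparison of three ways to write the deleted-record flag.
final class DeleteFlagWrites {

    interface Writer {
        void write(DataOutputStream out) throws IOException;
    }

    // Runs one write against an in-memory buffer and returns the bytes produced.
    static byte[] bytesOf(Writer writer) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            writer.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static byte[] viaWriteShort() { return bytesOf(out -> out.writeShort(0x8000)); }
    static byte[] viaWriteChar()  { return bytesOf(out -> out.writeChar(0x8000)); }
    static byte[] viaWriteChars() { return bytesOf(out -> out.writeChars("\u8000")); }
}
```

All three produce the same two bytes, 0x80 followed by 0x00, so the difference is one of intent rather than output: the flag is described as a 2-byte numeric value, and writeShort says so directly, whereas the char-based calls treat it as text.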

Thanks fellas,
Tim