This week's book giveaway is in the OCMJEA forum.
We're giving away four copies of OCM Java EE 6 Enterprise Architect Exam Guide and have Paul Allen & Joseph Bambara on-line!
See this thread for details.
The moose likes Developer Certification (SCJD/OCMJD) and the fly likes NX: (Contractors) Memory Map of data file Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login


Win a copy of OCM Java EE 6 Enterprise Architect Exam Guide this week in the OCMJEA forum!
JavaRanch » Java Forums » Certification » Developer Certification (SCJD/OCMJD)
Bookmark "NX: (Contractors) Memory Map of data file" Watch "NX: (Contractors) Memory Map of data file" New topic
Author

NX: (Contractors) Memory Map of data file

Jeff Wisard
Ranch Hand

Joined: Jan 07, 2002
Posts: 89
Hello,
Since I am relatively unexperienced with the new NIO features of Java 1.4.x, I was wondering if someone might be able to explain the pros and cons of mapping our data file records to memory using the FileChannel.map method.
I can see one advantage in that it would be relatively easy to create ByteBuffers for each record (or even each field) and thus it might make finding records easier (if you store each record/ByteBuffer in a HashMap with the key as the record number). This might make writing the findByCriteria, readRecord, and updateRecord methods easier to code.
So I can see some pluses from a design standpoint...but what are the cons? I read in the javadoc that there might be a performance hit doing this....but it seems that that would only be an up-front hit. Wouldn't it be faster for reads since the data is already in memory?
Has anyone else used MappedByteBuffer for this exam?
Thanks!


Jeff Wisard<br />Sun Certified Java Programmer (Java 2)<br />Sun Certified Web Component Developer
james render
Ranch Hand

Joined: May 08, 2003
Posts: 72
Hi Jeff, was thinking about using the MappedByteBuffer myself.
I vaguely recall reading somewhere that it wasn;t too efficient if your file is 10k or under but more importantly is that:
'once a file has been mapped into a memory buffer, there is no guarantee that another application or thread will not alter the physical file by mapping it into a byte buffer.
Therefore, any application can edit and alter the contents of the physical file, while the data buffer remains unchanged.
MappedByteBuffer instances should only be used when there is no need to worry about maintaining the data synchronization between the buffer and the source'.
p.281 "the sun certified java developer exam with j2se 1.4"
apologies if already read/knew that.
I don't think that this is necessarily a reason not to use the MappedByteBuffer but a warning about taking necessary precautions. Maybe if you can guarantee that multiple threads don;t have their own buffers getting out of synch with each other and the file, and that you write an assumption that no other applications are editing the file in the background....
bit of mind dump for myself as I've yet to make a decision on my I/O technique
hope this offers up some ideas..
james


[SCJP][SCWCD][SCJD]
Jeff Wisard
Ranch Hand

Joined: Jan 07, 2002
Posts: 89
Thanks for your reply, James.
Not much new information. I've gone ahead and implemented my Data class using memory mapping. I memory map each record in the database (using MappedByteBuffer) and then store each record in a HashMap with the record number as the key.
This seems to make writing the readRecord, updateRecord, and findByCriteria methods simpler...although that may just be my perception.
You're right in that I will have to document that I assume no outside application will be editing the underlying database file.
If anyone else has any comments, I would appreciate hearing them.
Thanks.
Jeff Wisard
Ranch Hand

Joined: Jan 07, 2002
Posts: 89
Actually, after just having re-read this from my instructions...

The company's IT department has a data file that contains the essential information for the company, but because the data must continue to be manipulated for reports using another custom-written application, the new system must reimplement the database code from scratch without altering the data file format.

I think I am now going to re-write my Data class (again) to NOT used memory mapped buffers. This explicitly says that another application manipulates this data. Damn!
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Nah - see this under "Locking":
You may assume that at any moment, at most one program is accessing the database file; therefore your locking system only needs to be concerned with multiple concurrent clients of your server.

The way I figure it, we need to keep the existing file format because some other program may need it at another time, when our server's not running. But while the server's running, no one else should have access to the DB file. I'm requesting a FileLock on the whole file to add an extra level of security. No, this doesn't guarantee exclusive access on all systems (it's OS dependant) - but we're not really supposed to have to worry about other programs accessing the DB file ayway. I figure getting a FileLock is an easy-to-do extra precaution that shouldn't really be necessary.
I think memory mapping is probably worthwhile since it's a one-time thing, and we're going to be accessing the file repeatedly.


"I'm not back." - Bill Harding, Twister
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
One problem with memory mapping though is what to do when creating new records which must be appended to the existing file? The original mapping is only valid for the existing file. One would need to re-map to include all of a newly-enlarged file. There are probably ways to do this without having to re-map too often (when file growth is needed, grow more than the needed amount, so you don't re-grow as often) but my feeling is it's probably a bit more complex than is warranted here, considering that performance did not seem to be a big issue in the first place for this assignment. So nevermind, no memory mapping for me...
[ May 17, 2003: Message edited by: Jim Yingst ]
Jeff Wisard
Ranch Hand

Joined: Jan 07, 2002
Posts: 89
Well, outside of the fact that I won't be doing this (as I stated in my previous post), it would actually be pretty easy to add new records...
What I had planned (before I canned the idea) was to map each record by itself and store the reference to the ByteBuffer for that record in a HashMap with the record number as the key. Therefore, adding new records simply means writing the record to the file and then mapping that single new record. Then add that with the new record number to the HashMap.
Anyway, its a moot point. My requirements indicate that I can't do memory mapping.
Thanks!
Jeff Wisard
Ranch Hand

Joined: Jan 07, 2002
Posts: 89
Oh wait! I didn't read your first reply. You're right...I think maybe I can use a Memory Map as long as I point out that the locking scheme depends on the fact that ONLY my server will be accessing the database file. (Otherwise, you can never be sure that your updates will succeed...).
Thanks...I missed that idea.
I would, however, like to know what Max or Mark have to say about using a memory map.
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Well, outside of the fact that I won't be doing this (as I stated in my previous post)
Understood. But it's something I was thinking about doing too, and probably others. So I'm considering pros and cons of the idea, regardless of whether you use it.
it would actually be pretty easy to add new records...
Well, it may be easy to code if we map each record independantly - but it seems to me that this just magnifies the performance hit you get from memory mapping. I mean, the API suggests mapping's not worthwhile unless we're dealing with a file size in excess of "tens of kilobytes". Maybe we can justify mapping a whole DB file based on the fact it will get reused a lot - but mapping a single record? 183 bytes long? How much re-use do we anticipate for that one record? I suppose if enough searches are run - one search typically involves accessing all records, so repeated searches could run up the re-use factor to where mapping is worthwhile. But I'm still pretty skeptical of this.
Jeff Wisard
Ranch Hand

Joined: Jan 07, 2002
Posts: 89
My first implementation involved mapping all the records into one buffer. I then sliced that buffer into individual buffers for each record. So really, I am using one large buffer for the map...but dividing it up for convenience. I store those convenient record-sized buffers in the HashMap and discard the reference to the original large buffer. I think that the record-size buffers simply use the parent buffer for reading and writing...
But the problem then becomes, "What happens when you want to add new records?" Like you said, this would mean re-mapping everything again when you create a new record. At this point, I thought I would just create a new record-sized buffer with no underlying parent buffer. But I think you may be right. Over time, this could mean more of the file is using the small maps than the large one.
Of course, we are not required to implement the create method....but our design should take into consideration that the create method will be implemented later.
I'm going to have to think this one over yet again. Does anyone else have any input?
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Of course, we are not required to implement the create method
We're not? I see that the GUI isn't required to create new records, or delete, or update anything other than the customer ID apparently. But the Data class is required to "implement" the interface DB. I guess that could just mean we have to write "Date implements DB" and make sure there are method stubs to at least satisfy the compiler. But to me it seems as though we're supposed to provide working implementations of the methods, which do what the comments say they do. At least, that's what I've been doing - hope it's not looked on as "wasted" work. I've got a nice JUnit class to excerise each of these methods - if I was one of Sun's graders, I'd be using it as part of my grading process. I sort of assumed that was a main motivation for them spelling out class and method names here - so their graders could more easily interface with the code to test it. Has anyone done this assignment without fully implementing the listed methods? And did they pass?
Jeff Wisard
Ranch Hand

Joined: Jan 07, 2002
Posts: 89
Just by reading other posts, I came to the conclusion that I would throw an UnsupportedOperationException from the createRecord and deleteRecord methods of data. Why implement them if the client will never use them? Write in the documentation that they will be saved for a future iteration when they are needed.
However, I am reconsidering in the light of using a memory map. I am thinking that I might try to implement that map in a way similar to how an ArrayList (or other collection) grows its internal array implementation when objects are added to it. That is, allocate the memory map buffer to be a certain percentage bigger than what is needed to store the entire set of records. Then, in the createRecord method, check to see if the next record will fit into the buffer before writing it. If it can't, then reallocate the buffer...again providing extra space.
I think this solution will work (although I haven't tried it yet) and be acceptable because the create operation will probably not occur very often...at least compared to the reads and updates.
One thing that must happen in the createRecord method is to synchronize on a record count to ensure that two new records do not get the same record number...
Anyway, I'm still thinking this over. This is getting complicated...maybe too complicated. The instructions do say to keep it simple.
[ May 20, 2003: Message edited by: Jeff Wisard ]
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
My thinking is, we've got to be able to update() at least one field, and it's easiest to make it able to update all fields; once you've got that ability, the code to create() or delete() is pretty similar, and easy to do at that point. Well, memory re-mapping is an extra complication if you're doing that, and if you want create to re-use deleted record space when available (optional), that's another. I've already implemented the latter; not a big deal. I've probably spent enough time on the Data class for now. I can spawn 500 threads running various types of find(), and they typically all complete in under a second (and much better for 10 or 100 threads), so I'm going to stop worrying about performance for now. I'll be interested to hear how/if the memory mapping works out for you though.
[ May 20, 2003: Message edited by: Jim Yingst ]
Perry Board
Greenhorn

Joined: May 03, 2003
Posts: 29
Regarding implementing create/delete, I found it invaluable in testing my data class. The db file Sun provides is so tiny and limited that I wanted to spawn some "daemon" threads to do busy work adding/deleting records to see if I could raise any anomolies and to fill out the database with various dated records and so forth, so I could test my search method.
Even without this helpful side-benefit, I sort of think that the instructions to "implement DBMain" means to write the code. If Sun wants it written and you don't do it, you lose big. If Sun doesn't want it but you do, you lose small.
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Good point. The db file provided by Sun has no deleted records initially - but you need to be able to test that read() throws RecordNotFoundException when called on a deleted record. (They do tell us other processes need to be able to work with the file, though not simultaneously - this could mean the other processes may delete records when we're not running.) You could hand-edit a file for testing purposes, but it's cool if you can make it by calling delete(). Come to think of it, now I want to create a really huge data file to see how the performance holds up, or if there are any other bugs that reveal themselves...
[ May 20, 2003: Message edited by: Jim Yingst ]
Jeff Wisard
Ranch Hand

Joined: Jan 07, 2002
Posts: 89
Based on our discussion, I am throwing out the memory map idea for now. You've also convinced me that implementing create and delete are good ideas also.
Thanks...this thread helped a great deal.
Jeff Wisard
Ranch Hand

Joined: Jan 07, 2002
Posts: 89
...but one question about create: If you found a deleted record to replace with your new record, did you lock that record before over-writing it? It seems to me you have to perform the read->lock->read->update->unlock sequence in this scenario.
I only ask this (I'm sure its necessary) because the signature of the createRecord method does not include a lock cookie.
While I'm thinking out loud here: Are you synchronizing on something to ensure that two threads are not creating a new record at the same time? Is that necessary?
Perry Board
Greenhorn

Joined: May 03, 2003
Posts: 29
I don't allow locking of deleted records (throws a RecordNotFoundException) so my create method, if it's reusing a deleted row, first wipes it clean with all zero bits--thereby undeleting it, then immediately locks the record and updates with the new data.
And yes, it's important to still synchronize in create, since you don't want two different threads selecting the same deleted row and thinking they can wipe/use it.
Here's a wierd situation that came to light by using my legions of spawned daemons:

Conclusion was to not allow create to reuse a deleted record unless it was also unlocked. I suppose I might have automatically unlocked in my delete method, and I can't recall why I didn't do that.
Edit: I should add that my assignment doesn't use lock cookies.
[ May 21, 2003: Message edited by: Perry Board ]
Thomas Kijftenbelt
Ranch Hand

Joined: Feb 13, 2002
Posts: 73
Hi guys,
I have been reading the whole memory map discussion. Although I am using NIO, I am not using the memory map functionality:
In my init() I create a RandomAccessFile and get a FileChannel to this RandomAccessFile. When I read a record I read that record from my filechannel into a byteBuffer (which is 1 record long)... when updating I create a bytebuffer (1 record long) and then write this bytebuffer to the correct position in the filechannel -> don't forget to synchronize on the filechannel object as position & write are not atomic...
I also did not implement the create / delete methods (just throw UnsupportedOperationException), but based on the discussion above I'll have a look tonight to implement create() and delete() -> it is certainly convenient for testing purposes.
Greetings,
TK
[ May 21, 2003: Message edited by: Thomas Kijftenbelt ]
Jeff Wisard
Ranch Hand

Joined: Jan 07, 2002
Posts: 89
Hey Thomas,
I think it is possible to NOT synchronize on the FileChannel object as long as you don't rely on its position attribute. There are read, get, and put methods that don't modify the position and are therefore atomic calls.
Just thought you might be interested in this. I don't ever set the position in my FileChannel.
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Hmmm, I hadn't noted that FileChannel can be used without synchronization like that. Currently I'm syncing on the FileChannel just as Thomas described, and setting position at various times - but I can probably revise that. Cool, thanks for pointing it out.
As for delete/create and synchronization: in my design, I have a List in memory containing RecordMonitor objects for every single record (even the deleted ones). Since the API requires long cookie values for locking, I needed a place to store them; I found it easiest to just keep that RecordMonitor around permanently, rather than trying to only have them for records that have actually been locked. (There's an isLocked attribute to determine whether a given RecordMonitor is actually held by anyone). Anyway, as long as these objects exist, even for deleted records, they serve as convenient monitors to control access to a given single record. (Hence the name.) So for example when I create() a new record using a deleted record, I first sync on its monitor, and then check again that the records is currently deleted. Then write the new record info, and exit the sync block. So there's no way another thread could concurrently create() a record using the same deleted record. It would block on the RecordMonitor sync until the first thread completed its create.
Note that there's a difference between syncing on a RecordMonitor, and locking the associated record. The former acquires what we call a "lock" for synchronization purposes, but not a "lock" on the record in the sense of the API's lock(recNo) method. I try to avoid using the word "lock" when referring to synchronizing on a monitor, to minimize this confusion.
With this design, I don't believe it matters whether a deleted record is locked or not. Any significance of the original lock (and associated cookie value) is lost once a record is deleted. A record had to be locked in order to be deleted; I don't bother unlocking it as part of the delete. But if the program exits and restarts, the new RecordMonitor for the deleted record will start out unlocked. I suppose I could make these more consistent, but I don't think it matters - if a record is deleted, the only thing that can possibly be done with it is create a new record on top of it. As long as no two threads can do this concurrently (thanks to syncing on the RecordMonitor, not actually locking the record), what else could go wrong? (Famous last words, probably...)
Jeff Wisard
Ranch Hand

Joined: Jan 07, 2002
Posts: 89
Very interesting.
Question: What do you do when creating a new record (not replacing a deleted one)?
(BTW...I simply synchronized my createRecord method with the justification that the performance hit will be minimal due to the fact that creates will not happen often)
Also, do your lock cookies live as long as your RecordMonitor objects? That is, when you create a RecordMonitor for a record, do you assign it a lock cookie then that stays with that record for the life of the RecordMonitor? Seems to me that might be dangerous because a client could hold onto a lock cookie and potentially unlock a record that was locked by some other client.
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Question: What do you do when creating a new record (not replacing a deleted one)?
Prepare ByteBuffer of new record, create a new monitor (unlocked), sync on the monitors list, put monitor at end of list, (remember position - this is new recNo), write ByteBuffer to file, exit sync block. Because the monitor list can be structurally modified, I sync on it for all accesses to monitors. Not sure if this is necessary for most reads; probably not. But for anything related to how big the list is, yes.
I do have a lot of different levels of synchronization being used within the Data class; I'm considering simplifying it a bit. E.g. rather than sync on the monitor list, just sync on the Data instance. Same for several other operations. The multiple monitors allow greater concurrency - e.g. one thread can look up a monitor while another writes to the channel) but more room for confusion and possible deadlock if I'm not careful, so I'm still playing with that part of the design. If I can eliminate syncing on the channel, that helps.
(BTW...I simply synchronized my createRecord method with the justification that the performance hit will be minimal due to the fact that creates will not happen often)
Right, that's the sort of simplification I'll quite possibly make. But I'm setting up some multithreaded performance tests first so I see how big an impact these changes have.
Also, do your lock cookies live as long as your RecordMonitor objects? That is, when you create a RecordMonitor for a record, do you assign it a lock cookie then that stays with that record for the life of the RecordMonitor? Seems to me that might be dangerous because a client could hold onto a lock cookie and potentially unlock a record that was locked by some other client.
No, a given cookie value is only valid from lock() to unlock(). When lock() is called, I generate a new random long - that's the new cookie, saved in the record monitor and also returned to the caller of lock(). When a valid unlock() is received, the cookie in the monitor is set to 0. What happens with delete() was a special case where the cookie becomes irrelevant, so I didn't bother zeroing it out - but I guess I'll change that now, just to make sure readers can see more clearly that there's no way the deleted record can be modified. (Even though that's really prevented elsewhere, not by the cookie check.)
Thomas Kijftenbelt
Ranch Hand

Joined: Feb 13, 2002
Posts: 73
Hi,
I think it is possible to NOT synchronize on the FileChannel object as long as you don't rely on its position attribute. There are read, get, and put methods that don't modify the position and are therefore atomic calls.

I guess I'm a bit confused now: when writing my single record ByteBuffer to the FileChannel I have to indicate where to put is...
I guess I have two options:
- use position() and write(ByteBuffer)
- use write(position, bytebuffer) - or the other way around (I don't know by heart)
For each I have to sync on the FileChannel in order to complete atomically...
Greetings,
TK
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
From the FileChannel API:
File channels are safe for use by multiple concurrent threads. The close method may be invoked at any time, as specified by the Channel interface. Only one operation that involves the channel's position or can change its file's size may be in progress at any given time; attempts to initiate a second such operation while the first is still in progress will block until the first operation completes. Other operations, in particular those that take an explicit position, may proceed concurrently; whether they in fact do so is dependent upon the underlying implementation and is therefore unspecified.

So each individual FileChannel method call is thread-safe and can be considered atomic. (Or non-atomic but only if there's no instance data affected, so it doesn't matter). There's still one catch to beware of here - if you call position() and read() separately, it's possible for another thread to interrupt in between these calls, and reposition the channel elsewhere before the read. Which would be potentially confusing, to say the least. But as long as you use methods that take an explicit position rather than relying on one that was set previously, you're fine.
Note that read(ByteBuffer, long) really needs to be called in a loop to ensure that the desired number of bytes is filled - but this isn't a problem. Even if the total read gets split into several separate reads, each one is thread-safe, and as long as you update the position correctly for subsequent reads it doesn't matter if other threads insert themselves between reads - the explicit position parameter ensures that each individual read operation occurs where it's supposed to.
But wait - there is a problem. It's possible for another thread to update a record while you're in the midst of reading it - precisely because none of the read() methods can be guaranteed to read all the bytes you request at once (hence the need for a loop). In proctice this is fairly unlikely. Given the relatively short record length, I expect that FileChannel will most always read() a single record in just one read(). Unless the file is fragmented maybe. But we can't guarantee that. And so it seems necessary to have some sort of synchronization to ensure that no read and updates of the same record are taking place concurrently. Any other concurrent access to the FileChannel seems OK - but not to the same record. In my case, this is handled by synchronizing on the RecordMonitor objects - so no additional sync on the FileChannel is necessary. Other designs doubtless handle this differently.
[ May 25, 2003: Message edited by: Jim Yingst ]
Max Habibi
town drunk
( and author)
Sheriff

Joined: Jun 27, 2002
Posts: 4118
I recommend against the use of the Mapped memory buffer, because how it actually works is platform specific.
HTH,
M


Java Regular Expressions
Thomas Kijftenbelt
Ranch Hand

Joined: Feb 13, 2002
Posts: 73
Hi Jim,
File channels are safe for use by multiple concurrent threads. The close method may be invoked at any time, as specified by the Channel interface. Only one operation that involves the channel's position or can change its file's size may be in progress at any given time; attempts to initiate a second such operation while the first is still in progress will block until the first operation completes. Other operations, in particular those that take an explicit position, may proceed concurrently; whether they in fact do so is dependent upon the underlying implementation and is therefore unspecified.

OK, I read this part of the specification as well, and based on the last sentence I thougth that it depends on the underlying implementation whether proceeds atomically?!
Note that read(ByteBuffer, long) really needs to be called in a loop to ensure that the desired number of bytes is filled - but this isn't a problem.

I use another aproach: I create a ByteBuffer with the size of a record, and read fill the ByteBuffer in one step:

Greetings,
TK
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Ah, looks like I was careless in my reading there. You're right, there are some things that could happen concurrently and screw things up - if we ues the methods that take an explicit position, rather than using the FileChannel's instance field for postion. For one: two or more threads may read the same record simultaneously. This isn't a problem though - each thread will be able to read the same sequence of bytes. For another: one thread may be reading a record while another is updating it. Or two threads may be updating the same record with different values. These are both bad, because they can result in an inconsistent record. E.g. if the record was for
Buonarotti & Company
and someone was updating it to
Moore Power Tool Ya
then, depending on the order of events, you might read the record as
Buonarotti &Tool Ya
or
Moore Poti & Company
or any number of other possibilities. Which would be wrong. Fortunately my own design is safe from this, because I sync on a RecordMonitor object to prevent multiple threads accessing the same record simultaneously (while still allowing concurrent access to other parts of the file). However I was mistaken to call these methods atomic, in general - without some form of synchronization, there may still be problems with concurrent access on some platforms. So, good catch - thanks for pointing it out.
I use another aproach: I create a ByteBuffer with the size of a record, and read fill the ByteBuffer in one step:
Well, here again the API doesn't quite give the guarantees we'd like to have. From the API for ReadableByteChannel's read() method:

A similar problem and solution exist for the write() method.
This part of the API is really rather annoying to me. They (Sun) give precise rules about blocking and onblocking channels - but FileChannel is neither. I can't tell if they intentionally left this unspecified, or if they were just careless in the spec, figuring everyone would assume that all requested bytes are read. (People usually assume that for traditional IO as well, and I know it's not true there.) Until I hear otherwise from an official source though, I'm always going to be checking the return value of any read() or write().
[ May 31, 2003: Message edited by: Jim Yingst ]
Thomas Kijftenbelt
Ranch Hand

Joined: Feb 13, 2002
Posts: 73
Hi Jim,
Based on the above my solution is to synchronize on the FileChannel:

Greetings,
TK
PS Discussing these topics at this level is good fun; I'll now finish my JUnit tests and look how the whole thing performs...
[ June 01, 2003: Message edited by: Thomas Kijftenbelt ]
 
 
subject: NX: (Contractors) Memory Map of data file