posted 18 years ago
background:
i have to process large text files, 1 GB and larger (one of them is 3.5 GB), and i'll be processing as many as 12 in a row: essentially, going through a directory of these files and processing each one.
i'm trying to work out how i can do this efficiently, and NIO seems like a good bet. i have already written an application to process these files using standard i/o. it pegs my cpu and takes a very long time. that wasn't a showstopper for an occasional single file, but now i have to do it daily in a production environment, for a dozen or more files.
my "processing" is mostly filtering (examine the contents of a field and throw out the lines that contain specific data), and occasionally swapping the contents of two fields. since that is very little work per line, my feeling is that i/o is causing the performance crisis.
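to make the per-line work concrete, here is a minimal sketch of the filter-and-swap step. the tab delimiter, the field indexes, and the names `LineFilter` and `process` are all assumptions for illustration; the real file layout is not described above.

```java
import java.util.Optional;
import java.util.Set;

class LineFilter {
    /**
     * Returns the (possibly rewritten) line, or empty if the line is filtered out.
     * ASSUMPTION: fields are tab-separated; the actual delimiter is not known.
     */
    static Optional<String> process(String line, int filterField,
                                    Set<String> rejected, int swapA, int swapB) {
        // limit -1 keeps trailing empty fields so the line round-trips unchanged
        String[] fields = line.split("\t", -1);

        // throw out the line if the examined field holds one of the rejected values
        if (filterField < fields.length && rejected.contains(fields[filterField])) {
            return Optional.empty();
        }

        // swap two fields when both indexes exist on this line
        if (swapA < fields.length && swapB < fields.length) {
            String tmp = fields[swapA];
            fields[swapA] = fields[swapB];
            fields[swapB] = tmp;
        }
        return Optional.of(String.join("\t", fields));
    }
}
```

whatever i/o design is chosen only has to hand complete lines to something like this.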
therefore, my thought was to memory-map the files in chunks, in read and write buffers, and do the processing between these maps. from my reading, my understanding is that changes in the write buffer are written back to the disk copy automatically.
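for illustration, mapping a file in chunks might look roughly like the sketch below. it only counts newline bytes, to show the loop over mapped regions; `ChunkedMap` and `countNewlines` are hypothetical names, and it assumes Java 7 or later for `FileChannel.open`. a single mapping cannot exceed `Integer.MAX_VALUE` bytes, so a 3.5 GB file needs more than one map in any case.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ChunkedMap {
    /** Counts '\n' bytes by mapping the file one read-only region at a time. */
    static long countNewlines(Path file, long chunkSize) throws IOException {
        if (chunkSize <= 0 || chunkSize > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("chunkSize must be in 1..Integer.MAX_VALUE");
        }
        long count = 0;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            // 'pos' is the offset into the file where the next mapped region starts
            for (long pos = 0; pos < size; pos += chunkSize) {
                long len = Math.min(chunkSize, size - pos);
                MappedByteBuffer region = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (region.hasRemaining()) {
                    if (region.get() == '\n') {
                        count++;
                    }
                }
            }
        }
        return count;
    }
}
```

for output, a `READ_WRITE` mapping does propagate changes to the file, but the timing is up to the operating system; `MappedByteBuffer.force()` requests that the changes be flushed to storage. note also that a mapping has a fixed length, which is awkward when filtering makes the output shorter than the input.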
problem:
after spending a good number of hours today reading about NIO in online resources, i can't figure out a design for accomplishing this. there are lots of high-level overviews with code snippets, but no practical problem-solving demonstrations with NIO.
1. can anyone suggest a good design for accomplishing my task? i understand that i need to open a channel on a FileInputStream to read and on a FileOutputStream to write, but i don't understand how to set up a buffer that i can process line-by-line. i have to process the files line by line.
2. how do i read the file in chunks? it's not clear to me that i have a pointer in the disk copy that functions like a cursor in a record set. i know i have a 'position' pointer in the buffer. from the file-reading standpoint, it would seem to make most sense to simply reuse the same buffer.
3. how significant is the size of the buffer? i tried creating a 200MB buffer, but it blew up the VM:
i have no notion of how much buffer is enough, and i found no discussion of this topic.
4. i can't find any line-oriented functionality in NIO. how can i preserve line integrity when reading from and writing to byte or char buffers?
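as a sketch of how the chunk and line questions fit together: a channel keeps its own position, which advances with every `read`, so it acts as the cursor, and one buffer can be reused by moving any unfinished line to the front with `compact()` before the next read. the code below is one possible approach under stated assumptions, not a definitive design. `ChunkedLineReader` and `forEachLine` are hypothetical names, the text is assumed to be UTF-8 (or another encoding where byte `0x0A` only ever means newline), and a line longer than the buffer is treated as an error rather than handled.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

class ChunkedLineReader {
    /** Reads the channel through one reused buffer and passes each complete line to sink. */
    static void forEachLine(ReadableByteChannel in, int bufferSize,
                            Consumer<String> sink) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(bufferSize); // heap buffer, so array() is available
        boolean eof = false;
        while (!eof) {
            eof = in.read(buf) < 0;   // the channel advances its own position on each read
            buf.flip();               // switch from filling to draining: position 0, limit = data end
            int lineStart = 0;
            for (int i = 0; i < buf.limit(); i++) {
                if (buf.get(i) == '\n') {
                    sink.accept(decode(buf, lineStart, i));
                    lineStart = i + 1;
                }
            }
            if (eof) {
                // a final line without a trailing newline
                if (lineStart < buf.limit()) {
                    sink.accept(decode(buf, lineStart, buf.limit()));
                }
            } else {
                if (lineStart == 0 && buf.limit() == buf.capacity()) {
                    throw new IllegalStateException("line longer than buffer");
                }
                // carry the partial line to the front; the next read appends after it
                buf.position(lineStart);
                buf.compact();
            }
        }
    }

    // ASSUMPTION: UTF-8 text; a '\r' before '\n' is not stripped here
    private static String decode(ByteBuffer buf, int start, int end) {
        return new String(buf.array(), start, end - start, StandardCharsets.UTF_8);
    }
}
```

with this pattern the buffer only needs to be comfortably larger than the longest line, so a very large buffer is not required for correctness; whether a bigger one helps throughput is something to measure rather than assume.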
of course, i may be completely on the wrong track, and NIO may not be the answer at all.
thanks very much for any help. i've spent most of my saturday whipping this horse, and it's just not moving. ;-)
mp
If you have to ask what jazz is, you'll never know. --Louis Armstrong