background: i have to do some processing of large text files -- 1 GB and larger, and i'll be processing as many as 12 in a row. "larger": i have one file that is 3.5 GB. something like, going through a directory of these files and processing each one.
i'm trying to work out how i can do this efficiently, and NIO seems like a good bet. i have already written an application to process these files using standard i/o. it pegs my cpu and takes a very long time. that wasn't a showstopper for an occasional single file, but now i have to do it in a production environment daily, for a dozen or more files.
my "processing" is mostly filtering (examine the contents of a field and throw out the lines that contain specific data), and occasionally swapping the contents of two fields. therefore, my feeling is that i/o is causing the performance crisis.
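The per-line work described above could be sketched roughly as follows. This is only an illustration: the class name `LineRules`, the delimiter, the field indexes and the "banned" value are all assumptions, since the post does not say how the fields are laid out.

```java
import java.util.regex.Pattern;

public class LineRules {
    /**
     * Hypothetical per-line rule: drop the line if one field contains a
     * banned value, otherwise swap two fields.
     * Returns the processed line, or null if the line should be dropped.
     */
    public static String process(String line, String delim, int filterField,
                                 String banned, int swapA, int swapB) {
        // limit -1 keeps trailing empty fields, so the column count is preserved
        String[] fields = line.split(Pattern.quote(delim), -1);

        // Filter: throw the line out if the examined field holds the banned data
        if (filterField < fields.length && fields[filterField].contains(banned)) {
            return null;
        }

        // Swap: exchange two fields when both exist; short lines pass through unchanged
        if (swapA < fields.length && swapB < fields.length) {
            String tmp = fields[swapA];
            fields[swapA] = fields[swapB];
            fields[swapB] = tmp;
        }
        return String.join(delim, fields);
    }
}
```

Work like this is cheap per line, so it is worth measuring where the time actually goes before assuming the i/o layer is the bottleneck.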
therefore, my thought was to memory-map the files in chunks, in read and write buffers, and do the processing between these maps. from my reading, my understanding is that changes in the write buffer are written back to the disk copy automatically.
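A minimal sketch of mapping a file in chunks, assuming a Java 7 or later runtime (`java.nio.file`, try-with-resources). The class name, the chunk size and the byte-replacement task are made up for illustration. Changes made through a `READ_WRITE` mapping do reach the file, and `force()` asks for them to be flushed. A mapping has a fixed size, though, so this style suits in-place edits that keep the length unchanged; dropping lines still needs a separate output file.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedChunkSketch {
    /**
     * Walks the file in windows of chunkSize bytes, replacing every
     * occurrence of one byte with another in place.
     * Returns the number of bytes replaced.
     */
    public static long replaceByte(Path file, int chunkSize, byte from, byte to)
            throws IOException {
        long replaced = 0;
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            long size = ch.size();
            for (long pos = 0; pos < size; pos += chunkSize) {
                // The last window may be shorter than chunkSize
                int len = (int) Math.min(chunkSize, size - pos);
                MappedByteBuffer map =
                        ch.map(FileChannel.MapMode.READ_WRITE, pos, len);
                for (int i = 0; i < len; i++) {
                    if (map.get(i) == from) {
                        map.put(i, to);   // absolute put, written through to the file
                        replaced++;
                    }
                }
                map.force();              // flush this window's changes to disk
            }
        }
        return replaced;
    }
}
```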
problem: after spending a good number of hours today reading about NIO through online resources, i can't figure out a design for accomplishing this. there are lots of high-level overviews with code snippets, but no practical problem-solving demonstrations with NIO.
1. can anyone suggest a good design for accomplishing my task? i understand that i need to open a channel on a FileInputStream to read and on a FileOutputStream to write, but i don't understand how to set up a buffer that i can process line-by-line. i have to process the files line by line.
2. how do i read the file in chunks? it's not clear to me that i have a pointer in the disk copy that functions like a cursor in a record set. i know i have a 'position' pointer in the buffer. from the file-reading standpoint, it would seem to make most sense to simply reuse the same buffer.
3. how significant is the size of the buffer? i tried creating a 200MB buffer, but it blew up the VM:
i have no notion of how much buffer is enough and i found no discussion of this topic.
4. i can't find any line-oriented functionality in NIO. how can i preserve line integrity when reading from and writing to byte or char buffers?
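One possible shape for the questions above, offered as a sketch and not a tuned design. A `FileChannel` keeps its own position, which advances with every `read`, so it does behave like a cursor and a single small buffer can be reused for the whole file. Line integrity is handled here by accumulating bytes until a `'\n'` is seen, so a line that straddles two chunks is reassembled before it is examined. The class name, the substring filter and the single-byte charset (ISO-8859-1) are assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import static java.nio.file.StandardOpenOption.*;

public class ChunkedLineFilter {
    /** Copies in to out, dropping lines that contain reject. Returns lines kept. */
    public static long filter(Path in, Path out, String reject, int bufSize)
            throws IOException {
        long kept = 0;
        ByteBuffer inBuf = ByteBuffer.allocate(bufSize);   // reused for every chunk
        ByteBuffer outBuf = ByteBuffer.allocate(bufSize);  // batches kept lines
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        try (FileChannel src = FileChannel.open(in, READ);
             FileChannel dst = FileChannel.open(out, CREATE, WRITE, TRUNCATE_EXISTING)) {
            while (src.read(inBuf) != -1) {   // the channel position advances by itself
                inBuf.flip();
                while (inBuf.hasRemaining()) {
                    byte b = inBuf.get();
                    line.write(b);
                    if (b == '\n') {
                        kept += emit(line, outBuf, dst, reject);
                    }
                }
                inBuf.clear();                // partial line stays in 'line', not in inBuf
            }
            if (line.size() > 0) {            // final line without a trailing newline
                kept += emit(line, outBuf, dst, reject);
            }
            flush(outBuf, dst);
        }
        return kept;
    }

    private static int emit(ByteArrayOutputStream line, ByteBuffer outBuf,
                            FileChannel dst, String reject) throws IOException {
        byte[] bytes = line.toByteArray();
        line.reset();
        // ISO-8859-1 maps each byte to one char; an assumption about the data
        if (new String(bytes, StandardCharsets.ISO_8859_1).contains(reject)) {
            return 0;
        }
        if (bytes.length > outBuf.remaining()) {
            flush(outBuf, dst);               // make room, preserving output order
        }
        if (bytes.length > outBuf.capacity()) {
            ByteBuffer big = ByteBuffer.wrap(bytes);   // line longer than the buffer
            while (big.hasRemaining()) dst.write(big);
        } else {
            outBuf.put(bytes);
        }
        return 1;
    }

    private static void flush(ByteBuffer outBuf, FileChannel dst) throws IOException {
        outBuf.flip();
        while (outBuf.hasRemaining()) dst.write(outBuf);
        outBuf.clear();
    }
}
```

With this structure the buffer size only affects how many read and write calls are made, not correctness, so a buffer in the tens or hundreds of kilobytes is a reasonable starting point to measure from.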
of course, i may be completely on the wrong track, and NIO may not be the answer, too.
thanks very much for any help. i've spent most of my saturday whipping this horse, and it's just not moving. ;-)
If you have to ask what jazz is, you'll never know. --Louis Armstrong
NIO doesn't have a performance advantage over stream-based IO (i.e. java.io.*). When NIO was introduced, the stream-based classes were re-implemented to use NIO under the covers. What NIO gives us is features not available in stream-based IO (see this article for examples).
I don't know NIO well enough to answer your questions, and NIO hasn't really won enough mindshare to get a lot of feedback on this forum. I recommend that people who want to use NIO get a book.
I can say that creating a 200 MB buffer is doomed to failure. The default maximum heap size is 64MB. You need to use command-line options to create a larger heap (-Xms, -Xmx).
Make sure that your hardware is up to the task. Moving 1GB of data is slow enough. If you don't have fast disks, lots of memory and a fast CPU, optimizing code isn't going to get you anything.
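To make the heap point concrete, here is a small sketch, assuming a Java 7 or later runtime. It filters a file line by line through `java.io` with modest 64 KB buffers, so no huge buffer is needed whatever the heap limit is, and its `main` prints the maximum heap the running JVM will use. The class name and the substring filter are hypothetical. A larger heap would be requested on the command line, for example `java -Xmx512m StreamFilterSketch`, where the 512m figure is only an example value.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

public class StreamFilterSketch {
    /** Copies in to out, skipping lines that contain reject. Returns lines kept. */
    public static long copyWithout(Path in, Path out, String reject) throws IOException {
        long kept = 0;
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                     new FileInputStream(in.toFile()), StandardCharsets.ISO_8859_1), 1 << 16);
             BufferedWriter w = new BufferedWriter(new OutputStreamWriter(
                     new FileOutputStream(out.toFile()), StandardCharsets.ISO_8859_1), 1 << 16)) {
            String line;
            while ((line = r.readLine()) != null) {
                if (line.contains(reject)) {
                    continue;              // drop the line
                }
                w.write(line);
                w.write('\n');             // readLine strips terminators; endings become \n
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Reports the heap ceiling currently in effect for this JVM
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("max heap MB: " + maxMb);
    }
}
```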