NIO with large files

 
Greenhorn
Posts: 10
background:
i have to do some processing of large text files -- 1 GB and larger, and i'll be processing as many as 12 in a row. "larger": i have one file that is 3.5 GB. something like, going through a directory of these files and processing each one.

i'm trying to work out how i can do this efficiently, and NIO seems like the good bet. i already have written an application to process these files using standard i/o. it pegs my cpu and takes a very long time. that wasn't a showstopper for an occasional single file, but now i have to do it in a production environment daily, for a dozen or more files.

my "processing" is mostly filtering (examine the contents of a field and throw out the lines that contain specific data), and occasionally swapping the contents of two fields. therefore, my feeling is that i/o is causing the performance crisis.

therefore, my thought was to memory-map the files in chunks, in read and write buffers, and do the processing between these maps. from my reading, my understanding is that changes in the write buffer are written back to the disk copy automatically.
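A minimal sketch of the chunked memory-mapping idea, for illustration only. The class name MappedChunks, the method replaceByte and the chunk size are hypothetical. It maps one READ_WRITE window at a time and edits bytes in place; force() asks the OS to flush the changed pages to disk. Note the limitation: an in-place mapping cannot change the file's length, so it suits same-length edits (such as swapping two equal-width fields), while dropping lines still needs a separate output file.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedChunks {

    /**
     * Replaces every occurrence of one byte with another, mapping the file
     * one window of chunkSize bytes at a time. Returns the number of bytes changed.
     */
    static long replaceByte(Path file, int chunkSize, byte from, byte to) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        long replaced = 0;
        try (FileChannel channel = FileChannel.open(
                file, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            long size = channel.size();
            for (long offset = 0; offset < size; offset += chunkSize) {
                long length = Math.min(chunkSize, size - offset);
                // Map only the current window; the mapping lives outside the Java heap.
                MappedByteBuffer window =
                        channel.map(FileChannel.MapMode.READ_WRITE, offset, length);
                for (int i = 0; i < window.limit(); i++) {
                    if (window.get(i) == from) {
                        window.put(i, to); // absolute put modifies the mapped file content
                        replaced++;
                    }
                }
                window.force(); // request that changes in this window reach the disk
            }
        }
        return replaced;
    }
}
```

The window size here is a tuning knob rather than a correctness concern, since each window is released for garbage collection once the loop moves on.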

problem:
after spending a good number of hours today, reading about NIO through online resources, i can't figure out a design for accomplishing this. there's lots of high-level overviews with code snippets, but no practical problem-solving demonstrations with NIO.

1. can anyone suggest a good design for accomplishing my task? i understand that i need to open a channel on a FileInputStream to read and on a FileOutputStream to write, but i don't understand how to set up a buffer that i can process line-by-line. i have to process the files line by line.

2. how do i read the file in chunks? it's not clear to me that i have a pointer in the disk copy that functions like a cursor in a record set. i know i have a 'position' pointer in the buffer. from the file-reading standpoint, it would seem to make most sense to simply reuse the same buffer.

3. how significant is the size of the buffer? i tried creating a 200MB buffer, but it blew up the VM:



i have no notion of how much buffer is enough and i found no discussion of this topic.

4. i can't find any line-oriented functionality in NIO. how can i preserve line integrity when reading from and writing to byte or char buffers?

of course, i may be completely on the wrong track, and NIO may not be the answer, too.

thanks very much for any help. i've spent most of my saturday whipping this horse, and it's just not moving. ;-)

mp
 
Bartender
Posts: 9626
NIO doesn't have a performance advantage over stream-based IO (i.e. java.io.*). When NIO was introduced, the stream-based classes were re-implemented to use NIO under the covers. What NIO gives us is features not available in stream-based IO (see this article for examples).
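For the filtering task described in the question, a plain buffered-stream version might look like the sketch below. It is only an illustration: the tab delimiter, the ISO-8859-1 encoding, the 1 MB buffer size and the names LineFilter and filter are assumptions, not anything taken from the original application.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

final class LineFilter {
    // Assumed record layout: one record per line, tab-separated fields.
    private static final String DELIMITER = "\t";
    // Assumed buffer size; large enough to keep the number of system calls low.
    private static final int BUFFER_CHARS = 1 << 20;

    /**
     * Copies in to out line by line, skipping lines whose field at filterIndex
     * equals reject, and swapping fields swapA and swapB on the lines it keeps.
     * Returns the number of lines written.
     */
    static long filter(File in, File out, int filterIndex, String reject,
                       int swapA, int swapB) throws IOException {
        long written = 0;
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(new FileInputStream(in), StandardCharsets.ISO_8859_1),
                 BUFFER_CHARS);
             BufferedWriter writer = new BufferedWriter(
                 new OutputStreamWriter(new FileOutputStream(out), StandardCharsets.ISO_8859_1),
                 BUFFER_CHARS)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Limit -1 keeps trailing empty fields so records are not shortened.
                String[] fields = line.split(DELIMITER, -1);
                if (filterIndex < fields.length && fields[filterIndex].equals(reject)) {
                    continue; // filtered out
                }
                if (swapA < fields.length && swapB < fields.length) {
                    String tmp = fields[swapA];
                    fields[swapA] = fields[swapB];
                    fields[swapB] = tmp;
                }
                writer.write(String.join(DELIMITER, fields));
                writer.write('\n');
                written++;
            }
        }
        return written;
    }
}
```

Only one line is held in memory at a time, so heap use stays small regardless of file size. If this still pegs the CPU, profiling the per-line work (splitting and string creation) is probably more useful than switching I/O APIs.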
I don't know NIO well enough to answer your questions, and NIO hasn't really won enough mindshare to get a lot of feedback on this forum. I recommend that people who want to use NIO get a book.
I can say that creating a 200 MB buffer is doomed to failure with default settings. The default maximum heap size is 64MB, so you need command-line options to get a larger heap: -Xmx sets the maximum heap size and -Xms the initial size (for example, -Xmx512m).
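A huge buffer is not actually needed to read a huge file. As a rough sketch (not a tested production design), a FileChannel can fill one small ByteBuffer over and over: the channel keeps its own position in the file, and compact() carries an unfinished line over to the next read so lines stay intact. The names ChunkedLineReader and forEachLine, the single-byte ISO-8859-1 encoding and the rule that a line must fit in the buffer are all assumptions made for this example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.function.Consumer;

final class ChunkedLineReader {

    /** Passes each line of the file to handler; returns the number of lines seen. */
    static long forEachLine(Path file, int bufferSize, Consumer<String> handler)
            throws IOException {
        long lines = 0;
        ByteBuffer buf = ByteBuffer.allocate(bufferSize); // one heap buffer, reused
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            boolean eof = false;
            while (!eof) {
                eof = channel.read(buf) < 0; // the channel advances its own file position
                buf.flip();
                int lineStart = 0;
                for (int i = 0; i < buf.limit(); i++) {
                    if (buf.get(i) == '\n') {
                        handler.accept(decode(buf, lineStart, i));
                        lines++;
                        lineStart = i + 1;
                    }
                }
                if (eof) {
                    if (lineStart < buf.limit()) { // last line without a trailing newline
                        handler.accept(decode(buf, lineStart, buf.limit()));
                        lines++;
                    }
                } else {
                    if (lineStart == 0 && buf.limit() == buf.capacity()) {
                        throw new IOException("line longer than buffer of " + bufferSize + " bytes");
                    }
                    buf.position(lineStart);
                    buf.compact(); // move the partial line to the front, ready for more data
                }
            }
        }
        return lines;
    }

    // Decodes bytes [start, end) as one line, dropping a trailing '\r' if present.
    private static String decode(ByteBuffer buf, int start, int end) {
        int length = end - start;
        if (length > 0 && buf.get(end - 1) == '\r') {
            length--;
        }
        return new String(buf.array(), buf.arrayOffset() + start, length,
                StandardCharsets.ISO_8859_1);
    }
}
```

With this shape, a buffer in the range of tens of kilobytes to a few megabytes fits comfortably inside a default heap; the buffer only has to be larger than the longest line.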
Make sure that your hardware is up to the task. Moving 1GB of data is slow enough. If you don't have fast disks, lots of memory and a fast CPU, optimizing code isn't going to get you anything.
 