FileChannel, MappedByteBuffer, NIO questions

 
Pat Denton
Greenhorn
Posts: 17
I have a fairly large CSV file, 40,000+ lines, that I get from a client. I also have a CSV file exported from a database that is around the same number of lines, since it should contain the same data.

I am reading the client file and need to determine which records need to be updated in the export file. While I have something that works, it is taking way too long to execute, because for each line in my client file I have to search the entire export file for an account number to process that line. Do the math: that is roughly 40,000 × 40,000 line comparisons in the worst case, so the export file is being read far too many times.

So I thought that if I could index each line under its associated account number in a map, I could look up the data quickly using some NIO features. However, indexing the lines of a file has proven difficult. Could someone point me in the right direction on where I might find out how to do this?
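(For illustration: the sort of index I mean would look something like the sketch below - one pass over the file, storing each line's starting byte offset under its account number. The assumption that the account number is the first comma-separated column, and the class and file names, are just placeholders, not working code from my project.)

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

public class OffsetIndexSketch {

    // One pass over the CSV: map each account number (assumed to be the
    // first comma-separated column) to the byte offset where its line starts.
    static Map<String, Long> buildIndex(String fileName) throws IOException {
        Map<String, Long> index = new HashMap<String, Long>();
        RandomAccessFile raf = new RandomAccessFile(fileName, "r");
        try {
            long offset = raf.getFilePointer();    // start of the line about to be read
            String line;
            while ((line = raf.readLine()) != null) {
                int comma = line.indexOf(',');
                String account = (comma >= 0) ? line.substring(0, comma) : line;
                index.put(account, offset);
                offset = raf.getFilePointer();     // now the start of the next line
            }
        } finally {
            raf.close();
        }
        return index;
    }
}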

Thanks.
 
Ranch Hand
Posts: 37
If the files have some unique key, merge them into one file and sort them by that key. You may need to add an extra column that tells you which file a given row came from.

Now read through the combined file linearly. Records that relate to each other are on neighbouring lines. Perform the update, delete the additional column and write out the new export file.

If necessary, sort the export file into its original order.

That takes two sorts and one linear pass through the combined file. It should finish faster than you can read this message. :-)
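A rough sketch of that idea in Java - here keeping the two files as separate sorted lists and walking them in step, which has the same effect as sorting one combined file. It assumes the account number is the first comma-separated column and that both files fit in memory for the sort; the names and column position are placeholders:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class SortMergeSketch {

    // Key on the first comma-separated column (assumed to be the account number).
    static String key(String line) {
        int comma = line.indexOf(',');
        return (comma >= 0) ? line.substring(0, comma) : line;
    }

    static List<String> readAndSort(String fileName) throws IOException {
        List<String> lines = new ArrayList<String>();
        BufferedReader in = new BufferedReader(new FileReader(fileName));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        } finally {
            in.close();
        }
        Collections.sort(lines, new Comparator<String>() {
            public int compare(String a, String b) {
                return key(a).compareTo(key(b));
            }
        });
        return lines;
    }

    // Walk both sorted lists in step; rows with the same key meet each other here.
    static void merge(List<String> client, List<String> export) {
        int i = 0, j = 0;
        while (i < client.size() && j < export.size()) {
            int cmp = key(client.get(i)).compareTo(key(export.get(j)));
            if (cmp == 0) {
                // same account in both files: compare the remaining fields
                // and decide whether an update is needed
                i++;
                j++;
            } else if (cmp < 0) {
                i++;   // account only in the client file
            } else {
                j++;   // account only in the export file
            }
        }
    }
}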

Harald.
 
Pat Denton
Greenhorn
Posts: 17
Ok, here is what I have so far, which works, but I need to make it prettier. This is just a test class, so don't rip my horrible coding apart too badly. I'll clean it up when I figure out what I did.

 
Wanderer
Posts: 18671
If you're talking about updating individual records within a CSV file - are the records all the same length, or at least of constant length? If an update requires you to change the length of a record, you will probably end up effectively rewriting the rest of the file, shifting bytes forward or backward as necessary. That may require you to rethink the whole approach.

How big are the individual records? My first choice here would probably be to try to put the whole thing into a Map with the account number as key and a Record object (containing all the data for that line) as the value. If that can't all fit in memory at once, then Harald's suggestions about sorting the file by account number sound good to me.

Is either of these files currently sorted in any way? Would there be any problem with writing them out in a different order after you update?
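A minimal sketch of that first choice, assuming the whole export fits in memory, the account number is in the first column, and no field contains a quoted comma (so a plain split works); the Record shape and names are placeholders:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class RecordMapSketch {

    // One parsed CSV line, keyed in the map by its account number.
    static class Record {
        final String[] fields;
        Record(String[] fields) { this.fields = fields; }
    }

    static Map<String, Record> load(String exportFile) throws IOException {
        Map<String, Record> byAccount = new HashMap<String, Record>();
        BufferedReader in = new BufferedReader(new FileReader(exportFile));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split(",");           // naive split; no quoted commas assumed
                byAccount.put(fields[0], new Record(fields)); // account number assumed in column 0
            }
        } finally {
            in.close();
        }
        return byAccount;
    }
}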
 
Pat Denton
Greenhorn
Posts: 17
What happens is I build a series of batch updates that get sent to the database based on the export and client file.

I get the client file from an external source. My shell script runs an export of the current client data prior to executing my loader application. I then go through each line of the client file, locate the same account in the export file, and determine what data, if any, needs to be issued to the batch update. So I don't have to write to either file, only read. I can't load either file completely into memory, but it seems that loading 40,000+ account numbers along with an int offset is working OK. Tomorrow I'll have a chance to test it some more. The speed is quite amazing really. I was impressed.
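Roughly, the lookup side looks like the sketch below - map the export file, jump to the stored offset, and read up to the next line break. This is a simplified placeholder, not my actual loader: it assumes plain ASCII data and a file under 2 GB (the mapped-buffer limit), and in real code the file would be mapped once and the buffer reused for every lookup.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedLookupSketch {

    // Reads the line that starts at the given byte offset in a memory-mapped file.
    static String lineAt(String fileName, long offset) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(fileName, "r");
        try {
            FileChannel channel = raf.getChannel();
            MappedByteBuffer buf =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            buf.position((int) offset);
            StringBuilder line = new StringBuilder();
            while (buf.hasRemaining()) {
                char c = (char) buf.get();          // good enough for plain ASCII CSV
                if (c == '\r' || c == '\n') {
                    break;
                }
                line.append(c);
            }
            return line.toString();
        } finally {
            raf.close();
        }
    }
}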

Thanks for the suggestions.