
Rapid IO(I/O) and Search

 
Rodney Woodruff
Ranch Hand
Posts: 80
I have a file (file 1) with 47 million lines in it. I also have a second file (file 2) that is the same as file 1 except that it may differ in the following ways:
File 2 could have new lines added
File 2 could be missing lines that are in file 1
I have to do two things:
1. Efficiently take all the lines in file 1 and insert them into a database. This is fairly straightforward, but any thoughts on doing it quickly are welcome.
2. Find all the lines in file 1 that are missing from file 2. Can you help me figure out the fastest way to perform this search without reading both files into a database and doing some sort of diff on the tables? I would prefer to do this before inserting anything into the database.
Thanks for all your help and I'm looking forward to your responses.
-- Rodney
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
Are the lines sorted in any way? If not, you'll probably have to put everything into a database, since there's no way to tell whether a given line is really missing or just in a different location unless you read the whole file.
If the lines are sorted somehow, you can keep two readers open and read them line by line, switching from one reader to another to keep them roughly in sync. E.g. if file 1 has
A
B
C
D
E
G
and file 2 has
A
B
D
E
F
G
you can do something like this
read 1: A
read 2: A
read 1: B
read 2: B
read 1: C
read 2: D - missing C in file 2 detected; read 1 up to D
read 1: D - caught up
read 1: E
read 2: E
read 1: G
read 2: F - missing F in file 1 detected; read 2 up to G
read 2: G - caught up
The logic may take some thought to code right, but it's certainly doable, and much faster than searching a DB for each line. It only works, though, if you have some way of knowing when you've read too far in one file, which is exactly what a known sort order gives you.
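To make the two-reader idea concrete, here is a minimal sketch in Java. It assumes both files are sorted ascending in the same order that String.compareTo uses; the class name SortedLineDiff and the method name findMissing are made up for illustration. It reports only the lines of file 1 that are absent from file 2, and simply skips over lines that exist only in file 2.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper class, not part of any library.
final class SortedLineDiff {

    /**
     * Passes every line of "first" that has no counterpart in "second" to onMissing.
     * Assumption: both inputs are sorted ascending by String.compareTo.
     * Only one line per reader is held in memory at any time.
     */
    static void findMissing(BufferedReader first, BufferedReader second,
                            Consumer<String> onMissing) throws IOException {
        String a = first.readLine();
        String b = second.readLine();
        while (a != null) {
            // Once file 2 is exhausted, every remaining line of file 1 is missing.
            int cmp = (b == null) ? -1 : a.compareTo(b);
            if (cmp == 0) {
                // Same line in both files: advance both readers.
                a = first.readLine();
                b = second.readLine();
            } else if (cmp < 0) {
                // File 2 has already moved past this line, so it is missing there.
                onMissing.accept(a);
                a = first.readLine();
            } else {
                // This line exists only in file 2; skip it and let file 2 catch up.
                b = second.readLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // The example from the post: file 1 = A B C D E G, file 2 = A B D E F G.
        BufferedReader file1 = new BufferedReader(new StringReader("A\nB\nC\nD\nE\nG\n"));
        BufferedReader file2 = new BufferedReader(new StringReader("A\nB\nD\nE\nF\nG\n"));
        List<String> missing = new ArrayList<>();
        findMissing(file1, file2, missing::add);
        System.out.println("Missing from file 2: " + missing);
    }
}
```

For real files, the readers could come from java.nio.file.Files.newBufferedReader(path) instead of StringReader. Note that duplicate lines are matched one-for-one here, so a line that appears twice in file 1 but once in file 2 is reported once; whether that is the desired behavior depends on the data.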
 