I have a file (file 1) with 47 million lines in it. I also have a file (file 2) that is the same as file 1 except that it differs in the following ways:

- File 2 could have new lines added
- File 2 could be missing lines that are in file 1

I have to do two things:

1. Efficiently take all of file 1's lines and insert them into a database. This is fairly straightforward, but any thoughts on doing it rapidly are welcome.
2. Find all the lines in file 1 that are missing from file 2. Can you help me figure out the fastest way to perform this search without reading both files into a database and doing some sort of diff on the tables? I would prefer to do this before the database insert.

Thanks for all your help, and I'm looking forward to your responses.

-- Rodney
Are the lines sorted in any way? If not, you'll probably have to put everything into a database, since there's no way to tell whether a given line is really missing or just in a different location without reading the whole file.

If the lines are sorted somehow, you can keep two readers open and read the files line by line, switching from one reader to the other to keep them roughly in sync. E.g. if file 1 has

A B C D E G

and file 2 has

A B D E F G

you can do something like this:

read 1: A
read 2: A
read 1: B
read 2: B
read 1: C
read 2: D  - missing C in file 2 detected; read 1 up to D
read 1: D  - caught up
read 1: E
read 2: E
read 1: G
read 2: F  - missing F in file 1 detected; read 2 up to G
read 2: G  - caught up

The logic may take some thought to code right, but it's certainly doable, and much faster than searching the database for each line. It only works, though, if you have some way of knowing when you've read too far in one file.