Rapid IO(I/O) and Search

Rodney Woodruff
Ranch Hand

Joined: Dec 04, 2001
Posts: 80
I have a file (file 1) with 47 million lines in it. I also have a second file (file 2) that is the same as file 1 except in the following ways:
File 2 could have new lines added
File 2 could be missing lines that are in file 1
I have to do two things:
1. Efficiently insert all of the lines in file 1 into a database. This is fairly straightforward, but any thoughts on doing it rapidly are welcome.
2. Find all the lines in file 1 that are missing from file 2. Can you help me figure out the fastest way to perform this search without reading both files into a database and doing some sort of diff on the tables? I would prefer to do this before doing the insert into the database.
Thanks for all your help and I'm looking forward to your responses.
-- Rodney

Hope This Helps
Jim Yingst

Joined: Jan 30, 2000
Posts: 18671
Are the lines sorted in any way? If not, you'll probably have to put everything into a database, since without reading the whole file there's no way to tell whether a given line is really missing or just in a different location.
If the lines are sorted somehow, you can keep two readers open and read them line by line, switching from one reader to the other to keep them roughly in sync. E.g. if file 1 has the lines
A, B, C, D, E, G
and file 2 has the lines
A, B, D, E, F, G
you can do something like this:
read 1: A
read 2: A
read 1: B
read 2: B
read 1: C
read 2: D - missing C in file 2 detected; read 1 up to D
read 1: D - caught up
read 1: E
read 2: E
read 1: G
read 2: F - missing F in file 1 detected; read 2 up to G
read 2: G - caught up
The logic may take some thought to code right, but it's certainly doable, and much faster than searching a DB for each line. But it only works if you have some way of knowing when you've read too far in one file.
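To make the two-reader idea concrete, here is a minimal sketch. It assumes both files are sorted in the same order that String.compareTo uses, which is an assumption on my part and not something stated about the real data. The class name SortedFileDiff and the method diff are made up for illustration. Repeated lines are treated as separate entries, so an extra copy in one file is reported as missing from the other.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.List;

// Hypothetical helper; the class and method names are illustrative only.
class SortedFileDiff {

    /**
     * Walks two inputs whose lines are sorted in String.compareTo order
     * (an assumption) and collects the lines found in only one of them.
     */
    static void diff(Reader file1, Reader file2,
                     List<String> onlyIn1, List<String> onlyIn2) throws IOException {
        BufferedReader r1 = new BufferedReader(file1);
        BufferedReader r2 = new BufferedReader(file2);
        String a = r1.readLine();
        String b = r2.readLine();

        while (a != null && b != null) {
            int cmp = a.compareTo(b);
            if (cmp == 0) {
                // Same line in both files: advance both readers.
                a = r1.readLine();
                b = r2.readLine();
            } else if (cmp < 0) {
                // File 2 has already moved past this line, so it is missing there.
                onlyIn1.add(a);
                a = r1.readLine();
            } else {
                // File 1 has already moved past this line, so it is missing there.
                onlyIn2.add(b);
                b = r2.readLine();
            }
        }

        // Whatever is left after one file runs out exists only in the other.
        while (a != null) { onlyIn1.add(a); a = r1.readLine(); }
        while (b != null) { onlyIn2.add(b); b = r2.readLine(); }
    }
}
```

For real files you could pass in readers from java.nio.file.Files.newBufferedReader. Collecting results into lists is only for illustration; with 47 million lines you would probably write each missing line out as you find it, so that memory use stays constant.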

"I'm not back." - Bill Harding, Twister