Hi guys, I have two files (worst case, each 20 MB). Suppose file 1 is test1.txt and file 2 is test2.txt.
Now, I need to read test1.txt and compare its entries against test2.txt. If a line of test1.txt is not found in test2.txt, I need to process it; otherwise I ignore it. I have been using a BufferedReader to read from the files and building an ArrayList out of each file. My code is as below:
I am getting an OutOfMemoryError. Now, I am thinking of reading test1.txt in parts, storing each part in a Java object (maybe an ArrayList), and then having a separate loop in which I read test2.txt and compare its text with the ArrayList built from the partly read test1.txt. However, I want to know how I can resume reading test1.txt from the location where I stopped reading last time.
Pseudo functionality:
1. Read part of test1.txt and store this part in a Java object.
2. While reading test2.txt, compare each line against the Java object from above.
3. Resume reading the next part of test1.txt, then open test2.txt again and repeat.
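The loop above could be sketched roughly as follows. Keeping the same BufferedReader open across chunks is what lets you "resume": each readLine() simply continues where the last one stopped. The chunk size and the process() method are placeholders, and using a Set means duplicate lines within a chunk are collapsed.

```java
import java.io.*;
import java.util.*;

public class ChunkedCompare {
    static final int CHUNK_SIZE = 100_000; // lines per chunk; tune to your heap size

    public static void main(String[] args) throws IOException {
        // One reader stays open for the whole run, so each chunk picks up
        // exactly where the previous chunk left off.
        try (BufferedReader r1 = new BufferedReader(new FileReader("test1.txt"))) {
            List<String> chunk = new ArrayList<>();
            String line;
            while ((line = r1.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == CHUNK_SIZE) {
                    for (String s : missingFrom(chunk, "test2.txt")) process(s);
                    chunk.clear();
                }
            }
            if (!chunk.isEmpty()) {
                for (String s : missingFrom(chunk, "test2.txt")) process(s);
            }
        }
    }

    // Re-scans the other file once per chunk and returns the chunk lines
    // that never appeared in it.
    static List<String> missingFrom(List<String> chunk, String otherFile)
            throws IOException {
        Set<String> pending = new LinkedHashSet<>(chunk);
        try (BufferedReader r2 = new BufferedReader(new FileReader(otherFile))) {
            String line;
            while ((line = r2.readLine()) != null) {
                pending.remove(line);
            }
        }
        return new ArrayList<>(pending);
    }

    static void process(String line) {
        System.out.println("Not in test2.txt: " + line); // placeholder for real work
    }
}
```

Note the cost: test2.txt is re-read once per chunk, so a bigger chunk means fewer passes but more memory.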
Guys, I know this may not be an optimized way. Kindly help.
This is a pretty common problem. One fairly simple solution (minimize your effort -- usually a good thing) is to use an external utility to sort each file to a new file. If you're tight on RAM, do one at a time rather than in parallel. Once that's complete, performing a "diff" is a snap.
Here's an example of two sorted files, using single letters rather than whole lines, though there is no difference. The algorithm is slightly different if you can make certain assumptions like "Some lines in F1 aren't in F2 (must be processed), but every line in F2 is in F1". In other words, F2 is a subset of F1. Once you understand the general case, however, handling assumptions is easy.
I'll put it into words instead of pseudocode to provide a challenge.
Logically, you have a pointer into each file to a "current line". They both start at the first line of their file. In code, this entails simply a BufferedReader to maintain the pointer and a currentLine to hold the line just read. I'll refer to the lines as L1 and L2.
Do the following until you run out of lines. Compare the two lines using String.compareTo(). [note: does case matter? whitespace?]
If they are equal, read the next line for both files and continue the loop.
If L1 < L2 (sorts before it), F2 doesn't contain L1. Process L1 and read the next line from F1 only.
Otherwise, by necessity L1 > L2, meaning F1 doesn't contain L2. Do whatever you feel is necessary to L2 (flog, shred, decompile) and then read the next line from F2.
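The steps above can be sketched in Java like this (the sorted file names are assumptions, and the external sort must already have happened; the "process" action here just collects the line):

```java
import java.io.*;
import java.util.*;

public class SortedCompare {
    // Walks two readers over pre-sorted input and returns the lines that
    // appear in the first but not in the second.
    static List<String> onlyInFirst(BufferedReader f1, BufferedReader f2)
            throws IOException {
        List<String> result = new ArrayList<>();
        String l1 = f1.readLine();
        String l2 = f2.readLine();
        while (l1 != null && l2 != null) {
            int cmp = l1.compareTo(l2); // case and whitespace both matter here
            if (cmp == 0) {             // in both files: advance both
                l1 = f1.readLine();
                l2 = f2.readLine();
            } else if (cmp < 0) {       // L1 sorts first, so F2 lacks it: process
                result.add(l1);
                l1 = f1.readLine();
            } else {                    // L2 sorts first, so F1 lacks it: skip it
                l2 = f2.readLine();
            }
        }
        while (l1 != null) {            // tail of F1 has no possible match left
            result.add(l1);
            l1 = f1.readLine();
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Assumes both files were sorted first, e.g. by an external sort utility.
        try (BufferedReader f1 = new BufferedReader(new FileReader("test1.sorted.txt"));
             BufferedReader f2 = new BufferedReader(new FileReader("test2.sorted.txt"))) {
            for (String line : onlyInFirst(f1, f2)) {
                System.out.println("Process: " + line);
            }
        }
    }
}
```

Since only the two current lines are ever in memory at once (plus the results you choose to keep), the file sizes no longer matter.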
Give that a shot, and let me know if I glossed over too much.
The algorithm David mentioned for sorted inputs is one of my favorites, dating back to my first exercise in training. In the days of tape data storage it was often used in master-update programs: one file is the master file, the other holds the transactions.
If you cannot sort the files, say you are comparing Java sources or you have to detect MOVE as well as ADD or DELETE changes, you can look at the algorithm I used for Text Diff and see if you could find a way to NOT store every line in memory. Some thoughts ...
Store the hashcode, the offset in the file, and the length for each line instead of storing the text of each line. This could get you some false positive matches, so you might have to use random access to re-read the line and compare when you get a hashcode match.
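One possible shape for that fingerprint idea, as a sketch: the index costs one int plus two small fields per line instead of the line text, and RandomAccessFile handles the re-read on a hash hit. The byte-offset arithmetic assumes single-byte '\n' line endings and a single-byte-per-character encoding, so treat it as illustrative only.

```java
import java.io.*;
import java.util.*;

public class LineFingerprints {
    // Hash plus location on disk, instead of the line text itself.
    static class Key {
        final long offset; // byte offset where the line starts
        final int length;  // byte length of the line (without the newline)
        Key(long offset, int length) { this.offset = offset; this.length = length; }
    }

    // Index a file: map each line's hashCode to the places it occurs.
    static Map<Integer, List<Key>> index(File file) throws IOException {
        Map<Integer, List<Key>> map = new HashMap<>();
        long offset = 0;
        try (BufferedReader r = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = r.readLine()) != null) {
                int len = line.getBytes().length;
                map.computeIfAbsent(line.hashCode(), k -> new ArrayList<>())
                   .add(new Key(offset, len));
                offset += len + 1; // assumes a one-byte '\n' terminator
            }
        }
        return map;
    }

    // A hash match can be a false positive, so re-read the candidate line
    // via random access and compare the actual text.
    static boolean contains(Map<Integer, List<Key>> idx, File file, String line)
            throws IOException {
        List<Key> candidates = idx.get(line.hashCode());
        if (candidates == null) return false;
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            for (Key k : candidates) {
                byte[] buf = new byte[k.length];
                raf.seek(k.offset);
                raf.readFully(buf);
                if (new String(buf).equals(line)) return true;
            }
        }
        return false;
    }
}
```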
Build the data structures in a database instead of in memory. Sure to be much slower!
--- Editing: Just remembered I have another compare algorithm in a drawer (20 year old listing) that brings some number of lines into memory at a time. It can detect ADD, MOVE, DELETE actions within "n" lines, but not on broader scales. If your data has relatively localized differences that might work. Let me know and I can go look for it.
Let us know what you work out! [ May 04, 2005: Message edited by: Stan James ]
A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi