• Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other all forums
this forum made possible by our volunteer staff, including ...
Marshals:
  • Campbell Ritchie
  • Ron McLeod
  • Paul Clapham
  • Bear Bibeault
  • Junilu Lacar
Sheriffs:
  • Jeanne Boyarsky
  • Tim Cooke
  • Henry Wong
Saloon Keepers:
  • Tim Moores
  • Stephan van Hulst
  • Tim Holloway
  • salvin francis
  • Frits Walraven
Bartenders:
  • Scott Selikoff
  • Piet Souris
  • Carey Brown

Comparing two huge files

 
Ranch Hand
Posts: 182
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Hi Guys
I have 2 files (worst case - each 20 MB). Suppose file 1 : test1.txt and file 2 : test2.txt

Now, i need to read test1.txt and compare the entries in test2.txt. If text of test1.txt not found in test2.txt, i need to process else ignore.
I have been using a BufferedReader to read from the files and am building a ArrayList out of each file. My code is as below:



I am getting OutOfMemoryError. Now, i am thinking of reading test1.txt in parts, storing it in a java Object (may b ArrayList) and than having a seperate loop wherein i read test2.txt and compare the text with the ArrayList of test1.txt(partly read file). However i want to know how can i resume reading test1.txt from the location where i had stopped reading last time

pseudo functionality :

read partly from test1.txt
store this part in a java Object
while reading test2.txt, compare the text whether it is present in the java object from above.

resume reading next part of test1.txt
-- again open test2 to read.



Guys, i know this may not be an optimized way. Kindly help.
 
Ranch Hand
Posts: 1646
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
This is a pretty common problem. One fairly simple solution (minimize your effort -- usually a good thing) is to use an external utility to sort each file to a new file. If you're tight on RAM, do one at a time rather than in parallel. Once that's complete, performing a "diff" is a snap.

Here's an example of two sorted files, using single letters rather than whole lines though there is no difference.The algorithm is slightly different if you can make certain asumptions like "Some lines in F1 aren't in F2 (must be processed), but every line in F2 is in F1". In other words, F2 is a subset of F1. Once you understand the general case, however, handling assumptions is easy.

I'll put it into words instead of psuedocode to provide a challenge.

Logically, you have a pointer into each file to a "current line". They both start at the first line of their file. In code, this entails simply a BufferedReader to maintain the pointer and a currentLine to hold the line just read. I'll refer to the lines as L1 and L2.

Do the following until you run out of lines. Compare the two lines using String.compareTo(). [note: does case matter? whitespace?]
  • If they are equal, read the next line for both files and continue the loop.
  • If L1 < L2 (sorts before it), F2 doesn't contain L1. Process L1 and read the next line from F1 only.
  • By necessity L1 > L2, meaning F1 doesn't contain L2. Do whatever you feel is necessary to L2 (flog, shread, decompile) and then read the next line from F2.

  • Give that a shot, and let me know if I glossed over too much.
     
    (instanceof Sidekick)
    Posts: 8791
    • Mark post as helpful
    • send pies
    • Quote
    • Report post to moderator
    The algorithm David mentioned for sorted inputs is one of my favorites dating back to my first excercise in training. In the days of tape data storage it was often used in master-update programs, one file is the master file, the other is transactions.

    If you cannot sort the files, say you are comparing Java sources or you have to detect MOVE as well as ADD or DELETE changes, you can look at the algorithm I used for Text Diff and see if you could find a way to NOT store every line in memory. Some thoughts ...

    Store hashcode, offset and length in the file for each line instead of storing the text for each line. This could get you some false positive matches so you might have to use random access to re-read the line and compare when you get hashcode matches.

    Build the data structures in a database instead of in memory. Sure to be much slower!

    ---
    Editing: Just remembered I have another compare algorithm in a drawer (20 year old listing) that brings some number of lines into memory at a time. It can detect ADD, MOVE, DELETE actions within "n" lines, but not on broader scales. If your data has relatively localized differences that might work. Let me know and I can go look for it.

    Let us know what you work out!
    [ May 04, 2005: Message edited by: Stan James ]
     
    Ranch Hand
    Posts: 53
    • Mark post as helpful
    • send pies
    • Quote
    • Report post to moderator
    plz help
    i need to compair tow files without saving ,just know if thay equal or not files content dont need to be in same order??how could i??
     
    nana Amr
    Ranch Hand
    Posts: 53
    • Mark post as helpful
    • send pies
    • Quote
    • Report post to moderator
    there is a method to compair files:
    public int compareTo(File pathname)

    plz i dont know how to use it ?
     
    Ranch Hand
    Posts: 1071
    • Mark post as helpful
    • send pies
    • Quote
    • Report post to moderator
    That method only compares file names.

    You've been given a couple of solutions already. The horribly performing brute force method would be:
     
    nana Amr
    Ranch Hand
    Posts: 53
    • Mark post as helpful
    • send pies
    • Quote
    • Report post to moderator
    if the file content are equals but in diffrent order or contain spaces or "\n" dose it work?
     
    Steven Bell
    Ranch Hand
    Posts: 1071
    • Mark post as helpful
    • send pies
    • Quote
    • Report post to moderator
    Considering that is sudocode and I've put no definition on what counts as a matching line, yes it would work. However it is the worst possible implementation in terms of performance.
     
    nana Amr
    Ranch Hand
    Posts: 53
    • Mark post as helpful
    • send pies
    • Quote
    • Report post to moderator
    should i use FileReader/FileWriter and BufferedReader/BufferedWriter?
     
    Stan James
    (instanceof Sidekick)
    Posts: 8791
    • Mark post as helpful
    • send pies
    • Quote
    • Report post to moderator
    Buffered fer sure. That does fewer and larger physical IOs.
     
    A teeny tiny vulgar attempt to get you to buy our stuff
    Building a Better World in your Backyard by Paul Wheaton and Shawn Klassen-Koop
    https://coderanch.com/wiki/718759/books/Building-World-Backyard-Paul-Wheaton
      Bookmark Topic Watch Topic
    • New Topic