
MD5sum performance

 
Ranch Hand
Posts: 1907
Hi,
The md5sum utility built into Linux gives a 32-hex-digit hash. We are planning to use it to check whether a file is corrupt, i.e. by comparing the same file on two machines using MD5. But md5sum appears to take a long time, and the program needs to compare almost 1.5 million files. Is there a more efficient way to compare files for equality?
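For reference, here is a minimal Java sketch of computing the same 32-hex-digit MD5 hash that md5sum prints, using the standard java.security.MessageDigest API; the class and method names are my own:

```java
import java.io.IOException;
import java.io.InputStream;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Example {

    // Stream the file through MessageDigest so large files
    // are never loaded into memory in one piece.
    static String md5Of(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        // Format the 16-byte digest as the familiar 32-hex-digit string.
        return String.format("%032x", new BigInteger(1, md.digest()));
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("demo", ".txt");
        Files.write(tmp, "hello".getBytes());
        System.out.println(md5Of(tmp));
        Files.delete(tmp);
    }
}
```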
 
Bartender
Posts: 3323
It really depends on how accurately you want to compare them for equality. A very rough but very quick approach would be to compare the file sizes, whereas a very accurate but very slow approach would be to compare the files byte by byte.

Note: if two files have the same MD5 hash, that does not guarantee the files are identical; it just means both files hash to the same value. Having said that, it's very unlikely you'd have two different files with the same hash value by chance, although it is possible to deliberately construct a file that hashes to the same value as an existing file.
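The size-first-then-bytes idea above can be sketched in a few lines. This assumes Java 12 or later for Files.mismatch, which returns -1 when two files have identical contents; the class and method names are my own:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileCompare {

    // Cheap check first: files of different sizes cannot be equal.
    // Only fall back to a byte-by-byte comparison when sizes match.
    static boolean sameContent(Path a, Path b) throws IOException {
        if (Files.size(a) != Files.size(b)) {
            return false;
        }
        // Files.mismatch (Java 12+) returns -1 when contents are identical,
        // otherwise the offset of the first differing byte.
        return Files.mismatch(a, b) == -1L;
    }
}
```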
 
Bartender
Posts: 543
You mean you're generating your MD5s at compare time? If you'd generated them when saving the files, you wouldn't be in this mess. Change your save code to store a checksum alongside each file, run a script to save checksums for all existing files, and then just compare the checksums at runtime, and you're fine.

Anywho, try spawning as many threads as possible to compare files; as long as your garbage collector isn't taking a bigger timeslice than your process, you can optimize speed. The more detail you want in the comparison, the slower it's going to get.

As Tony said, a file-size comparison would be quite fast. If you compare sizes first and only compare bytes for same-sized files, that would likely be a very fast way to go.
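Once checksums are precomputed at save time, the runtime comparison reduces to diffing two path-to-checksum tables, one per machine. A small sketch of that idea (the class and method names are my own):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ChecksumManifest {

    // Given two maps of relative path -> checksum (one per machine),
    // return the paths whose checksums differ or that are missing
    // from the second machine. No file I/O happens at compare time.
    static List<String> differing(Map<String, String> a, Map<String, String> b) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : a.entrySet()) {
            if (!e.getValue().equals(b.get(e.getKey()))) {
                out.add(e.getKey());
            }
        }
        return out;
    }
}
```

In practice the manifests would be loaded from the checksum files the save-time script wrote out, so the 1.5 million files are hashed once each rather than on every comparison run.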
 
Tony Docherty
Bartender
Posts: 3323

Dieter Quickfend wrote:Anywho, try spawning as many threads as possible to compare files; as long as your garbage collector isn't taking a bigger timeslice than your process, you can optimize speed. The more detail you want in the comparison, the slower it's going to get.


I wouldn't go mad creating loads and loads of threads; you are likely to become bogged down in I/O blocking very quickly unless the files are spread across many different drives. Do some serious timing tests to find the optimum number of threads for your system before just throwing threads at the problem.
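A bounded, tunable pool is the usual way to act on that advice: fix the thread count up front and measure, rather than spawning one thread per file. A minimal sketch with java.util.concurrent (the pool size of 4 is an arbitrary placeholder to be tuned by those timing tests):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedCompare {

    public static void main(String[] args) throws InterruptedException {
        // Fixed-size pool caps concurrent disk readers; tune 4 by measurement.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        AtomicInteger compared = new AtomicInteger();

        for (int i = 0; i < 100; i++) {
            pool.submit(() -> {
                // Real code would hash or byte-compare one file pair here.
                compared.incrementAndGet();
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println("compared " + compared.get() + " pairs");
    }
}
```

Rerunning with different pool sizes (1, 2, 4, 8, ...) and timing each run is the simplest way to find the point where extra threads just contend for the same disk.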
 
Arjun Shastry
Ranch Hand
Posts: 1907
Thanks. Creating many threads is one option, but I need to see its effect on performance. Storing the MD5 sum and total byte count in a cache or table and comparing those at runtime is another option I'm considering.
 