MD5sum performance

Arjun Shastry
Ranch Hand

Joined: Mar 13, 2003
Posts: 1874
Hi,
MD5Sum, which is built into Linux, gives a 32-digit hash. We are planning to use it to check whether a file is corrupt, i.e. by comparing the same file on two machines via its MD5 sum. But md5sum appears to take a long time, and the program needs to compare almost 1.5 million files. Is there a more efficient way to compare files for equality?
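
For reference, a minimal sketch of computing an md5sum-style hash in Java (the class and method names are mine, not from the original program), using MessageDigest, a DigestInputStream, and Java 7+ try-with-resources:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Util {

    // Returns the MD5 digest of the file as a 32-character hex string,
    // the same form the md5sum command prints.
    public static String md5Hex(String path) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[8192];
        // Reading through the DigestInputStream feeds every byte into the digest.
        try (InputStream in = new DigestInputStream(
                new BufferedInputStream(new FileInputStream(path)), md)) {
            while (in.read(buffer) != -1) {
                // no work needed here; the stream updates the digest as we read
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }
}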


MH
Tony Docherty
Bartender

Joined: Aug 07, 2007
Posts: 2289
It really depends on how accurately you want to compare them for equality. A very rough but very quick approach would be to compare the file sizes, whereas a very accurate but very slow approach would be to compare each byte.

Note: if two files have the same MD5 hash it does not guarantee the files are the same; it just means that both files hash to the same value. Having said that, it's very unlikely you have two different files with the same hash value, although it is possible to deliberately generate a file that will hash to the same value as an existing file.
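
For illustration, a rough sketch of combining those two checks (my own code, not from the thread): compare lengths first, and only fall back to a buffered byte-by-byte read when the lengths match.

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class FileCompare {

    // Cheap check first: different lengths means different content.
    // Only when the lengths match do we pay for a full byte-by-byte read.
    public static boolean sameContent(File a, File b) throws IOException {
        if (a.length() != b.length()) {
            return false;
        }
        try (InputStream in1 = new BufferedInputStream(new FileInputStream(a));
             InputStream in2 = new BufferedInputStream(new FileInputStream(b))) {
            int byteA;
            while ((byteA = in1.read()) != -1) {
                if (byteA != in2.read()) {
                    return false;   // first differing byte
                }
            }
            return true;            // same length, no differences found
        }
    }
}
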
Dieter Quickfend
Bartender

Joined: Aug 06, 2010
Posts: 543

You mean you're generating your MD5s at compare time? If you'd generated them when saving the files, you wouldn't be in this mess. Change your save code to store a checksum, run a script to record checksums for all the current files, and then just compare the checksums at runtime, and you're fine.
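
A minimal sketch of that idea, assuming the checksums are recorded in standard md5sum output format (one "<32-hex-digit hash>  <path>" line per file) and that a manifest file exists for each machine; the class and method names here are hypothetical:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class ChecksumManifest {

    // Parses md5sum-style output into a map from path to checksum.
    // Assumes well-formed lines: 32 hex digits, a separator, then the path.
    public static Map<String, String> load(String manifestPath) throws IOException {
        Map<String, String> checksums = new HashMap<String, String>();
        try (BufferedReader reader = new BufferedReader(new FileReader(manifestPath))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue;
                }
                String hash = line.substring(0, 32);
                String path = line.substring(32).trim();
                checksums.put(path, hash);
            }
        }
        return checksums;
    }

    // Prints every path whose checksum differs from (or is missing on) the other machine.
    public static void report(Map<String, String> local, Map<String, String> remote) {
        for (Map.Entry<String, String> entry : local.entrySet()) {
            String remoteHash = remote.get(entry.getKey());
            if (!entry.getValue().equals(remoteHash)) {
                System.out.println("MISMATCH: " + entry.getKey());
            }
        }
    }
}

With something like this, comparing 1.5 million files at run time becomes a lookup over two in-memory maps rather than 1.5 million hash computations.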

Anywho, try spawning as many threads as possible to compare files; as long as your garbage collector isn't taking a bigger timeslice than your process, you can optimize for speed. The more detail you want in the comparison, the slower it's going to get.

As Tony said, a file-size comparison would be quite fast. If you compare sizes first and only compare bytes on same-sized files, that would likely be a very fast way to go.


Oracle Certified Professional: Java SE 6 Programmer && Oracle Certified Expert: (JEE 6 Web Component Developer && JEE 6 EJB Developer)
Tony Docherty
Bartender

Joined: Aug 07, 2007
Posts: 2289
Dieter Quickfend wrote:Anywho, try spawning as many threads as possible to compare files; as long as your garbage collector isn't taking a bigger timeslice than your process, you can optimize for speed. The more detail you want in the comparison, the slower it's going to get.

I wouldn't go mad creating loads and loads of threads; you are likely to become bogged down in I/O blocking very quickly unless the files are spread across many different drives. You need to do some serious timing tests to see how many threads is optimal for your system before just throwing threads at the problem.
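
As a sketch of that more cautious approach (the names are mine, and compareOne is a placeholder for whichever comparison ends up being used): submit the comparisons to a small fixed-size pool and tune the pool size with timing runs instead of creating a thread per file.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCompare {

    // filePairs is assumed to hold two-element arrays {pathA, pathB}.
    public static int countMismatches(List<String[]> filePairs, int poolSize)
            throws Exception {
        // A small, fixed pool: poolSize is something to find by measurement,
        // not "as many threads as possible".
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        List<Future<Boolean>> results = new ArrayList<Future<Boolean>>();
        for (final String[] pair : filePairs) {
            results.add(pool.submit(new Callable<Boolean>() {
                public Boolean call() throws Exception {
                    return compareOne(pair[0], pair[1]);
                }
            }));
        }
        int mismatches = 0;
        for (Future<Boolean> result : results) {
            if (!result.get()) {     // blocks until that comparison finishes
                mismatches++;
            }
        }
        pool.shutdown();
        return mismatches;
    }

    // Placeholder: plug in a size check, byte comparison, or checksum lookup here.
    private static boolean compareOne(String pathA, String pathB) {
        return true;
    }
}
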
Arjun Shastry
Ranch Hand

Joined: Mar 13, 2003
Posts: 1874
Thanks. Creating many threads is one option, but I need to see its effect on performance. Storing the MD5 sum and total byte count in a cache or table and comparing those at run time is another option I'm thinking of.
 