The md5sum utility that is built into Linux produces a 32-hex-digit hash. We are planning to use it to check whether a file is corrupt, i.e. by comparing the MD5 of the same file on two machines. But md5sum appears to take a long time, and the program needs to compare almost 1.5 million files. Is there a more efficient way to check files for equality?
It really depends on how accurately you want to compare them for equality. A very rough but very quick approach would be to compare the file sizes whereas a very accurate but very slow approach would be to compare each byte.
Note: if two files have the same MD5 hash, that does not guarantee the files are the same; it just means that both files hash to the same value. Having said that, it's very unlikely that two different files will share a hash value by accident, although MD5 collisions can be constructed deliberately (it is feasible to craft two different files that hash to the same value).
You mean you're generating your MD5s at compare time? If you'd generated them when saving the files, you wouldn't be in this mess. Change your save code to store a checksum alongside each file, run a one-off script to store checksums for all the current files, then just compare the checksums at runtime, and you're fine.
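To make the "save a checksum up front" idea concrete, here is a minimal sketch in Python (the thread doesn't say which language the program is written in, so treat that as an assumption). The names md5_of_file and build_manifest are made up for illustration. The one-off script would walk the directory once and record a digest per file:

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files never have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root, '/'-separated) to its MD5."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): md5_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
```

Each machine would build its manifest once and update the entry whenever a file is saved, so the expensive hashing is no longer done at compare time.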
Anywho, try spawning as many threads as possible to compare files. As long as your garbage collector isn't taking more timeslice than your process, you can optimize speed. The more detail in comparison you want, the slower it's going to get.
As Tony said, file size comparison would be quite fast. If you compare sizes first and then compare bytes only for the files whose sizes match, that would likely be a very fast way to go.
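A rough sketch of that size-then-bytes check, in Python. It assumes both copies can be opened from the same machine (for example over a network mount), which the original question does not confirm, and files_equal is just an illustrative name:

```python
import os


def files_equal(path_a, path_b, chunk_size=1 << 16):
    """Return True if both files have identical contents."""
    # Cheap check first: files of different sizes can never be equal.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            block_a = a.read(chunk_size)
            block_b = b.read(chunk_size)
            if block_a != block_b:
                return False  # stop at the first differing block
            if not block_a:
                return True  # both files ended without a difference
```

The standard library's filecmp.cmp(path_a, path_b, shallow=False) does much the same thing. If the two copies really live on separate machines with no shared mount, the bytes would have to cross the network anyway, and comparing hashes is likely the cheaper option.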
Dieter Quickfend wrote: Anywho, try spawning as many threads as possible to compare files. As long as your garbage collector isn't taking more timeslice than your process, you can optimize speed. The more detail in comparison you want, the slower it's going to get.
I wouldn't go mad creating loads and loads of threads. You are likely to become bogged down in blocking I/O very quickly unless the files are spread across many different drives. You need to do some serious timing tests to see how many threads is optimal for your system before just throwing threads at the problem.
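One way such a timing test could look, sketched in Python with a standard thread pool. The function names and the worker counts tried are arbitrary choices for illustration, not recommendations:

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def time_hashing(paths, worker_counts=(1, 2, 4, 8)):
    """Return {worker_count: elapsed_seconds} for hashing all paths."""
    timings = {}
    for workers in worker_counts:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # list() forces every hash to finish before the clock stops.
            list(pool.map(md5_of_file, paths))
        timings[workers] = time.perf_counter() - start
    return timings
```

Bear in mind that the operating system caches files it has just read, so later runs over the same sample can look faster than they really are. Timing each worker count on a different, similar-sized sample of files would give a fairer picture.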
Thanks. Creating many threads is one option, but I need to see its effect on performance. Storing the MD5 sum and total byte count in a cache or some table and comparing them at run time is another option I'm thinking of.
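For what it's worth, the stored-table comparison could be as small as the Python sketch below. It assumes each machine keeps a dictionary mapping a relative path to a (size_in_bytes, md5_hex) pair; that layout and the name find_mismatches are invented for illustration:

```python
def find_mismatches(local, remote):
    """Return sorted paths that are missing on one side or differ.

    Both arguments map a relative path to a (size_in_bytes, md5_hex) tuple.
    """
    mismatched = []
    for path in sorted(set(local) | set(remote)):
        # .get() returns None for a missing path, which never equals a tuple,
        # so files present on only one machine are reported too.
        if local.get(path) != remote.get(path):
            mismatched.append(path)
    return mismatched


# Made-up sample entries, not real measurements.
local_table = {"a.txt": (5, "aaa"), "b.txt": (7, "bbb")}
remote_table = {"a.txt": (5, "aaa"), "b.txt": (7, "ccc"), "c.txt": (1, "ddd")}
print(find_mismatches(local_table, remote_table))  # ['b.txt', 'c.txt']
```

With the tables already stored, comparing 1.5 million entries is only dictionary lookups, so the run-time cost should be small next to hashing the files themselves.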