MD5 Generation of large files

 
Ranch Hand
Posts: 34
Hi out there,
I have to test whether two files on two different systems are identical. I thought MD5 would be the right way to do this.
Well, it all works fine, but generating an MD5 hash for very large files is slow (you can drink a lot of coffee during that operation :-)).
OK, my little question: if I generate the hash only for, say, the first 1024 (or 4096 or ...) bytes of a file, do you think it would still be a unique hash?
Thanks a lot
Bye
Mark
 
Author and all-around good cowpoke
Posts: 13078
Don't be silly: you can never be SURE the files are identical unless you digest the entire file.
However - you could efficiently decide the files are NOT identical if digests of selected parts are NOT identical. Which parts to choose would depend on where the files come from.
Naturally you are comparing the file lengths first, right?
Bill
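For anyone who wants to try Bill's suggestion, here is a minimal sketch of the quick-reject idea: compare lengths first, then an MD5 digest of a prefix. The class name `QuickReject`, its method names, and the `prefixBytes` parameter are made up for illustration and are not from the thread. It also assumes both files are reachable from one JVM; with files on two systems, each side would compute its own digest and only the digests would be exchanged.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
class QuickReject {

    /** MD5 digest of at most the first maxBytes bytes of the file. */
    static byte[] prefixDigest(Path file, int maxBytes)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int remaining = maxBytes;
            while (remaining > 0) {
                int n = in.read(buf, 0, Math.min(buf.length, remaining));
                if (n < 0) {
                    break; // file is shorter than maxBytes
                }
                md.update(buf, 0, n);
                remaining -= n;
            }
        }
        return md.digest();
    }

    /**
     * Returns true if the files certainly differ.
     * A false result only means "no difference found so far",
     * NOT that the files are identical.
     */
    static boolean definitelyDifferent(Path a, Path b, int prefixBytes)
            throws IOException, NoSuchAlgorithmException {
        if (Files.size(a) != Files.size(b)) {
            return true; // different lengths: no hashing needed
        }
        return !Arrays.equals(prefixDigest(a, prefixBytes),
                              prefixDigest(b, prefixBytes));
    }
}
```

Note that two files which differ only after the first `prefixBytes` bytes pass this check, which is exactly why a prefix digest can reject but never confirm.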
 
Mark Mescher
Ranch Hand
Posts: 34
Hi,
Sure, I compare the filename and length first, and after that the MD5. As I see it, a hash of the first 1024 bytes should be enough to be nearly unique, shouldn't it?
Mark
 
author
Posts: 14112
A hash is *never* unique. MD5 was designed so that it would be infeasible to *deliberately* create two files with the same hash (a goal that published collision attacks have since undermined), and in any case there can't be a guarantee that two different files will have different hashes: after all, there are many more possible file contents than possible hash values.

So the only *reliable* way to compare two files is to do it byte by byte. Only if calculating and comparing a hash is much faster (because the bytes would need to be sent over a slow network, for example) does it make sense to compare hashes first, and then do a byte-by-byte compare only if the hashes are equal.
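For the remote case, a rough sketch of digesting a whole file in fixed-size chunks, so memory use stays flat no matter how large the file is. The class name `FileMd5`, the method `md5Hex`, and the 64 KB buffer size are illustrative choices, not anything from the original posts. Each system would run this on its own copy and exchange only the resulting hex string.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, for illustration only.
class FileMd5 {

    /** Streams the whole file through MD5 and returns the digest as lowercase hex. */
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[64 * 1024]; // buffer size is an arbitrary assumption
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n); // feed only the bytes actually read
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Equal hex strings from both sides make identical files very likely but, as said above, not certain; unequal strings prove the files differ.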

If you are working locally, a byte-by-byte compare will be much faster than computing and comparing hashes anyway.
[ October 25, 2004: Message edited by: Ilja Preuss ]
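For the local case, a byte-by-byte compare could look roughly like the sketch below. The class name `ByteCompare` and the method `sameContent` are hypothetical. Unlike a digest, it can stop at the first differing byte.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, for illustration only.
class ByteCompare {

    /** Returns true only if both files have exactly the same bytes. */
    static boolean sameContent(Path a, Path b) throws IOException {
        if (Files.size(a) != Files.size(b)) {
            return false; // cheap check before reading any content
        }
        try (InputStream x = new BufferedInputStream(Files.newInputStream(a));
             InputStream y = new BufferedInputStream(Files.newInputStream(b))) {
            int bx;
            while ((bx = x.read()) != -1) {
                if (bx != y.read()) {
                    return false; // first mismatch ends the comparison early
                }
            }
            return y.read() == -1; // both streams must end together
        }
    }
}
```

On Java 12 or later, `Files.mismatch(a, b)` from the standard library does the same job and returns -1 when the contents are identical.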
 