Hi all, can you please help me find a solution for checking duplicate file content? That is, if a file called one.txt and another file called two.txt contain the same data, how can we check that? Any algorithm? Any suggestions? Please respond.
I'd probably use a stream of some sort to read the bytes and compare them one at a time. Check that the file lengths are equal first.
If you want to ignore the difference between Unix and Windows newlines, you could use a reader instead of a stream and compare lines, or just skip all \n and \r characters when comparing bytes. If you ignore these, the lengths may not match, so skip the length check in that case.
If high speed is a requirement, try both and see which is faster in your own environment. JDK version, OS, disk hardware may make a difference.
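The stream-based approach described above can be sketched roughly like this (a minimal example; the class and method names are made up for illustration):

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class FileCompare {

    // Returns true if both files contain exactly the same bytes.
    // The length check up front catches most mismatches without reading anything.
    static boolean sameContent(File a, File b) throws IOException {
        if (a.length() != b.length()) {
            return false;
        }
        try (InputStream in1 = new BufferedInputStream(new FileInputStream(a));
             InputStream in2 = new BufferedInputStream(new FileInputStream(b))) {
            int c1, c2;
            do {
                c1 = in1.read();
                c2 = in2.read();
                if (c1 != c2) {
                    return false;
                }
            } while (c1 != -1); // -1 means both streams reached end-of-file together
            return true;
        }
    }
}
```

The BufferedInputStream wrappers matter here: without them, each single-byte read() would hit the disk.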
A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. — John Ciardi
posted 15 years ago
Thanks, Stan James. Now let me describe the scenario where I have to check for file duplication. I upload a file from the client side (Internet Explorer) and it is sent to the server; later, when I upload any file that contains the same data, some validation should be imposed on the server side so that the later uploaded file is not processed further.
One solution that I have found is not quite appropriate. It works as follows: I generate a hash key for the uploaded file with a one-way hash algorithm and store the hash key in the database; the next time I upload a file, I generate its hash key and compare it against all the hash keys stored in the database, to confirm that no previously uploaded file contains the same data. But this is not airtight, as two different files may generate the same hash key.
So can anyone please give some suggestions on this issue?
I believe using a hash as you describe is appropriate, but it's not the complete solution. When you upload a new file, you need to check it against all the existing files to see if it duplicates any of them. Using a hash allows you to quickly and efficiently eliminate the vast majority of files, without having to reread them and compare bytes one at a time. If two files have different hashes, they are different, period. However, if two hashes are identical, then you probably have to compare the bytes in those two files. I would probably use NIO for efficiency:
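A minimal NIO sketch of that final byte-for-byte check, assuming the files are small enough to memory-map in one piece (the class and method names are invented for the example):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class NioCompare {

    // Memory-maps both files and lets ByteBuffer.equals() compare the contents,
    // so the JVM never copies the file data through Java-level read loops.
    static boolean sameBytes(String path1, String path2) throws IOException {
        try (FileChannel ch1 = new FileInputStream(path1).getChannel();
             FileChannel ch2 = new FileInputStream(path2).getChannel()) {
            if (ch1.size() != ch2.size()) {
                return false; // different lengths can never match
            }
            MappedByteBuffer buf1 = ch1.map(FileChannel.MapMode.READ_ONLY, 0, ch1.size());
            MappedByteBuffer buf2 = ch2.map(FileChannel.MapMode.READ_ONLY, 0, ch2.size());
            return buf1.equals(buf2); // compares all remaining bytes
        }
    }
}
```

For very large files you would map and compare the files in fixed-size windows instead of all at once.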
[ November 18, 2004: Message edited by: Jim Yingst ]
This is what MD5 (Message Digest) is for. See http://java.sun.com/j2se/1.4.2/docs/api/java/security/MessageDigest.html for more details, but in a nutshell, it generates a 16-byte array based on the contents of the data passed into it. With 2^128 possible values, the likelihood of two different files accidentally producing the same MD5 digest is vanishingly small, far better than billions to one.
If you are working with a small set of files, a Hashtable (or better, a HashMap) should work just fine, using the digest as the key (with, say, the file name as the value).
Hope this helps, Joseph
P.S. Be sure to convert the array of bytes to a String (a hex string, for example) and use that as the key, so that the map's hashing works properly; byte arrays don't override equals() and hashCode(), so two arrays with the same contents would not match as keys.
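Putting the digest and the map together, a rough sketch of the duplicate check (the class, method, and variable names here are made up for illustration):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

public class DuplicateIndex {

    // hex digest of the contents -> name of the first upload with that content
    private final Map<String, String> seen = new HashMap<>();

    // Computes the MD5 digest and renders it as a hex String,
    // so it can safely be used as a map key.
    static String md5Hex(byte[] data) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(data);
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Registers an upload. Returns the name of the earlier file with the
    // same contents, or null if this content has not been seen before.
    String register(String fileName, byte[] contents) throws NoSuchAlgorithmException {
        String key = md5Hex(contents);
        String earlier = seen.get(key);
        if (earlier == null) {
            seen.put(key, fileName);
        }
        return earlier;
    }
}
```

In a real upload service the map would be replaced by the database table of stored hash keys, and, as Jim suggests, a non-null result would trigger a byte-for-byte comparison before rejecting the upload.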
[ December 03, 2004: Message edited by: Joseph Maddison ]