JavaRanch » Java Forums » Java » Java in General

Duplicate Files Remover

Imad Ali

Joined: Jan 04, 2009
Posts: 21
First of all, LOL at this forum.

Anyway, I'm interested in GUI tweaking, mainly user-friendliness and related principles.

I need to demonstrate this on a duplicate file remover/scanner program.

I've done some work on it. Some information you need to know if you want to help me:
The program needs to be a Java application
It will run in a JRE
It's a standalone program

OK, let me go through the intended or expected end-user phases:
User loads the program; it comes on screen
User clicks Scan; the application then scans the local disk
Duplicate files are deleted automatically (and hopefully displayed)

Please let me in on some information about how to better incorporate an MD5 checksum into a GUI.

I want to learn by example or snippet ideally, nothing too complicated please, because I want to know exactly what each code chunk does.


Rob Spoor

Joined: Oct 27, 2005
Posts: 20279

The first thing is to determine how you figure out file equality. A basic approach is to check the file length first; if the lengths are equal, compare the full contents.

You can use File.listFiles for browsing through your hard disk.
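A minimal sketch of that traversal, assuming a recursive helper (the class and method names here are illustrative, not from the original posts). Note that File.listFiles returns null for paths that aren't readable directories, which the code must handle:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class FileWalker {
    // Recursively collects all regular files under the given root.
    public static List<File> collect(File root) {
        List<File> result = new ArrayList<>();
        File[] children = root.listFiles(); // null if root is not a readable directory
        if (children == null) {
            return result;
        }
        for (File child : children) {
            if (child.isDirectory()) {
                result.addAll(collect(child)); // descend into subdirectories
            } else if (child.isFile()) {
                result.add(child);
            }
        }
        return result;
    }
}
```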

Now I've done something similar, and here's my approach:
- use a Map<Long, List<File>> that stores the unique files per file size
- when you encounter a file, get the List<File> for its length
- compare the file with all elements of that List<File> (if it's not null)
- if it is equal to any of them, process it as a duplicate; otherwise add it to the list (creating the list and putting it in the map first if it was null)

The latter part is straightforward to put into code.
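The steps above can be sketched roughly as follows (a simplified illustration, not Rob's actual code; for very large files you would stream and compare in chunks rather than call Files.readAllBytes):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicateIndex {
    // Unique files seen so far, grouped by file length.
    private final Map<Long, List<File>> uniqueBySize = new HashMap<>();

    // Returns true if 'file' duplicates a previously seen file;
    // otherwise records it as a new unique file and returns false.
    public boolean isDuplicate(File file) throws IOException {
        List<File> candidates =
                uniqueBySize.computeIfAbsent(file.length(), k -> new ArrayList<>());
        for (File candidate : candidates) {
            // Same length, so compare the full contents.
            if (Arrays.equals(Files.readAllBytes(candidate.toPath()),
                              Files.readAllBytes(file.toPath()))) {
                return true;
            }
        }
        candidates.add(file); // no match: this file is unique so far
        return false;
    }
}
```

Only files whose lengths collide ever get their contents read, which keeps the common case cheap.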

William Brogden
Author and all-around good cowpoke

Joined: Mar 22, 2000
Posts: 13037
Imad Ali wrote: Please let me in on some information about how to better incorporate a Md5 checksum to a GUI

I don't see what the use of MD5 or any other checksum to determine file equality has to do with a GUI. Perhaps you could have a dialog that gives a choice of equality-checking methods, but that's about it. Surely nobody needs to see the numeric value.

Richard Tookey

Joined: Aug 27, 2012
Posts: 1166

Many years ago I wrote a program to deal with duplicate files on a disk, and I still use it. My basic approach is to process a collection of 'roots' that will be scanned for duplicates. Starting at the roots, I recursively visit the file tree, creating a map that uses the SHA-1 digest of the file content (MD5 will do just as well) as the key, with a Set of file names as the value. Files with the same content will produce the same digest. Of course, it is possible that two files with different content will also have the same digest, BUT in the 10 years or so I have used the program it has never found two different files with the same digest.
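A digest-keyed map of this kind might look like the following sketch (my own illustration of the idea, not Richard's program; it digests whole files, whereas he describes a partial-digest optimisation below):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.stream.Stream;

public class DigestIndex {
    // Maps hex-encoded SHA-1 digest of file content -> paths sharing that digest.
    // Any entry whose Set has more than one path is a duplicate candidate.
    public static Map<String, Set<Path>> index(Path root)
            throws IOException, NoSuchAlgorithmException {
        Map<String, Set<Path>> byDigest = new HashMap<>();
        try (Stream<Path> paths = Files.walk(root)) {
            for (Path p : (Iterable<Path>) paths.filter(Files::isRegularFile)::iterator) {
                MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
                byte[] digest = sha1.digest(Files.readAllBytes(p));
                StringBuilder hex = new StringBuilder();
                for (byte b : digest) {
                    hex.append(String.format("%02x", b)); // digest bytes to hex string
                }
                byDigest.computeIfAbsent(hex.toString(), k -> new HashSet<>()).add(p);
            }
        }
        return byDigest;
    }
}
```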

In my first version I too wanted to automatically delete duplicates but I soon found that there are difficulties with doing this. First, given two or more files that have the same content which one(s) do you delete? Second, especially in HTML, it is frequently better to have duplicate files than to try to cross link the HTML sources. To get round these problems I allow the user to specify the minimum file length to consider and to be presented with a list of duplicate files and the user selects which one(s), if any, to delete.

Obviously, taking the SHA-1 digest means one in principle has to read the whole file. I found this to be unnecessary and ended up taking the digest of only the first 1,000,000 bytes. I do have a paranoid check: if I find two files with the same digest, I then check the whole file content for absolute equality. This content comparison normally takes very little time, since one first checks that the lengths match and only reads the content if they do.

The GUI is not complicated. Just a file selection system to select the roots, a JComboBox size selector, and a JTable to present the results. On selecting a duplicate, one is given the option to delete it or move it to a backup area.

Carey Brown
Ranch Hand

Joined: Nov 19, 2001
Posts: 893

Basing your comparison on checksums has two problems: 1) it is possible (though remotely) that two files with the same checksum are not identical, and 2) computing checksums takes far longer than doing a byte-by-byte comparison, because a byte-by-byte comparison can bail out as soon as two bytes differ; it doesn't (usually) need to read the entire file, as the checksum approach does.
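That early-exit comparison might be sketched like this (an illustration under my own naming, with the cheap length check done first):

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class FileCompare {
    // Returns true only if both files have identical content.
    // Bails out at the first differing byte, so mismatches are cheap.
    public static boolean sameContent(File a, File b) throws IOException {
        if (a.length() != b.length()) {
            return false; // different lengths can never match
        }
        try (InputStream inA = new BufferedInputStream(new FileInputStream(a));
             InputStream inB = new BufferedInputStream(new FileInputStream(b))) {
            int byteA;
            while ((byteA = inA.read()) != -1) {
                if (byteA != inB.read()) {
                    return false; // first mismatch ends the comparison
                }
            }
        }
        return true; // same length, no mismatches
    }
}
```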

I created two utilities for myself, one where you specify two directory roots, one for comparison and one for deletion, and another program that takes a list of one or more roots but presents the list of duplicates to the user to identify which ones to delete.

One of the GUI parameters in both programs is the path requirement: (none) don't care what the path is, (file) the file names must match, or (dir/file) both the file name and its parent directory name must match, etc.
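Such a path requirement could be modelled with a small enum (my own sketch of the idea, not Carey's code):

```java
import java.io.File;

public class PathMatch {
    public enum Requirement { NONE, FILE, DIR_FILE }

    // Decides whether two files satisfy the selected path requirement.
    public static boolean matches(File a, File b, Requirement req) {
        switch (req) {
            case NONE:
                return true; // a content match alone is enough
            case FILE:
                return a.getName().equals(b.getName()); // file names must agree
            case DIR_FILE:
                // Both the file name and the parent directory name must agree.
                return a.getName().equals(b.getName())
                        && a.getParentFile() != null && b.getParentFile() != null
                        && a.getParentFile().getName()
                            .equals(b.getParentFile().getName());
            default:
                return false;
        }
    }
}
```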

Understanding the scope of the problem is the first step on the path to true panic
Winston Gutkowski

Joined: Mar 17, 2011
Posts: 8948

Carey Brown wrote:Basing your comparison on checksums has two problems: 1) it is possible (though remotely) that two files with the same checksum are not identical, and 2) computing checksums takes far longer than doing a byte-by-byte comparison...

Actually, not by much, since they generally only involve maths or binary operations on the bytes/characters in sequence, which is likely to be far quicker than actually reading them.

And I'd add a third problem: It's tough to know if anyone else is updating one of your files while you're doing the check. If Java has that sort of capability, I've never heard about it - or used it.

It may also be worth pointing out that this thread is from 2009; so I suspect Elvis has left the building.


Bats fly at night, 'cause they aren't we. And if we tried, we'd hit a tree -- Ogden Nash (or should've been).
Articles by Winston can be found here
Ivan Jozsef Balazs

Joined: May 22, 2012
Posts: 970
Some time ago I also wrote such a program, and I am still using it.
Mine has no GUI but a CUI: it takes the directory names to peruse as arguments.
It uses an MD5 digest, and before deleting a file it does a byte-for-byte check against the one to keep.

The question of which one to keep arose naturally.
The program keeps the files on the file system Z:, which is the NAS device visible to all computers in my modest local network,
and prefers to keep the file with the longer path name, that is, the one in a more elaborate directory.

Once wifey's computer's hard disk became defective and unusable by the operating system, and good old TestDisk and PhotoRec
(merci beaucoup, Christophe GRENIER!) saved the very files but without the directory structure.
Then I learned that she had moved the files into an elaborate directory scheme, the loss of which was painful for her,
even though the files got rescued. The whole operation took several days (including nights for the program to run).