File APIs for Java Developers
Manipulate DOC, XLS, PPT, PDF and many others from your application.
http://aspose.com/file-tools
The moose likes JDBC and the fly likes Finding Duplicate Records Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login
JavaRanch » Java Forums » Databases » JDBC
Bookmark "Finding Duplicate Records" Watch "Finding Duplicate Records" New topic
Author

Finding Duplicate Records

Corey McGlone
Ranch Hand

Joined: Dec 20, 2001
Posts: 3271
I'm trying to clean up a database table which contains duplicate records. I want to put in a constraint so that we won't have duplicates in the future but, before I can put the constraint in place, I need to remove all the existing duplicates. I have a table that looks like this:



In this case, I'd like to locate the records that look like this:



How can I write a query to do this? I'm a little stumped on the SQL. Also, is there an easy way to remove just all duplicates so I'm left with only one, or is that best done manually?

Thanks.


SCJP Tipline, etc.
Jeanne Boyarsky
author & internet detective
Marshal

Joined: May 26, 2003
Posts: 31077
    
233

Corey,
The easiest way I can think of is to copy all the records to a temp table and then re-insert them using select distinct.

While I can imagine a way of identifying duplicates (using group by and having), I can't think of a way to delete them in a non database specific way. Of course, you probably don't care about that as you are dealing with a specific database. Which one?

I'm curious if anyone has a reply that doesn't involve a temp table.


[Blog] [JavaRanch FAQ] [How To Ask Questions The Smart Way] [Book Promos]
Blogging on Certs: SCEA Part 1, Part 2 & 3, Core Spring 3, OCAJP, OCPJP beta, TOGAF part 1 and part 2
Paul Campbell
Ranch Hand

Joined: Oct 06, 2007
Posts: 338
Originally posted by Jeanne Boyarsky:
Corey,
The easiest way I can think of is to copy all the records to a temp table and then re-insert them using select distinct.

While I can imagine a way of identifying duplicates (using group by and having), I can't think of a way to delete them in a non database specific way. Of course, you probably don't care about that as you are dealing with a specific database. Which one?

I'm curious if anyone has a reply that doesn't involve a temp table.


ahh but Jeanne... where's the challenge in that? :p

Use a subquery with an aggregate function such as MIN to arbitrarily choose the ID to retain:

delete from mytable
where myPK not in ( select min(myPK)
from mytable
group by SomeID )
[ November 27, 2007: Message edited by: Paul Campbell ]
Corey McGlone
Ranch Hand

Joined: Dec 20, 2001
Posts: 3271
Originally posted by Paul Campbell:
delete from mytable
where myPK not in ( select min(myPK)
from mytable
group by SomeID )


That worked great, Paul, and after seeing what you presented, I'm surprised I hadn't thought of it myself - I had done something very similar in another query I had written. Thanks for the help.
Paul Campbell
Ranch Hand

Joined: Oct 06, 2007
Posts: 338
you're welcome... it is one of those solutions that a lot of very, very smart people find hard to visualize when they first encounter it since it requires you to think counter from how most of us solve problems.
steve souza
Ranch Hand

Joined: Jun 26, 2002
Posts: 861
Just wanted to make a further point for anyone that may read the post. It is usually easy enough to write a query to delete the duplicates, however in most cases you don't simply want to delete a record based on an aggregate function.

It often can require analysis to know which of the dupes should be removed. For example you may have the same product several times with all different prices. The one to keep is the one with the right price. That often takes analysis. God help you if you have many dupes in a large table.

That is why it is so important to not allow duplicates in the first place.


http://www.jamonapi.com/ - a fast, free open source performance tuning api.
JavaRanch Performance FAQ
Paul Campbell
Ranch Hand

Joined: Oct 06, 2007
Posts: 338
Originally posted by steve souza:
Just wanted to make a further point for anyone that may read the post. It is usually easy enough to write a query to delete the duplicates, however in most cases you don't simply want to delete a record based on an aggregate function.

It often can require analysis to know which of the dupes should be removed. For example you may have the same product several times with all different prices. The one to keep is the one with the right price. That often takes analysis. God help you if you have many dupes in a large table.

That is why it is so important to not allow duplicates in the first place.


Steve, the example is only intended solve the problem of duplicate rows. It isn't intended to solve the problem of duplicated identifiers or all possible instances of data duplication... though the logic process would likely be the same in your example... you identify what you're keeping and delete the rest.
Jeanne Boyarsky
author & internet detective
Marshal

Joined: May 26, 2003
Posts: 31077
    
233

Originally posted by Paul Campbell:
you're welcome... it is one of those solutions that a lot of very, very smart people find hard to visualize when they first encounter it since it requires you to think counter from how most of us solve problems.

But I bet they then remember it forever. That's a great solution. I look forward to citing it in the future
 
I agree. Here's the link: http://aspose.com/file-tools
 
subject: Finding Duplicate Records