File APIs for Java Developers
Manipulate DOC, XLS, PPT, PDF and many others from your application.
The moose likes JDBC and Relational Databases and the fly likes SQL puzzle about removing duplicate rows Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login

Win a copy of REST with Spring (video course) this week in the Spring forum!
JavaRanch » Java Forums » Databases » JDBC and Relational Databases
Bookmark "SQL puzzle about removing duplicate rows" Watch "SQL puzzle about removing duplicate rows" New topic

SQL puzzle about removing duplicate rows

Frank Carver

Joined: Jan 07, 1999
Posts: 6920
I currently find myself in a situation where one of my database tables has ended up with a *lot* of duplicate rows - over 14 million where there should be more like a few thousand . I want to add a compound key constraint to the table to prevent this happening in future, but the database won't let me because of the existing duplicates.

Does anyone have any suggestions for clever SQL or DDL techniques to remove duplicate rows, leaving just one of each set of duplicates?

I'm using MySQL 5.0.27 if it helps, and the table in question has three columns:

create table retailitem_tag (
parent varchar(64),
child varchar(255),
sequence int

I want to add primary key(parent,child) so any rows with the same parent and child values should be considered as duplicates.

All suggestions welcome!

Read about me at ~ Raspberry Alpha Omega ~ Frank's Punchbarrel Blog
Christophe Verré

Joined: Nov 24, 2005
Posts: 14688

If you want to keep to biggest sequence for each (parent, child), you could make a temp table and put only one record of each :
(the following is only an idea, I don't know if it works on MySql)
insert into retailitem_tag_temp select parent, child, max(sequence) from retailitem_tag group by parent, child.

Then you truncate retailitem_tag (after making a backup, just in case), and insert the temp into retailitem_tag. Or rename the temp into retailitem_tag.

There may be more clever to do, but I think this once should be ok. What do you think about it ?

[My Blog]
All roads lead to JavaRanch
Ernest Friedman-Hill
author and iconoclast

Joined: Jul 08, 2003
Posts: 24195

Assuming "sequence" is a unique id, then the solution just looks like

this keeps the row with the lowest sequence value for each parent/child combination.

[Jess in Action][AskingGoodQuestions]
Christophe Verré

Joined: Nov 24, 2005
Posts: 14688

The problem with an immediate delete is that it could raise a rollback error (run out of space). Frank said that there were 14 million records. But it's worth giving it a try.
Frank Carver

Joined: Jan 07, 1999
Posts: 6920
Thanks for the replies folks. After some web searching I found a few ways of achieving what I wanted, so I thought I'd report them here just in case anyone else might need to know.

The first approach is a clever way of removing duplicate rows:

This creates a whole new temporary table containing only the distinct rows from the one table with duplicates, then cleans the original table and fills it from the temporary one.

It's pretty good, and quick, but it does not quite go all the way to what I need. It leaves rows with the same parent and child but a different sequence, so it still won't let me add a composite primary key on parent and child.

To remove these duplicates then some of the techniques suggested above would work.

However, In the end I went with a considerably simpler solution. MySQL supports an "ignore" indicator on "alter table" which does just what I want.

This happily (and silently) eliminated duplicates, leaving only one of each, and applied the key constraint at the same time. Excellent.

I don't know how applicable this is to other databases, but it certainly saved my bacon
I agree. Here's the link:
subject: SQL puzzle about removing duplicate rows
It's not a secret anymore!