I currently find myself in a situation where one of my database tables has ended up with a *lot* of duplicate rows - over 14 million where there should be more like a few thousand . I want to add a compound key constraint to the table to prevent this happening in future, but the database won't let me because of the existing duplicates.
Does anyone have any suggestions for clever SQL or DDL techniques to remove duplicate rows, leaving just one of each set of duplicates?
I'm using MySQL 5.0.27 if it helps, and the table in question has three columns:
If you want to keep to biggest sequence for each (parent, child), you could make a temp table and put only one record of each : (the following is only an idea, I don't know if it works on MySql) insert into retailitem_tag_temp select parent, child, max(sequence) from retailitem_tag group by parent, child.
Then you truncate retailitem_tag (after making a backup, just in case), and insert the temp into retailitem_tag. Or rename the temp into retailitem_tag.
There may be more clever to do, but I think this once should be ok. What do you think about it ?
The problem with an immediate delete is that it could raise a rollback error (run out of space). Frank said that there were 14 million records. But it's worth giving it a try.
Joined: Jan 07, 1999
Thanks for the replies folks. After some web searching I found a few ways of achieving what I wanted, so I thought I'd report them here just in case anyone else might need to know.
The first approach is a clever way of removing duplicate rows:
This creates a whole new temporary table containing only the distinct rows from the one table with duplicates, then cleans the original table and fills it from the temporary one.
It's pretty good, and quick, but it does not quite go all the way to what I need. It leaves rows with the same parent and child but a different sequence, so it still won't let me add a composite primary key on parent and child.
To remove these duplicates then some of the techniques suggested above would work.
However, In the end I went with a considerably simpler solution. MySQL supports an "ignore" indicator on "alter table" which does just what I want.
This happily (and silently) eliminated duplicates, leaving only one of each, and applied the key constraint at the same time. Excellent.
I don't know how applicable this is to other databases, but it certainly saved my bacon