I Googled that a while ago. How do you remove duplicate rows from a SQL database? (MySQL specifically.)
If you have a ‘simple’ duplicate setup, it’s pretty easy, and lots of other sites go into great detail. (And in that case, you might look into how to require that fields be unique so you don’t have that problem anymore…)
In our case, we ran into an obscure problem where we expected the combination of two fields to be unique. Without publicizing anything secret, we had a summary table of user_id and month. (Not that it’s actually anything secret, but I refer to the table I did this on as “tablename,” just since blogging about our production database schemas seems like a bad idea…) A brief bug introduced some duplicates. My task? Remove the duplicate rows.
So I went about it somewhat backwards, and started wondering how do a SELECT DISTINCT across two columns. You can’t, really. Someone recommended something clever: CONCAT() across the two field names, and a DISTINCT on that: SELECT DISTINCT(CONCAT(user_id, month))… After toying with that a bit, I was able to create a temporary table housing a de-duped version of our data:
create temporary table nodupes
SELECT DISTINCT(CONCAT(user_id, month)) asdf, tablename.*
FROM tablename GROUP BY asdf;
But how does that help? The problem with temporary tables is that they’re, uhh, temporary. If you lived on the edge, you could probably “DROP TABLE original_table; CREATE TABLE original_table (SELECT * FROM temp_table)”, but this has a few horrible problems. One is that you’re throwing away all your disk-based data so you can populate it with a version that exists only in RAM. If anything went wrong, you’d lose the data. Furthermore, in our case, this is a live, busy table. Dropping it in production is awful, and it would take many minutes to repopulate it since it’s a huge table.
So then it was clear what I wanted to do: something like DELETE FROM original_table WHERE id NOT IN(SELECT * FROM temp_table_with_no_dupes), except not using an IN() on a table with hundreds of thousands of rows. And this is where it got tricky. The key is to use that temporary table to derive a new temporary table, housing only the duplicates. (Actually, speaking of “the key,” create one in your temporary table: CREATE INDEX gofast ON nodupes(id); — you’ll need this soon, as it’s the difference between the next query taking 10 seconds and me aborting it after 30 minutes.)
The temporary table of duplicate rows is basically derived by joining your ‘real’ table with the list of de-duped tables, with a rather obscure trick: joining where id=NULL. So here’s what I did:
CREATE TEMPORARY TABLE dupes
LEFT OUTER JOIN nodupes nd
ON nd.id=tablename.id WHERE nd.id IS NULL);
For bonus points, we’re selecting from the real table, so the silly CONCAT() field doesn’t come through into this temporary table.
So once again, I momentarily flashed a sly grin. This is going to be easy! DELETE FROM tablename WHERE id IN(SELECT id FROM dupes);. But then my sly grin faded. It turns out that there are two problems here. The first is that there are still a lot of rows. In my case, I was doing an IN() with about 6,000 rows. MySQL doesn’t like that. But there’s a bigger problem. (Thanks to Andrew for confirming that this is a real bug: #9090.) MySQL doesn’t use indexes when you use a subquery like that. It seems the bug is pretty narrow in scope, but this is precisely it. So that query is awful, basically doing 6,000 full table scans. And on a really big table.
So the solution comes from something I’ve never even thought about doing before. MySQL calls it the “multi-delete.” I call it the, “Holy smokes, this could go really, really wrong if I’m not careful” delete.
-- Don't run this yet
FROM tablename AS bb
RIGHT JOIN dupes d
That’s the command you want, but a few caveats. First, change the first line to “SELECT *” and run it to see what happens. Make sure it’s only grabbing the duplicate rows. You’re not able to use a LIMIT in a multidelete query, so you have to be extremely careful here. Second, we tested this multiple times on various test databases before doing it against our master database, and I was still uneasy when it came time to hit enter. Keep backups before you do this. And finally, in many cases, MySQL will expect an explicit “USE database_name” before the multi-delete, where database_name is the database housing the table you’re deleting from. Keep that in mind if you get an error that it can’t find the table, even if you explicitly reference it (via database_name.table_name).