Fastest Way to Remove Duplicates

Suppose you had a table that had two columns:

ID (integer)
Sentence (varchar 255)

Due to legacy reasons, the Sentence field had duplicates in it.

What would be the quickest method to get a UNIQUE on the Sentence column as well as remove the duplicates?

The only way I’ve come up with so far is to rebuild the table from scratch with unique on. Since there are over 25 million rows, that’s gonna take a LONG time )

Any bright ideas?

Here’s what I’d do:

Create a new column, Sentence_checksum BINARY(16), with a regular INDEX.

A binary MD5 checksum, 16 bytes in length rather than HEX alternative of 32… You could always put an index on the Sentence column, but to it would have to be full column length to correct match dupes - meaning a large index.

Then I would perform:

UPDATE table SET Sentence_checksum=UNHEX(MD5(Sentence))

Then I would create a table exactly like the old table, but this time Sentence_checksum would be altered to a UNIQUE index type.

I would then perform:

INSERT INTO table_new SELECT * FROM table_old GROUP BY Sentence_checksum

Then DROP the old table, and rename the new one.

That’s a great idea!