Fastest Way to Remove Duplicates

GeoffreyF67 · May 30, 2007, 5:31pm

Suppose you had a table that had two columns:

ID (integer)
Sentence (varchar 255)

Due to legacy reasons, the Sentence field had duplicates in it.

What would be the quickest method to get a UNIQUE on the Sentence column as well as remove the duplicates?

The only way I’ve come up with so far is to rebuild the table from scratch with a UNIQUE index on Sentence. Since there are over 25 million rows, that’s gonna take a LONG time :)

Any bright ideas?

Speeple · June 1, 2007, 4:02am

Here’s what I’d do:

Create a new column, Sentence_checksum BINARY(16), with a regular INDEX.

A binary MD5 checksum is 16 bytes long, rather than the 32 characters of the HEX alternative. You could always put an index on the Sentence column, but it would have to cover the full column length to correctly match dupes, meaning a large index.

Then I would perform:

UPDATE table SET Sentence_checksum=UNHEX(MD5(Sentence))
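
For anyone following along, the column-and-checksum step might look roughly like the sketch below. The table name `sentences` and the index name `idx_sentence_checksum` are made up for illustration (the thread only says "table"), so adjust them to your schema.

```sql
-- Hypothetical names: `sentences` and `idx_sentence_checksum` are not from the thread.
ALTER TABLE sentences
  ADD COLUMN Sentence_checksum BINARY(16) NULL,
  ADD INDEX idx_sentence_checksum (Sentence_checksum);

-- MD5() returns a 32-character hex string; UNHEX() packs it into 16 raw bytes.
UPDATE sentences
SET Sentence_checksum = UNHEX(MD5(Sentence));

-- Optional sanity check: how many rows would be dropped as duplicates?
SELECT COUNT(*) - COUNT(DISTINCT Sentence_checksum) AS duplicate_rows
FROM sentences;
```

One thing to keep in mind: MD5 works on the string's bytes, so two sentences that differ only in letter case get different checksums, even if a case-insensitive collation on Sentence would treat them as equal.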

Then I would create a table exactly like the old table, but this time Sentence_checksum would be altered to a UNIQUE index type.

I would then perform:

INSERT INTO table_new SELECT * FROM table_old GROUP BY Sentence_checksum

Then DROP the old table, and rename the new one.
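
A rough sketch of the copy-and-swap steps, under a few assumptions: the table names `sentences` and `sentences_new` and the index names are hypothetical, and the existing regular index on the checksum is assumed to be called `idx_sentence_checksum`. Note that `SELECT *` combined with `GROUP BY` is rejected when the `ONLY_FULL_GROUP_BY` SQL mode is enabled, so this version names the columns and uses aggregates instead. It also renames before dropping, rather than dropping first, so the old data is still around if something goes wrong.

```sql
-- Hypothetical names throughout; the thread only says table_old / table_new.
CREATE TABLE sentences_new LIKE sentences;

-- Swap the regular index for a UNIQUE one while the new table is still empty.
ALTER TABLE sentences_new
  DROP INDEX idx_sentence_checksum,
  ADD UNIQUE INDEX uq_sentence_checksum (Sentence_checksum);

-- Keep one row per checksum. Keeping the lowest ID is an assumption;
-- MIN(Sentence) is safe because all sentences in a group are identical
-- (barring an MD5 collision).
INSERT INTO sentences_new (ID, Sentence, Sentence_checksum)
SELECT MIN(ID), MIN(Sentence), Sentence_checksum
FROM sentences
GROUP BY Sentence_checksum;

-- Swap both names in one statement, then drop the old copy once you have checked the result.
RENAME TABLE sentences TO sentences_old, sentences_new TO sentences;
DROP TABLE sentences_old;
```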

GeoffreyF67 · June 1, 2007, 5:23pm

That’s a great idea!

Thanks!
