Large varchar unique keys?

This is sort of on the edge of being a performance question, in that the way I've implemented it involves a trigger on a table, which might not be necessary.

I've a requirement for a table with potentially very long strings; at the moment I have the column set to VARCHAR(1024), which might still be too small.

The problem is that this long string has to be unique.

The MySQL key size is limited to 1000 bytes or so, and with utf8 (three bytes per character) that means you're limited to about 333 characters for the key… so if I have a string in the database of length 500, e.g.
aaaaaaa…aaaa (length 500)

and I add

aaaaaaa…aaab (length 500)

this will fail with a duplicate-key error, since only the first 333 characters are ever checked for the unique key.
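A minimal sketch of the problem (table and column names are made up; utf8's three bytes per character force the explicit 333-character prefix on the unique key):

```sql
CREATE TABLE long_strings (
    id  INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    val VARCHAR(1024) NOT NULL,
    -- Unique index on the first 333 characters only:
    UNIQUE KEY uk_val (val(333))
) CHARACTER SET utf8;

INSERT INTO long_strings (val) VALUES (REPEAT('a', 500));

-- Differs only at position 500, beyond the indexed prefix,
-- so this fails with a duplicate-key error:
INSERT INTO long_strings (val) VALUES (CONCAT(REPEAT('a', 499), 'b'));
```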

I fixed this by keeping a normal index on the column for searches, but adding a trigger that computes md5(upper('newvalue')) for newly added strings and stores the result in a column called 'myhash', which has a unique key on it.
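Roughly, the setup looks like this (the table, column, and trigger names here are invented for illustration):

```sql
-- Non-unique prefix index for searches, unique key on the hash:
ALTER TABLE long_strings
    ADD COLUMN myhash CHAR(32) NOT NULL,
    ADD UNIQUE KEY uk_myhash (myhash),
    ADD INDEX ix_val (val(333));

DELIMITER //
CREATE TRIGGER trg_long_strings_bi
BEFORE INSERT ON long_strings
FOR EACH ROW
BEGIN
    -- Hash of the upper-cased value; a duplicate hash
    -- makes the insert fail on uk_myhash
    SET NEW.myhash = MD5(UPPER(NEW.val));
END//
DELIMITER ;
```

If rows can be edited later, a matching BEFORE UPDATE trigger would presumably be needed as well so the hash stays in sync with the string.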

In the real world, I should never come across duplicates (the table should really only get to about 1 million rows max)… but on the off chance, should I make another column 'myhash2' using a different hash function, and put a unique constraint on myhash+myhash2?

Chances of getting a duplicate MD5 hash are minute, but it's the 'birthday paradox': the more hashes you have, the higher the chance of a duplicate.

Looking around again, this wiki is better than the last, with some probability values: Birthday_attack

It shows I’d need
26,000,000,000,000 hashes to have a 10^-12 chance of two hashes being the same

unless I’m reading it wrong :wink:

I can live with that :)
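For reference, the standard birthday-bound approximation for n 128-bit hashes gives the same figure:

```latex
p \approx \frac{n^2}{2 \cdot 2^{128}}
\;\;\Rightarrow\;\;
n \approx \sqrt{2^{129} \cdot p}
      = \sqrt{2^{129} \cdot 10^{-12}}
      \approx 2.6 \times 10^{13}
```

i.e. roughly 26,000,000,000,000 hashes for a 10^-12 collision probability.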

I guess it really depends on what the MD5 hash is being used for, and how critical it is.

What's the worst that could happen if there was an MD5 collision? It's extremely unlikely for arbitrary data, but if you rely on it being 100% unique, sod's law dictates that it might just bite you one day :smiley:

If it's absolutely critical, then sure, I guess you could also apply a second hash, but to save on performance only check that in the unlikely event the first hash collides.
Personally I use MD5 for hashes, but I've never found a reason to use them for keys.

On the other hand, you can over-engineer stuff too. If it's not critical (i.e. security or financial data), then in most cases it would suffice to just rely on the hash being unique, but at least have some monitoring in place for when there is a collision, so you're notified that an insert failed.

making a trigger to do a md5(upper('newvalue'))

I'm assuming you are using a case-insensitive collation. Otherwise you probably do not want to UPPER() the plaintext.
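A quick way to check what a collation considers equal (utf8_general_ci and utf8_bin used as examples here):

```sql
-- A case-insensitive collation treats 'abc' and 'ABC' as the same value,
-- so MD5(UPPER(val)) keeps the unique hash consistent with the column:
SELECT _utf8'abc' = _utf8'ABC' COLLATE utf8_general_ci;  -- returns 1

-- A case-sensitive (binary) collation does not, so UPPER() before hashing
-- would wrongly reject 'abc' when 'ABC' is already stored:
SELECT _utf8'abc' = _utf8'ABC' COLLATE utf8_bin;         -- returns 0
```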