Large database (billions of rows) with minimal writes, best setup for fast searches?

Hi guys!

The database in question is a storage/search database for md5 hashes with values. There are two columns per table: one for the md5 hash (32 characters long) and one for the value. The md5 hash field has a UNIQUE index. Once a row has been added, it will likely never be updated or deleted.
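Just to illustrate what one of these tables looks like (the table name and column types here are my own placeholders, not the real schema):

CREATE TABLE hashes_01 (
    hash  CHAR(32) NOT NULL,      -- the md5 hash, 32 characters
    value VARCHAR(255) NOT NULL,  -- the value stored for that hash
    UNIQUE KEY (hash)             -- the unique index used for lookups
) ENGINE=MyISAM;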

Currently it has roughly 260 million rows altogether, spread over 40 tables with around 5 million rows per table… the index files average 220-250 MB per table and load into memory very quickly. The tables are MyISAM.

The hardware is a dual Xeon 2.8 GHz with 2 GB of RAM, but it will be upgraded as needed.

Checking a random hash takes approximately 0.7 seconds in the best case, though the average is more often around 1.2 seconds. I’m very satisfied with that speed, although I believe it can be faster. It obviously becomes faster per hash the more hashes I check at a time.

Now to my dilemma… I aim to expand the database by at least another 2 billion rows during the next week. The way it is currently set up is perhaps acceptable when checking large lists of hashes, even if it grows to hundreds of tables… but I wish to maintain speed for checking single hashes as well.

So, it needs a re-design! My current thought is to sort all hashes into tables whose names correspond to the first two letters of a hash. That would result in 1296 tables in total… a-z0-9 (36×36). When checking a hash like:

00002059f54853e4c21e290bc1a0d7a9

It would then check table 00 for the hash, which would be significantly faster than traversing hundreds of tables and billions of rows to find the match.
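In code it would boil down to picking the table from those first two characters and running a single query against it, something like this (column names just for illustration):

-- the application takes the first two characters of the hash ('00' here)
-- and queries only that one table
SELECT value
FROM `00`
WHERE hash = '00002059f54853e4c21e290bc1a0d7a9';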

This setup would also allow the database to scale upwards without size impacting much upon lookup speed.

What I’m concerned about however are two things:

#1 That while the change in structure will maintain lookup speed for single hashes (or, more likely, speed those up), checking large lists of hashes (1000+) would take much longer. I can’t back this concern up with any numbers though, so I suppose it could be entirely unfounded…

#2 That there’ll be a performance hit because the number of tables will exceed 1000. I’ve read numerous sites where people warn against having that many tables…

Would I be even better off splitting those 1296 tables into 36 databases instead, one for each beginning letter (a, b, c, d, etc.)… with 36 tables per database?
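If I went that route, the lookup itself would hardly change, just a database qualifier in front of the table name (again, names only for illustration):

SELECT value
FROM `0`.`00`
WHERE hash = '00002059f54853e4c21e290bc1a0d7a9';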

It’s in the plans to add two SCSI 320 drives in a RAID config, giving up to around 500 MB/sec of transfer speed… along with upping the RAM to 8 GB or more. So that’s to be kept in mind while finding the most scalable design here.

What are your thoughts on this?

I would really like to do this right, since it’ll be a pain to change it all later if there’s a better method.

Thanks and nice to be here!

Felix

My first thoughts are actually the same.

But don’t MD5 hashes consist of hexadecimal values? Each pair of hex digits would be 0-255 (00-FF).

You would then have a lookup table with 16 columns, plus a column with the word you want to look up.
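Roughly a table like this, just to sketch the idea (the types are a guess, and I’ve left out columns 3 to 15, which follow the same pattern):

CREATE TABLE lookup (
    col1  TINYINT UNSIGNED NOT NULL,  -- first byte of the hash, 0-255
    col2  TINYINT UNSIGNED NOT NULL,  -- second byte
    -- … col3 through col15 the same …
    col16 TINYINT UNSIGNED NOT NULL,  -- last byte
    word  VARCHAR(255) NOT NULL       -- the word you want back
);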

Each of the 16 columns holds a value between 0 and 255. When you do a lookup, you take the md5 hash and convert each two-character part to a decimal value, so you get 16 parts:

// convert each pair of hex digits (00-FF) to an integer 0-255
$part_1 = hexdec(substr($hash, 0, 2));
$part_2 = hexdec(substr($hash, 2, 2));
$part_3 = hexdec(substr($hash, 4, 2));

(That’s PHP; continue the same way up to $part_16.)

Then in your table you match each part with the right column:

SELECT word FROM lookup WHERE col1 = $part_1 AND col2 = $part_2 … AND col16 = $part_16;

I really have no idea if it works. Maybe the hexdec conversion takes too long. Maybe matching the 16 columns takes too long. I don’t know. Just my first thoughts.

It’s a 32-char string of hexadecimal numbers… and your approach sounds unnecessarily complex and would take too long, I’m afraid.

I’m not looking for some sort of brute-forcing algorithm. This is a very simple md5<>value matchup. The values are already in the database, and all the program does is check whether the md5 hash exists and, if there’s a value for it, return it. The only question is how to structure the database/tables for the best scalability so as not to impact lookup speed.

But thank you anyway for your thoughts.