Hi guys!
The database in question is a storage/search database for MD5 hashes with associated values. There are two columns per table: one for the MD5 hash (32 hexadecimal characters) and one for the value. The hash column has a UNIQUE index. Once a row has been added, it will likely never be updated or deleted.
Currently it holds roughly 260 million rows altogether, spread over 40 tables, so about 6.5 million rows per table… the index files average 220-250 MB per table and load into memory very quickly. The tables are MyISAM.
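Concretely, each table looks roughly like this (names are simplified here, and the value column's exact type isn't important):

```sql
-- Sketch of one of the 40 current tables (names and value type simplified)
CREATE TABLE hashes_01 (
    md5   CHAR(32) NOT NULL,      -- hex MD5 digest
    value VARCHAR(255) NOT NULL,  -- associated value; actual type may differ
    UNIQUE KEY idx_md5 (md5)      -- the indexed, unique hash column
) ENGINE=MyISAM;
```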
The hardware is a dual Xeon 2.8 GHz box with 2 GB of RAM, but it will be upgraded as needed.
Checking a random hash takes approximately 0.7 seconds in the best case, though the average is more often around 1.2 seconds. I'm very satisfied with that speed, although I believe it can be faster. It obviously gets faster per hash the more hashes I check at a time.
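To make the access pattern concrete: nothing ties a hash to a particular table right now, so a single check has to probe the tables one by one, while a list can be covered with one IN() probe per table, which is why the per-hash cost drops. Roughly:

```sql
-- Single hash: probe hashes_01, hashes_02, ... until a row comes back
SELECT value
FROM hashes_01
WHERE md5 = '00002059f54853e4c21e290bc1a0d7a9';

-- Batch: one IN() probe per table covers the whole list,
-- amortizing the per-query overhead across many hashes
SELECT md5, value
FROM hashes_01
WHERE md5 IN ('00002059f54853e4c21e290bc1a0d7a9',
              '0cc175b9c0f1b6a831c399e269772661');
```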
Now to my dilemma… I aim to expand the database with at least another 2 billion rows during the next week. The way it's currently set up is perhaps acceptable for checking large lists of hashes, even if it grows to hundreds of tables… but I want to maintain speed for checking single hashes as well.
So, it needs a redesign! My current thought is to sort all hashes into tables whose names correspond to the first two characters of the hash. Since MD5 hashes are hexadecimal (0-9, a-f), that gives 256 tables in total (16×16). When checking a hash like:
00002059f54853e4c21e290bc1a0d7a9
It would then check table 00 for the hash, which would be significantly faster than traversing hundreds of tables and billions of rows to find the match.
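In practice the application would just derive the table name from those first two characters, so every lookup becomes a single indexed probe (a sketch, assuming tables named hashes_00 through hashes_ff):

```sql
-- LEFT('00002059f54853e4c21e290bc1a0d7a9', 2) = '00',
-- so the lookup goes straight to that one table
SELECT value
FROM hashes_00
WHERE md5 = '00002059f54853e4c21e290bc1a0d7a9';
```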
This setup would also let the database scale upward without its size having much impact on lookup speed.
What I'm concerned about, however, are two things:
#1 That while the change in structure will maintain lookup speed for single hashes (or more likely improve it), checking large lists of hashes (1000+) might take much longer. I can't back this concern up with any numbers, though, so it could be entirely unfounded… (see the sketch after these two points for how I imagine keeping the batch case fast)
#2 That there'll be a performance hit simply from having that many tables. I've read numerous sites where people warn against databases with very large table counts…
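On #1, my thinking is the batch case could stay cheap if the application buckets the incoming list by two-character prefix and issues one IN() query per affected table, something like this (the second hash value below is a dummy for illustration):

```sql
-- All hashes from the input list that share the '0c' prefix,
-- checked in a single probe against just that table
SELECT md5, value
FROM hashes_0c
WHERE md5 IN ('0cc175b9c0f1b6a831c399e269772661',
              '0c00000000000000000000000000cafe');
-- Even a 1000-hash list needs at most 256 such probes, one per prefix.
```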
Would I be even better off splitting those 256 tables into 16 databases instead, one for each first hex character (0-9, a-f), with 16 tables per database?
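For what it's worth, MySQL lets a query qualify a table with its database name, so as I understand it that split would be mostly organizational; the lookup itself barely changes (database names here are assumptions):

```sql
-- First hex character picks the database, first two pick the table
SELECT value
FROM hashdb_0.hashes_00
WHERE md5 = '00002059f54853e4c21e290bc1a0d7a9';
```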
It's in the plans to add two Ultra-320 SCSI drives in a RAID config, giving up towards 500 MB/sec of transfer speed… along with upping the RAM to 8 GB or more. So that's to be kept in mind while finding the best scalable design here.
What are your thoughts on this?
I would really like to do this right, since it’ll be a pain to change it all later if there’s a better method.
Thanks and nice to be here!
Felix