Hi,
After asking this question to the publishers, they suggested I post it here, so here it goes:
I tried the hash indexing techniques mentioned in Chapter 5 (Indexing for high performance) of “High Performance MySQL” by Schwartz, Zaitsev and Tkachenko.
Below is a description of what I tried, my setup and my results. I hope you can comment on that and let me know if I missed anything, misunderstood anything or did anything wrong.
I have a table of (several millions of) urls which is growing a bit large (a few GiB) and I figured using a hash index on the urls might decrease this size.
This is the original create code of my table (without hash index):
CREATE TABLE urls
(
Id
INT(10) UNSIGNED NOT NULL AUTO_INCREMENT, URL
TEXT NOT NULL, PRIMARY KEY (Id
), UNIQUE INDEX Url
(URL
(255)) USING BTREE
)
COLLATE=‘utf8_general_ci’
ENGINE=InnoDB
For a table with 1 million urls in it (I experimented with just a subset of my complete urls table), this results in a table of about 240 MiB.
Using a hash index on the url column, as all suggested and explained in Chapter 5:
CREATE TABLE pseudohash
(
id
INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
url
VARCHAR(255) NOT NULL,
url_crc
INT(10) UNSIGNED NOT NULL DEFAULT ‘0’,
PRIMARY KEY (id
),
INDEX UrlHash
(url_crc
)
)
COLLATE=‘utf8_general_ci’
ENGINE=InnoDB
I managed to basically half the size of the table. The same 1 million urls in this table result in a table of about 120 MiB.
For completeness’ sake, I copy the trigger I used before inserting into this table (but it’s basically directly taken from Chapter 5 as well):
SET @OLDTMP_SQL_MODE=@@SQL_MODE, SQL_MODE=‘STRICT_TRANS_TABLES,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION’;
DELIMITER //
CREATE TRIGGER pseudohash_crc_ins
BEFORE INSERT ON pseudohash
FOR EACH ROW BEGIN SET NEW.url_crc=crc32(NEW.url); END// DELIMITER ; SET SQL_MODE=@OLDTMP_SQL_MODE;
So I was very happy about significantly reducing storage space for this table. But then I decided to compare lookup times for both tables.
I took the first 10.000 rows from both tables (the content of both tables are identical, apart from the url_crc column) and measured the time it took to look them up one by one, using a simple ‘SELECT id FROM urls WHERE url=’$url’'. Obviously, because in the second case, there is no index on url, I needed an extra clause, adding up to: ‘SELECT id FROM pseudohash WHERE url_crc=CRC32(($url)) AND url=($url)’.
Doing this in a script on my system (not really relevant, since it’s the relative difference I’m interested in, but in any case: Windows 8.1 64-bit, 3.4 GhZ processor, 12 GB memory) resulted in the following:
Looking up these 10k urls in the table with normal/b-tree index: roughly 17 seconds.
Looking up the same 10k urls in the table with hash index: roughly 1 minute and 18 seconds.
Though significantly reducing storage space, this performance issue renders this implementation of hash index pretty much useless in my case. So my question is:
Is this extra time just due to the fact that this extra operation has to be done (‘WHERE url_crc=CRC32($url)…’) and is this a known drawback of this hash index approach? Or am I missing something or did I do anything stupid to unnecessarily slow it down?
Hope you can answer my question. If there is any additional information you need in order to be able to answer my question, I would be happy to provide you with that. Please let me know!
cheers,
peter