hash index lookup

sinterklaas · March 26, 2014, 2:44am

Hi,
After asking this question to the publishers, they suggested I post it here, so here it goes:

I tried the hash indexing techniques mentioned in Chapter 5 (Indexing for high performance) of “High Performance MySQL” by Schwartz, Zaitsev and Tkachenko.
Below is a description of what I tried, my setup and my results. I hope you can comment on that and let me know if I missed anything, misunderstood anything or did anything wrong.

I have a table of (several millions of) urls which is growing a bit large (a few GiB) and I figured using a hash index on the urls might decrease this size.
This is the original create code of my table (without hash index):
CREATE TABLE urls (
Id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT, URL TEXT NOT NULL, PRIMARY KEY (Id), UNIQUE INDEX Url (URL(255)) USING BTREE
)
COLLATE=‘utf8_general_ci’
ENGINE=InnoDB

For a table with 1 million urls in it (I experimented with just a subset of my complete urls table), this results in a table of about 240 MiB.

Using a hash index on the url column, as all suggested and explained in Chapter 5:

CREATE TABLE pseudohash (

id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,

url VARCHAR(255) NOT NULL,

url_crc INT(10) UNSIGNED NOT NULL DEFAULT ‘0’,

PRIMARY KEY (id),

INDEX UrlHash (url_crc)

)

COLLATE=‘utf8_general_ci’

ENGINE=InnoDB

I managed to basically half the size of the table. The same 1 million urls in this table result in a table of about 120 MiB.
For completeness’ sake, I copy the trigger I used before inserting into this table (but it’s basically directly taken from Chapter 5 as well):
SET @OLDTMP_SQL_MODE=@@SQL_MODE, SQL_MODE=‘STRICT_TRANS_TABLES,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION’;
DELIMITER //
CREATE TRIGGER pseudohash_crc_ins BEFORE INSERT ON pseudohash FOR EACH ROW BEGIN SET NEW.url_crc=crc32(NEW.url); END// DELIMITER ; SET SQL_MODE=@OLDTMP_SQL_MODE;

So I was very happy about significantly reducing storage space for this table. But then I decided to compare lookup times for both tables.
I took the first 10.000 rows from both tables (the content of both tables are identical, apart from the url_crc column) and measured the time it took to look them up one by one, using a simple ‘SELECT id FROM urls WHERE url=’$url’'. Obviously, because in the second case, there is no index on url, I needed an extra clause, adding up to: ‘SELECT id FROM pseudohash WHERE url_crc=CRC32(($url)) AND url=($url)’.

Doing this in a script on my system (not really relevant, since it’s the relative difference I’m interested in, but in any case: Windows 8.1 64-bit, 3.4 GhZ processor, 12 GB memory) resulted in the following:
Looking up these 10k urls in the table with normal/b-tree index: roughly 17 seconds.
Looking up the same 10k urls in the table with hash index: roughly 1 minute and 18 seconds.

Though significantly reducing storage space, this performance issue renders this implementation of hash index pretty much useless in my case. So my question is:
Is this extra time just due to the fact that this extra operation has to be done (‘WHERE url_crc=CRC32($url)…’) and is this a known drawback of this hash index approach? Or am I missing something or did I do anything stupid to unnecessarily slow it down?

Hope you can answer my question. If there is any additional information you need in order to be able to answer my question, I would be happy to provide you with that. Please let me know!

cheers,
peter

Topic		Replies	Views
Large database (billions of rows) with minimal writes, best setup for fast searches? Other MySQL® Questions	2	574	April 30, 2008
Need Faster query results Other MySQL® Questions	5	490	July 17, 2009
Indexing Other MySQL® Questions	2	412	September 10, 2009
index query on table with 14mm+ (and growing) records takes more than 2 seconds Other MySQL® Questions	10	745	September 7, 2009
faster not using Index???? Other MySQL® Questions	2	376	January 18, 2007

hash index lookup

Related topics