
hash index lookup

Hi,
After asking this question to the publishers, they suggested I post it here, so here it goes:

I tried the hash indexing techniques mentioned in Chapter 5 (Indexing for high performance) of "High Performance MySQL" by Schwartz, Zaitsev and Tkachenko.
Below is a description of what I tried, my setup and my results. I hope you can comment on that and let me know if I missed anything, misunderstood anything or did anything wrong.

I have a table of (several millions of) urls which is growing a bit large (a few GiB) and I figured using a hash index on the urls might decrease this size.
This is the original create code of my table (without hash index):
CREATE TABLE `urls` (
  `Id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
  `URL` TEXT NOT NULL,
  PRIMARY KEY (`Id`),
  UNIQUE INDEX `Url` (`URL`(255)) USING BTREE
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB

For a table with 1 million urls in it (I experimented with just a subset of my complete urls table), this results in a table of about 240 MiB.

I then used a hash index on the url column, as suggested and explained in Chapter 5:

CREATE TABLE `pseudohash` (
  `id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
  `url` VARCHAR(255) NOT NULL,
  `url_crc` INT(10) UNSIGNED NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`),
  INDEX `UrlHash` (`url_crc`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB

I managed to roughly halve the size of the table: the same 1 million urls in this table result in a table of about 120 MiB.
For completeness' sake, here is the trigger I used for inserts into this table (it's basically taken directly from Chapter 5 as well):
SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='STRICT_TRANS_TABLES,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION';
DELIMITER //
CREATE TRIGGER `pseudohash_crc_ins` BEFORE INSERT ON `pseudohash`
FOR EACH ROW BEGIN
  SET NEW.url_crc = CRC32(NEW.url);
END//
DELIMITER ;
SET SQL_MODE=@OLD_SQL_MODE;
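As a quick sanity check of the trigger, an insert followed by a lookup should show url_crc being filled in automatically (the url value here is just a made-up example):

```sql
-- The BEFORE INSERT trigger should populate url_crc, so no value is supplied here
INSERT INTO pseudohash (url) VALUES ('http://www.example.com/');

-- url_crc should now equal CRC32('http://www.example.com/')
SELECT id, url, url_crc FROM pseudohash
WHERE url_crc = CRC32('http://www.example.com/')
  AND url = 'http://www.example.com/';
```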

So I was very happy about significantly reducing storage space for this table. But then I decided to compare lookup times for both tables.
I took the first 10,000 rows from both tables (the contents of both tables are identical, apart from the url_crc column) and measured the time it took to look them up one by one, using a simple 'SELECT id FROM urls WHERE url='$url''. Because there is no index on url in the second table, the lookup needs an extra clause, adding up to: 'SELECT id FROM pseudohash WHERE url_crc=CRC32('$url') AND url='$url''.
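In case it helps, the query plans of the two lookups could be compared with EXPLAIN (the url literal below is just an example):

```sql
-- Plan for the b-tree lookup; should show key = Url
EXPLAIN SELECT id FROM urls
WHERE url = 'http://www.example.com/';

-- Plan for the hash lookup; should show key = UrlHash
-- (if key is NULL, the index is not being used at all)
EXPLAIN SELECT id FROM pseudohash
WHERE url_crc = CRC32('http://www.example.com/')
  AND url = 'http://www.example.com/';
```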

Doing this in a script on my system (not really relevant, since it's the relative difference I'm interested in, but in any case: Windows 8.1 64-bit, 3.4 GHz processor, 12 GB memory) resulted in the following:
Looking up these 10k urls in the table with normal/b-tree index: roughly 17 seconds.
Looking up the same 10k urls in the table with hash index: roughly 1 minute and 18 seconds.

Though it significantly reduces storage space, this performance issue renders this hash-index implementation pretty much useless in my case. So my question is:
Is this extra time simply due to the extra operation that has to be done ('WHERE url_crc=CRC32($url)...'), and is that a known drawback of this hash-index approach? Or am I missing something, or did I do something stupid that unnecessarily slows it down?

Hope you can answer my question. If there is any additional information you need in order to be able to answer my question, I would be happy to provide you with that. Please let me know!

cheers,
peter
