Hi all! (Please be kind with my poor english…)
THANK YOU for this blog! :o)
I am developing on my spare time a little CMS storing data in a hierarchical manner. After googling and testing a while, I end up using the “materialized path” solution despite its unefficient storage requirement.
Here is the (simplified) tree table:
create table site_node ( id MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT, # — path VARBINARY(32) NOT NULL, children SMALLINT UNSIGNED NOT NULL, parent MEDIUMINT UNSIGNED NOT NULL, alias VARBINARY(32) NOT NULL, # — PRIMARY KEY (id), UNIQUE INDEX ak_path (path), UNIQUE INDEX ak_parent (parent, alias), FOREIGN KEY fk_parent (parent) REFERENCES site_node (id) ON UPDATE CASCADE ON DELETE RESTRICT)ENGINE=InnoDB;
“parent” field is used to select direct children (faster than using path index).
“children” field contains the number of direct children (0 => leaf node).
“path” field contains the node path encoded in hexadecimal, each level is 4 bytes long:
mysql> SELECT id, parent, children, path FROM site_node WHERE path >= “0001” AND path < “0002” ORDER BY path LIMIT 20;±-------±-------±---------±-----------------+| id | parent | children | path |±-------±-------±---------±-----------------+| 2 | 1 | 521 | 0001 || 102 | 2 | 348 | 00010001 || 5128 | 102 | 133 | 000100010001 || 258090 | 5128 | 0 | 0001000100010001 || 261033 | 5128 | 0 | 0001000100010002 || 263014 | 5128 | 0 | 0001000100010003 || 269425 | 5128 | 0 | 0001000100010004 || 272585 | 5128 | 0 | 0001000100010005 || 283558 | 5128 | 0 | 0001000100010006 || 284133 | 5128 | 0 | 0001000100010007 || 284925 | 5128 | 0 | 0001000100010008 || 293219 | 5128 | 0 | 0001000100010009 || 309769 | 5128 | 0 | 000100010001000A || 311042 | 5128 | 0 | 000100010001000B || 316655 | 5128 | 0 | 000100010001000C || 337223 | 5128 | 0 | 000100010001000D || 339034 | 5128 | 0 | 000100010001000E || 346686 | 5128 | 0 | 000100010001000F || 348222 | 5128 | 0 | 0001000100010010 || 363623 | 5128 | 0 | 0001000100010011 |±-------±-------±---------±-----------------+20 rows in set (0.00 sec)
As expected, retrieving or counting all children from a node is trivial:
Cold:mysql> SELECT COUNT() FROM site_node WHERE path >= “0001” AND path < “0002” ORDER BY path;±---------+| COUNT() |±---------+| 601591 |±---------+1 row in set (13.07 sec)Warm:mysql> SELECT COUNT() FROM site_node WHERE path >= “0001” AND path < “0002” ORDER BY path;±---------+| COUNT() |±---------+| 601591 |±---------+1 row in set (0.55 sec)
Here is my problem:
when I need to move or shift a node, I must lock all affected nodes before updating them. For instance, before shifting node id #2 from rank 1 to rank 2 and updating 601591 rows, I lock the whole subtree :
Note: should be path < “0003” but you get the idea.
mysql> EXPLAIN SELECT COUNT(*) FROM site_node WHERE path >= “0001” AND path < “0002” ORDER BY path FOR UPDATE;±—±------------±----------±------±--------------±--------±--------±-----±-------±-------------------------+| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |±—±------------±----------±------±--------------±--------±--------±-----±-------±-------------------------+| 1 | SIMPLE | site_node | range | ak_path | ak_path | 32 | NULL | 902702 | Using where; Using index |±—±------------±----------±------±--------------±--------±--------±-----±-------±-------------------------+1 row in set (0.00 sec)
Index range scan should be fast (Handler_read_next going up) but it is not :
SELECT FOR UPDATE:mysql> SELECT COUNT() FROM site_node WHERE path >= “0001” AND path < “0002” ORDER BY path FOR UPDATE;±---------+| COUNT() |±---------+| 601591 |±---------+1 row in set (20 min 41.33 sec)Note: using SELECT COUNT(id) does not improve anything.Innodb output:LIST OF TRANSACTIONS FOR EACH SESSION:—TRANSACTION 0 36156404, ACTIVE 1208 sec, process no 20386, OS thread id 1448221616 fetching rows, thread declared inside InnoDB 136mysql tables in use 1, locked 126110 lock struct(s), heap size 1944896MySQL thread id 629, query id 30002552 localhost root Sending dataSELECT COUNT(*) FROM site_node WHERE path >= “0001” AND path < “0002” ORDER BY path FOR UPDATEBUFFER POOL AND MEMORY----------------------Total memory allocated 305426468; in additional pool allocated 1682688Buffer pool size 16384Free buffers 0Database pages 16289Modified db pages 1Pending reads 1Pending writes: LRU 0, flush list 0, single page 0Pages read 1367173, created 54631, written 605910228.55 reads/s, 0.00 creates/s, 0.00 writes/sBuffer pool hit rate 883 / 1000ROW OPERATIONS--------------1 queries inside InnoDB, 0 queries in queueMain thread process no. 20386, id 1439226800, state: waiting for server activityNumber of rows inserted 5000000, updated 10000000, deleted 0, read 725714210.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 598.03 reads/s
What I do not understand is that using a read lock is much faster:
SELECT LOCK IN SHARE MODE:mysql> SELECT COUNT() FROM site_node WHERE path >= “0001” AND path < “0002” ORDER BY path LOCK IN SHARE MODE;±---------+| COUNT() |±---------+| 601591 |±---------+1 row in set (13.57 sec)
After scratching my head a while, I tried this:
Full index scan:mysql> EXPLAIN SELECT COUNT() FROM site_node USE INDEX(PRIMARY) WHERE id != 0 AND path >= “0001” AND path < “0002” ORDER BY path FOR UPDATE;±—±------------±----------±------±--------------±--------±--------±-----±--------±------------+| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |±—±------------±----------±------±--------------±--------±--------±-----±--------±------------+| 1 | SIMPLE | site_node | range | PRIMARY | PRIMARY | 3 | NULL | 2501352 | Using where |±—±------------±----------±------±--------------±--------±--------±-----±--------±------------+1 row in set (0.04 sec)mysql> SELECT COUNT() FROM site_node USE INDEX(PRIMARY) WHERE id != 0 AND path >= “0001” AND path < “0002” ORDER BY path FOR UPDATE;±---------+| COUNT(*) |±---------+| 601591 |±---------+1 row in set (23.97 sec)Yep! Much better. :o|
And this:
Full table scan:mysql> EXPLAIN SELECT COUNT() FROM site_node USE INDEX(PRIMARY) WHERE path >= “0001” AND path < “0002” ORDER BY path FOR UPDATE;±—±------------±----------±-----±--------------±-----±--------±-----±--------±------------+| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |±—±------------±----------±-----±--------------±-----±--------±-----±--------±------------+| 1 | SIMPLE | site_node | ALL | NULL | NULL | NULL | NULL | 5002703 | Using where |±—±------------±----------±-----±--------------±-----±--------±-----±--------±------------+1 row in set (0.00 sec)mysql> SELECT COUNT() FROM site_node USE INDEX(PRIMARY) WHERE path >= “0001” AND path < “0002” ORDER BY path FOR UPDATE;±---------+| COUNT(*) |±---------+| 601591 |±---------+1 row in set (20.21 sec)Sigh… Better again…
The “site_node” table contains 5 millions rows and is arround 1Go (Data 396,288 Mo, Index 525,392 Mo, Total 921,680 Mo).
MySQL is running on a tiny box:
- CPU 3Ghz
- RAM 512 Mo
- HDD 80Go (IDE, 55 Mo/s)
skip-networkingtable_cache = 256innodb_file_per_table = 1innodb_buffer_pool_size = 256M (I cannot raise it up…)innodb_additional_mem_pool_size = 8Minnodb_log_file_size = 256Minnodb_log_buffer_size = 8Minnodb_flush_log_at_trx_commit = 2
Could it be because ak_path index does not fit in buffer pool?
Note: update speed is quite normal (15 000 - 20 000 rows/s).
I have also tried to use “path” field as PRIMARY KEY hoping that it would be faster using a clustered index but in this case total index size grows up and update speed is slower because of B-TREE rebalancing.
Thank you for your help!
PS: No, I do not plan to use nested sets… ;o) (not yet!)