Ok, the challenge: i have two databases here, which store a tree of content elements (which means nodes, their relations and associated attributes of different types). Both use hex-‘hash’ values to identify the nodes and rely on nested sets for retrieval of ancestors/descendants.
First DB layout is pretty simple, as there’s no way to use multiple parents (i will strip most irrelevant columns):
Nodes table (MyISAM):±-----------------±--------------------------------------------------±-----±----±--------±------+| Field | Type | Null | Key | Default | Extra |±-----------------±--------------------------------------------------±-----±----±--------±------+| inode | varchar(40) | NO | PRI | | | … | parent_inode | varchar(40) | YES | MUL | NULL | || top_parent_inode | varchar(40) | NO | MUL | | | …| obj_type | varchar(255) | YES | MUL | NULL | || text_id | varchar(255) | YES | | NULL | || pos | int(11) | YES | | NULL | || order | enum(‘pos_asc’,‘pos_desc’,‘name_asc’,‘name_desc’) | YES | | pos_asc | || lft | int(10) unsigned | NO | MUL | 0 | || rgt | int(10) unsigned | NO | MUL | 0 | || level | mediumint(8) unsigned | NO | | 0 | |±-----------------±--------------------------------------------------±-----±----±--------±------+Attributes table (MyISAM):±------------±-----------------------------------------±-----±----±---------------±------+| Field | Type | Null | Key | Default | Extra |±------------±-----------------------------------------±-----±----±---------------±------+| inode | varchar(40) | NO | PRI | | || key | varchar(255) | NO | PRI | | || type | enum(‘text’,‘blob’,‘int’,‘undefined’) | YES | MUL | undefined | | …| value | longblob | YES | | NULL | |±------------±-----------------------------------------±-----±----±---------------±------+
The second layout is somewhat more complicated, because nodes can be children in form of ‘softlinks’ and thus appear several times in the tree, in which case the b_primary in the hierarchy table is FALSE:
Nodes Table (InnoDB):±-----------------±---------------------±-----±----±--------------------±------+| Field | Type | Null | Key | Default | Extra |±-----------------±---------------------±-----±----±--------------------±------+| uuid | char(32) | NO | PRI | | || fk_nodetype | varchar(40) | NO | | node | || s_label | varchar(150) | NO | MUL | | || s_name | varchar(50) | NO | | | | …±-----------------±---------------------±-----±----±--------------------±------+Attributes table (InnoDB):±-----------------±-------------±-----±----±--------±------+| Field | Type | Null | Key | Default | Extra |±-----------------±-------------±-----±----±--------±------+| fk_node | char(32) | NO | PRI | 0 | || fk_attributename | varchar(100) | NO | PRI | | | …| m_content | longblob | YES | | NULL | |±-----------------±-------------±-----±----±--------±------+Hierarchy information table (InnoDB):±-----------------±---------------------±-----±----±--------±------+| Field | Type | Null | Key | Default | Extra |±-----------------±---------------------±-----±----±--------±------+| fk_parent | char(32) | NO | PRI | | || fk_child | char(32) | NO | PRI | | || b_primary | enum(‘TRUE’,‘FALSE’) | NO | | TRUE | || n_order | int(10) unsigned | YES | | NULL | || n_left | bigint(20) unsigned | YES | | NULL | || n_right | bigint(20) unsigned | YES | | NULL | || n_level | int(10) unsigned | YES | | NULL | || b_positionlocked | enum(‘TRUE’,‘FALSE’) | NO | | FALSE | |±-----------------±---------------------±-----±----±--------±------+
In both cases, the ‘inode’ resp. ‘uuid’ columns store a primary key with a 32 characters UUID, which is used for foreign key relations to other tables.
The main problem is, that the nested sets’ cost becomes unbearable relatively quickly. Every time an node is inserted/deleted all following entries have to be updated, moving a node is even worse as it includes updating left/right values separately and has to use custom locking to work right, resulting in 5+ queries for one operation. As several of these procedures are executed sequentially in transactions when necessary in the latter case, this can lead to several seconds someone has to wait if a simple subtree (hierarchy of 20 nodes, e.g.) is created/copied/moved. And we are talking of ~100.000 nodes here, i can only dream of how performance would be with 1.000.000+ rows.
So here are my ideas how to optimize the layouts and which i want to discuss here:
Simple idea (but much effort to revise the queries perhaps): to reduce the join efforts (there are several other tables connected via the UUID), i could remove the UUID as FK relations, and use a INT auto_increment instead (but keep the UUIDs in the node table). But i don’t know if this will really help much when joining the attributes or other tables with the primary table holding the nodes.
Complex idea (but not sure if it really makes sense): to remove a significant portion of the nested set updating, it might be sensible to move the left, right, level data into a separate MEMORY table and only leave the adjacency list info (parent, child, order) in the InnoDB table. This increases the update speed from several hundred milliseconds to ~0 milliseconds. BUT, it also means the nested set structure has to be rebuilt after server restart, is not anymore transaction safe (in the latter layout), and it requires even more joins to retrieve all required data in several cases.
After all, the adjacency data is the most ‘relevant’, as it holds all relevant hierarchy data, is better human-readable, and less error-prone/can be changed easily manually. However, if you have to crawl a whole subtree of several thousand nodes recursively to gain information about nodes (how many foo-subnodes are there? is there a node the user has read permission to?) it’s extremely costly. I’m searching for the ‘almost perfect’ sybiosis of the two concepts.
So, what are your opinions on this? Is there some kind of established ‘best practice’ with nested sets (or better hierarchies) and data bases i haven’t found yet? The Problem with slow updates at the cost of fast (descendant-)queries is off course self-chosen, but how can it be made as efficient as possible? Someone must have run into the same questions, as nested sets are quite popular, and are surely not only used for 50 entry menu structures. There are other concepts which all have their drawbacks (fraction-something, materialized path, pre/post), but perhaps you would recommend something completely different?