Concurrency and lock issues in `INSERT ... SELECT ...` - TokuDB

Hi,

The system my team develops has many huge tables in MySQL databases. Since our clients need faster responses from our application, we are testing alternatives to the MyISAM engine, our default engine.

We are testing the TokuDB engine. After a few attempts, we stumbled on timeout issues, which we fixed by changing the tokudb_lock_timeout configuration variable. Although the timeout problem was fixed, queries of the form INSERT ... <insert_table> SELECT ... <select_tables> started to queue up, stuck on query locks, even though <insert_table> is different for every query in the queue. Queries of the form SELECT ... <select_tables>, however, do not get stuck. We have tried some fine-tuning in the configuration file, but the problem persists.
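
To make the pattern concrete, this is roughly the shape of the workload (table and column names here are illustrative, not our real schema):

-- session 1
INSERT INTO report_a (id, val)
  SELECT id, val FROM source_data WHERE day = '2017-01-01';

-- session 2: different target table, same source table, yet it queues behind session 1
INSERT INTO report_b (id, val)
  SELECT id, val FROM source_data WHERE day = '2017-01-02';

-- plain reads do not get stuck
SELECT id, val FROM source_data WHERE day = '2017-01-03';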

Here is a reproduction guide: [URL]https://gist.github.com/susanoobit/33e49cdfa80e66fd7392e755acf716b7[/URL]
Our configuration file (…/percona-server.conf.d/mysqld.cnf): [URL]https://gist.github.com/susanoobit/e8610687247aaeab48c5911e8f161958[/URL]
Our test server has 62G RAM, 27G swap, and an Intel(R) Xeon(R) E5-2620 v3 @ 2.40GHz CPU.

Searching for a solution, we found that TokuDB has exclusive read locks, according to this document: Transactions and Concurrency · percona/tokudb-engine Wiki · GitHub.
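
In case it is useful, the blocking can be inspected while the queries are queued, something like this (a sketch, assuming the TokuDB information_schema plugins are loaded in this Percona Server build):

-- which TokuDB transactions hold locks, and who is waiting on whom
SELECT * FROM information_schema.TOKUDB_TRX;
SELECT * FROM information_schema.TOKUDB_LOCKS;
SELECT * FROM information_schema.TOKUDB_LOCK_WAITS;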

Are the exclusive read locks the actual problem?

Thanks in advance.

Yes, that is very possible. Many years back, before Percona acquired Tokutek, TokuDB had shared read locks. For a reason unknown to me and to the more recent Tokutek developers, that changed to exclusive read locks. We have considered investigating why, and whether it is possible to go back to shared read locks without harming write performance, but that is not likely to happen any time in the near future.

Keep in mind that TokuDB is a write-optimized storage engine, not a read-optimized one. It is generally expected to outperform InnoDB only once your dataset grows well beyond available memory: at that point InnoDB performance begins to decline, while TokuDB remains relatively constant through much more dataset growth.

TokuDB is also known to have concurrency issues on a single index/tree, due to the way internal node locking works and to the fact that the default node sizes are much larger than InnoDB's, so when a node is locked it represents a much larger percentage of the overall dataset than an InnoDB node would. This only begins to show as a problem when many clients run DML in parallel against a single table/index.

On modern hardware with SSD storage, the default tree-shape values are not ideal. The defaults are tuned for high-throughput, high-latency I/O (spinning disks in RAID). For SSD we usually recommend a much smaller node size (tokudb_block_size) than the default of 4M, somewhere in the range of 32K-512K, and a much wider fanout (tokudb_fanout) than the default of 16, somewhere in the range of 512-1024. Then there is the 'hard' checkpointer, which wakes every 60 seconds by default. Of course, like everything else, it all really depends on your workload and what you care about most: compression, read performance, write performance, memory usage, CPU usage, disk wear, etc.
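
As a rough illustration only (not a recommendation for any particular workload; the values below simply fall within the ranges mentioned above, and as far as I know block size and fanout only take effect for newly created or rebuilt tables):

[mysqld]
# smaller nodes for SSD (default is 4M)
tokudb_block_size = 65536
# wider tree fanout (default is 16)
tokudb_fanout = 512
# checkpoint interval in seconds (default is 60)
tokudb_checkpointing_period = 60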

Is there any plan to fix this concurrency issue? In my company we are seriously considering TokuDB as the main storage engine for an in-house-built cluster system for time-series data with up to 250TB per node (before compression). So far, write and read performance is far better than expected when querying multiple tables at once; however, we need to ensure consistent read performance when doing concurrent reads from the same table (given that I/O is not a bottleneck).

There are no immediate, short-term plans to address either the exclusive read locks or the internal node lock contention, but these issues are high on the longer-term list of things to address when time and resources permit. You might also want to explore MyRocks for your application, as it offers benefits similar to TokuDB's, though it has some limitations of its own.

We’re considering MyRocks; however, for now TokuDB has been proven in production for our scenario. Under heavy load we get 2 to 4 times more throughput when retrieving sequential subsets thanks to compression, all with less I/O.

We are using temporary tables as a workaround, and it has not shown much throughput decay. Basically: select the data you need from the source table, insert it into a temporary table, then select the data from the temporary table and insert it into the target table.
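
Roughly like this (a sketch with made-up table names):

CREATE TEMPORARY TABLE tmp_subset AS
  SELECT id, val FROM source_data WHERE day = '2017-01-01';

INSERT INTO report_a (id, val)
  SELECT id, val FROM tmp_subset;

DROP TEMPORARY TABLE tmp_subset;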

No, I was wrong. We are actually exporting the data to a file and then loading it into the target table. Sorry, I was confused.
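
So the workaround is more along these lines (again a sketch; the path and table names are made up, and secure_file_priv may restrict where the server is allowed to write the file):

SELECT id, val INTO OUTFILE '/tmp/subset.csv'
  FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
  FROM source_data WHERE day = '2017-01-01';

LOAD DATA INFILE '/tmp/subset.csv'
  INTO TABLE report_a
  FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
  (id, val);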