Percona TokuDB recovery running over 24 hours

Yesterday, we had a situation that presented (at least to my eyes) like a deadlock. A Truncate table (running against an un-contended) table, had been issued, it’s status was “query completed” but never cleared from the process list. Further, user connections to unrelated tables within that schema were unable to connect with “waiting for metadata lock (or similar, didn’t get the screenshot)”.

A first attempt to kill the truncate query, then then the hung queries, resulted in their status changing to “killed” but never had them disappear from the process list.

At this point, even a superuser was unable to login to any schema (a previously opened monitor showed the connection as – waiting for metadata lock).

An orderly shutdown was attempted but after some lengthy period, it looks like the OS kill -9’d mysql. As a bunch of memory unexpectedly seemed pinned, a reboot seemed in order and was done.

Now, for the past 24ish hours, Percona seems to be either replaying (very–slowly) a number of transactions that seem unbelievable or is in a loop of some kind. To give you a sense of the log messages:

2015-08-07 09:01:42 25894 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2015-08-07 09:01:42 25894 [Note] InnoDB: Memory barrier is not used
2015-08-07 09:01:42 25894 [Note] InnoDB: Compressed tables use zlib 1.2.3
2015-08-07 09:01:42 25894 [Note] InnoDB: Using Linux native AIO
2015-08-07 09:01:42 25894 [Note] InnoDB: Using CPU crc32 instructions
2015-08-07 09:01:42 25894 [Note] InnoDB: Initializing buffer pool, size = 20.0G
2015-08-07 09:01:43 25894 [Note] InnoDB: Completed initialization of buffer pool
2015-08-07 09:01:44 25894 [Note] InnoDB: Highest supported file format is Barracuda.
2015-08-07 09:01:44 25894 [Note] InnoDB: The log sequence numbers 551884808204 and 551884808204 in ibdata files do not match the log sequence number 551884808684 in the ib_logfiles!
2015-08-07 09:01:44 25894 [Note] InnoDB: Database was not shutdown normally!
2015-08-07 09:01:44 25894 [Note] InnoDB: Starting crash recovery.
2015-08-07 09:01:44 25894 [Note] InnoDB: Reading tablespace information from the .ibd files...
2015-08-07 09:01:44 25894 [Note] InnoDB: Restoring possible half-written data pages
2015-08-07 09:01:44 25894 [Note] InnoDB: from the doublewrite buffer...
2015-08-07 09:01:44 25894 [Note] InnoDB: starting tracking changed pages from LSN 551884808684
InnoDB: Transaction 122307881 was in the XA prepared state.
InnoDB: 1 transaction(s) which must be rolled back or cleaned up
InnoDB: in total 0 row operations to undo
InnoDB: Trx id counter is 122333952
InnoDB: Last MySQL binlog file position 0 134745064, file name mysqld-bin.000281
2015-08-07 09:01:44 25894 [Note] InnoDB: 128 rollback segment(s) are active.
InnoDB: Starting in background the rollback of uncommitted transactions
2015-08-07 09:01:44 7f7b96bff700 InnoDB: Rollback of non-prepared transactions completed
2015-08-07 09:01:44 25894 [Note] InnoDB: Waiting for purge to start
2015-08-07 09:01:44 25894 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.25-73.1 started; log sequence number 551884808684
Fri Aug 7 09:01:44 2015 TokuFT recovery starting in env /var/lib/mysql/
Fri Aug 7 09:01:45 2015 TokuFT recovery scanning backward from 957963270
Fri Aug 7 09:01:45 2015 TokuFT recovery bw_end_checkpoint at 957917890 timestamp 1438892691965548 xid 957841798 (bw_newer)
Fri Aug 7 09:01:45 2015 TokuFT recovery bw_begin_checkpoint at 957841798 timestamp 1438892662604852 (bw_between)
Fri Aug 7 09:01:45 2015 TokuFT recovery turning around at begin checkpoint 957841798 time 29360696
Fri Aug 7 09:01:45 2015 TokuFT recovery starts scanning forward to 957963270 from 957841798 left 121472 (fw_between)
Fri Aug 7 09:02:09 2015 TokuFT lsn 957930340 commit xid 405185:0 1909760/293672232 1%
Fri Aug 7 09:02:24 2015 TokuFT lsn 957930340 commit xid 405185:0 3742720/293672232 1%
Fri Aug 7 09:02:39 2015 TokuFT lsn 957930340 commit xid 405185:0 5556224/293672232 2%
Fri Aug 7 09:02:54 2015 TokuFT lsn 957930340 commit xid 405185:0 7327744/293672232 2%
Fri Aug 7 09:03:09 2015 TokuFT lsn 957930340 commit xid 405185:0 8884224/293672232 3%
-- snip --
Fri Aug 7 09:21:09 2015 TokuFT lsn 957930340 commit xid 405185:0 150720512/293672232 51%
Fri Aug 7 09:21:24 2015 TokuFT lsn 957930340 commit xid 405185:0 152825856/293672232 52%
Fri Aug 7 09:21:39 2015 TokuFT lsn 957930340 commit xid 405185:0 154894336/293672232 53%

150807 09:21:42 mysqld_safe Transparent huge pages are already set to: never.
150807 09:21:42 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
2015-08-07 09:21:42 0 [Note] /usr/sbin/mysqld (mysqld 5.6.25-73.1-log) starting as process 29206 ...
2015-08-07 09:21:42 29206 [Note] InnoDB: Using atomics to ref count buffer pool pages
2015-08-07 09:21:42 29206 [Note] InnoDB: The InnoDB memory heap is disabled
2015-08-07 09:21:42 29206 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2015-08-07 09:21:42 29206 [Note] InnoDB: Memory barrier is not used
2015-08-07 09:21:42 29206 [Note] InnoDB: Compressed tables use zlib 1.2.3
2015-08-07 09:21:42 29206 [Note] InnoDB: Using Linux native AIO
2015-08-07 09:21:42 29206 [Note] InnoDB: Using CPU crc32 instructions
2015-08-07 09:21:42 29206 [Note] InnoDB: Initializing buffer pool, size = 20.0G
2015-08-07 09:21:43 29206 [Note] InnoDB: Completed initialization of buffer pool
2015-08-07 09:21:44 29206 [Note] InnoDB: Highest supported file format is Barracuda.
2015-08-07 09:21:44 29206 [Note] InnoDB: The log sequence numbers 551884808204 and 551884808204 in ibdata files do not match the log sequence number 551884808694 in the ib_logfiles!
2015-08-07 09:21:44 29206 [Note] InnoDB: Database was not shutdown normally!
2015-08-07 09:21:44 29206 [Note] InnoDB: Starting crash recovery.
2015-08-07 09:21:44 29206 [Note] InnoDB: Reading tablespace information from the .ibd files...
2015-08-07 09:21:44 29206 [Note] InnoDB: Restoring possible half-written data pages
2015-08-07 09:21:44 29206 [Note] InnoDB: from the doublewrite buffer...
2015-08-07 09:21:44 29206 [Note] InnoDB: starting tracking changed pages from LSN 551884808694
InnoDB: Transaction 122307881 was in the XA prepared state.
InnoDB: 1 transaction(s) which must be rolled back or cleaned up
InnoDB: in total 0 row operations to undo
InnoDB: Trx id counter is 122334464
InnoDB: Last MySQL binlog file position 0 134745064, file name mysqld-bin.000281
2015-08-07 09:21:45 29206 [Note] InnoDB: 128 rollback segment(s) are active.
InnoDB: Starting in background the rollback of uncommitted transactions
2015-08-07 09:21:45 7feb457ff700 InnoDB: Rollback of non-prepared transactions completed
2015-08-07 09:21:45 29206 [Note] InnoDB: Waiting for purge to start
2015-08-07 09:21:45 29206 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.25-73.1 started; log sequence number 551884808694
Fri Aug 7 09:21:45 2015 TokuFT recovery starting in env /var/lib/mysql/
Fri Aug 7 09:21:45 2015 TokuFT recovery scanning backward from 957963270
Fri Aug 7 09:21:45 2015 TokuFT recovery bw_end_checkpoint at 957917890 timestamp 1438892691965548 xid 957841798 (bw_newer)
Fri Aug 7 09:21:46 2015 TokuFT recovery bw_begin_checkpoint at 957841798 timestamp 1438892662604852 (bw_between)
Fri Aug 7 09:21:46 2015 TokuFT recovery turning around at begin checkpoint 957841798 time 29360696
Fri Aug 7 09:21:46 2015 TokuFT recovery starts scanning forward to 957963270 from 957841798 left 121472 (fw_between)
Fri Aug 7 09:22:10 2015 TokuFT lsn 957930340 commit xid 405185:0 2105344/293672232 1%
Fri Aug 7 09:22:40 2015 TokuFT lsn 957930340 commit xid 405185:0 5748736/293672232 2%
Fri Aug 7 09:23:10 2015 TokuFT lsn 957930340 commit xid 405185:0 9150464/293672232 3%
--- snip --
Fri Aug 7 09:36:40 2015 TokuFT lsn 957930340 commit xid 405185:0 115834880/293672232 39%
Fri Aug 7 09:36:55 2015 TokuFT lsn 957930340 commit xid 405185:0 118014976/293672232 40%
Fri Aug 7 09:37:10 2015 TokuFT lsn 957930340 commit xid 405185:0 120132608/293672232 41%
Fri Aug 7 09:37:25 2015 TokuFT lsn 957930340 commit xid 405185:0 122251264/293672232 42%
Fri Aug 7 09:37:40 2015 TokuFT lsn 957930340 commit xid 405185:0 124372992/293672232 42%
-- snip --
Fri Aug 7 09:38:40 2015 TokuFT lsn 957930340 commit xid 405185:0 133026816/293672232 45%
Fri Aug 7 09:38:55 2015 TokuFT lsn 957930340 commit xid 405185:0 135154688/293672232 46%
Fri Aug 7 09:39:25 2015 TokuFT lsn 957930340 commit xid 405185:0 139439104/293672232 47%
-- snip --
Fri Aug 7 09:41:10 2015 TokuFT lsn 957930340 commit xid 405185:0 154284032/293672232 53%
Fri Aug 7 09:41:25 2015 TokuFT lsn 957930340 commit xid 405185:0 156420096/293672232 53%
Fri Aug 7 09:41:40 2015 TokuFT lsn 957930340 commit xid 405185:0 158516224/293672232 54%

150807 09:41:42 mysqld_safe Transparent huge pages are already set to: never.
150807 09:41:42 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql

So while the log sequence and the TokuFT commit xid numbers seems to stay static, the scanned checkpoints and % complete (always ~49-55%) change from run to run.

It’s unclear if this is somehow unrolling some gigantic transaction (the user had done some huge table updates, DDL changes, etc recently and I suspect had autocommit=0) or this is loop that’s never going to end.

Any thoughts on how I can determine if this is a valid recovery or if more drastic measures are in order.

Thanks.

In the looking in all the wrong places department. Turned out to be systemd helpfully killing and restarting Percona after a 10 minute timeout. I was so glued to the mysql log, I didn’t look at the system log where I would have seen the plain as day systemd error.

I had a similar problem and increased TimeoutSec to 3000 in /usr/lib/systemd/system/mysqld.service. 10 minutes is really low when Mysql is recovering from a crash.