Lock wait timeout exceeded; try restarting transaction

Hello.

We are having issues when trying to back up our Percona XtraDB Cluster 8.0.23 with XtraBackup 8.

Our setup is as follows:
Two nodes (N1, N2) with an arbiter (A).
The app connects only to N1; N2 serves only as a failover and as the host for creating backups.

We are trying to create a backup by running this command on N2:
xtrabackup --ftwrl-wait-timeout=180 --lock-ddl-timeout=180 --backup-lock-timeout=180 --stream=xbstream --backup --target-dir /mnt/db/backup 2> /mnt/db/backup/stream_backup.log | pigz - > /mnt/db/backup/$(date +%F-%H-%M-%S).tar.gz

Most of the time we end up with this error message:
Error: failed to execute query 'SELECT server_uuid, local, replication, storage_engines FROM performance_schema.log_status': 1205 (HY000) Lock wait timeout exceeded; try restarting transaction

It also causes the whole cluster to stop responding near the end of the backup (with a "too many connections" error message), but normal operation resumes as soon as the backup fails or completes successfully.
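For reference, the connection pile-up can be watched from a separate session while the backup runs; this is only a basic check, not anything specific to our setup:

-- watch current connections against the configured limit while the backup runs
SHOW GLOBAL STATUS LIKE 'Threads_connected';
SHOW GLOBAL VARIABLES LIKE 'max_connections';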

We also tried --safe-slave-backup, but the cluster lockup is even more noticeable, so we had to abort.

Any ideas?

Hello @mkconel,
I find your xtrabackup command to be excessively complex. Here’s what I use when backing up PXC:

xtrabackup --backup --parallel 4 --galera-info --stream xbstream --target-dir /path/datadir >backup.xbstream

Try the above first and see if that fixes your issue.

Also, pigz does not create tar archives, so naming the output files .tar.gz is incorrect.
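For what it's worth, here is a minimal sketch of how the compressed stream could be produced and later unpacked; the timestamped file name and the /mnt/db/backup/restore directory are only illustrative:

# stream the backup and compress it; the result is a gzipped xbstream, not a tar archive
xtrabackup --backup --stream=xbstream --target-dir=/mnt/db/backup 2> /mnt/db/backup/stream_backup.log | pigz > /mnt/db/backup/$(date +%F-%H-%M-%S).xbstream.gz

# later, decompress and unpack it with xbstream into an (illustrative) restore directory
pigz -dc /mnt/db/backup/2021-10-08-03-00-00.xbstream.gz | xbstream -x -C /mnt/db/backup/restore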

Hi.

Unfortunately, it does not work; we get the same error message as before.

Error: failed to execute query 'SELECT server_uuid, local, replication, storage_engines FROM performance_schema.log_status': 1205 (HY000) Lock wait timeout exceeded; try restarting transaction

We also gave the --safe-slave-backup option a go overnight, but no luck either. It caused considerable downtime and still resulted in an error.

211008 03:04:39 [01] Streaming test/db.opt to
211008 03:04:39 [01] 
done
211008 03:04:39 Finished backing up non-InnoDB tables and files
211008 03:04:39 Executing FLUSH NO_WRITE_TO_BINLOG BINARY LOGS
211008 03:04:39 Selecting LSN and binary log position from p_s.log_status
211008 03:04:40 >> log scanned up to (267586782003)
211008 03:04:41 >> log scanned up to (267586782003)
211008 03:04:42 >> log scanned up to (267586782003)
211008 03:04:43 >> log scanned up to (267586782003)
211008 03:04:44 >> log scanned up to (267586782003)
211008 03:04:45 >> log scanned up to (267586782003)
211008 03:04:46 >> log scanned up to (267586782003)
211008 03:04:47 >> log scanned up to (267586782003)
211008 03:04:48 >> log scanned up to (267586782003)
211008 03:04:49 >> log scanned up to (267586782003)
211008 03:04:50 >> log scanned up to (267586782003)
211008 03:04:51 >> log scanned up to (267586782003)
211008 03:04:52 >> log scanned up to (267586782003)
211008 03:04:53 >> log scanned up to (267586782003)
211008 03:04:54 >> log scanned up to (267586782003)
211008 03:04:55 >> log scanned up to (267586782003)
211008 03:04:56 >> log scanned up to (267586782003)
211008 03:04:57 >> log scanned up to (267586782003)
211008 03:04:58 >> log scanned up to (267586782003)
211008 03:04:59 >> log scanned up to (267586782003)
211008 03:05:00 >> log scanned up to (267586782003)
211008 03:05:01 >> log scanned up to (267586782003)
211008 03:05:02 >> log scanned up to (267586782003)
211008 03:05:03 >> log scanned up to (267586782003)
211008 03:05:04 >> log scanned up to (267586782003)
211008 03:05:05 >> log scanned up to (267586782003)
211008 03:05:06 >> log scanned up to (267586782003)
211008 03:05:07 >> log scanned up to (267586782003)
211008 03:05:08 >> log scanned up to (267586782003)
211008 03:05:09 >> log scanned up to (267586782003)
Error: failed to execute query 'SELECT server_uuid, local, replication, storage_engines FROM performance_schema.log_status': 1205 (HY000) Lock wait timeout exceeded; try restarting transaction

Your issue is happening while xtrabackup is querying performance_schema.log_status; this table provides a consistent snapshot of the position of every storage engine plus the GTID/binlog state.
For InnoDB (the part that seems to be timing out for you) it requires the log buffer and checkpoint locks/mutexes. However, something else is holding them at that moment and has not released them within 30 seconds.
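One quick way to confirm this is to run the same statement that xtrabackup issues from a separate client session while the backup is at that stage; if it also hangs and eventually fails with error 1205, the blocked-mutex theory fits:

-- run manually from a second session while the backup is waiting on log_status
SELECT server_uuid, local, replication, storage_engines
FROM performance_schema.log_status\G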

Would it be possible for you to capture the output of the statements below after xtrabackup has issued the SELECT against log_status and before it times out?

  1. SHOW ENGINE INNODB STATUS\G
  2. SHOW FULL PROCESSLIST;
  3. -- Run before starting the backup:
     UPDATE performance_schema.setup_instruments SET ENABLED = 'YES', TIMED = 'YES' WHERE NAME LIKE 'wait/synch/mutex/innodb/log_sn_mutex';
     UPDATE performance_schema.setup_instruments SET ENABLED = 'YES', TIMED = 'YES' WHERE NAME LIKE 'wait/synch/mutex/innodb/log_checkpointer_mutex';
     UPDATE performance_schema.setup_consumers SET ENABLED = 'YES' WHERE NAME LIKE 'events_waits%';

     -- Once the backup waits on log_status:
     SELECT * FROM performance_schema.events_waits_current w
       JOIN performance_schema.mutex_instances m USING (OBJECT_INSTANCE_BEGIN)
       JOIN performance_schema.threads t1 ON m.LOCKED_BY_THREAD_ID = t1.THREAD_ID
       JOIN performance_schema.threads t2 ON w.THREAD_ID = t2.THREAD_ID\G
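Once the output is captured, the temporary instrumentation can be switched back off (a sketch; adjust the patterns if you enabled anything else):

-- revert the instrumentation enabled for this capture
UPDATE performance_schema.setup_instruments
SET ENABLED = 'NO', TIMED = 'NO'
WHERE NAME LIKE 'wait/synch/mutex/innodb/log_sn_mutex'
   OR NAME LIKE 'wait/synch/mutex/innodb/log_checkpointer_mutex';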

Thanks in advance.

We have finally found a solution to this problem.

Following your advice, we checked what exactly is running on our cluster while the backup is being made and found out that at a very specific moment our app launches a few queries like the one below.

ALTER TABLE table AUTO_INCREMENT = 1

It just so happens that these queries are launched while xtrabackup is in the middle of creating the backup.
When we disabled those queries, all the issues went away.
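For anyone hitting the same thing: a simple pre-backup check we could have used (purely illustrative, not part of our final fix) is to look for in-flight ALTER TABLE statements and delay the backup until they finish:

-- hypothetical pre-backup check: list ALTER TABLE statements that are still executing
SELECT t.PROCESSLIST_ID, s.SQL_TEXT
FROM performance_schema.events_statements_current s
JOIN performance_schema.threads t ON s.THREAD_ID = t.THREAD_ID
WHERE s.SQL_TEXT LIKE 'ALTER TABLE%';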

Thank you for your help.
