Hello.
We are having issues trying to back up our Percona XtraDB Cluster 8.0.23 with Xtrabackup 8.
Our setup is as follows:
Two nodes (N1, N2) with Arbiter (A).
The app connects only to N1; N2 is only a failover node and the host for creating backups.
We are trying to create backup running this command on N2:
xtrabackup --ftwrl-wait-timeout=180 --lock-ddl-timeout=180 --backup-lock-timeout=180 --stream=xbstream --backup --target-dir /mnt/db/backup 2> /mnt/db/backup/stream_backup.log | pigz - > /mnt/db/backup/$(date +%F-%H-%M-%S).tar.gz
Most of the time we end up with this error message:
Error: failed to execute query 'SELECT server_uuid, local, replication, storage_engines FROM performance_schema.log_status': 1205 (HY000) Lock wait timeout exceeded; try restarting transaction
It also causes the whole cluster to stop responding near the end of the backup (with a 'too many connections' error message), but it resumes normal operations as soon as the backup fails or completes successfully.
We also tried --safe-slave-backup, but the cluster lockup is even more noticeable, so we had to abort.
Any ideas?
Hello @mkconel,
I find your xtrabackup command to be excessively complex. Here's what I use when backing up PXC:
xtrabackup --backup --parallel 4 --galera-info --stream xbstream --target-dir /path/datadir >backup.xbstream
Try the above first and see if that fixes your issue.
Also, pigz does not create tar archives, so naming the output file .tar.gz is incorrect.
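If you want to keep compression in the pipeline, here is a minimal sketch of how your original command could be adjusted so the file name matches what is actually inside it (the paths and date-based name are simply carried over from your post, not a recommendation):

# Stream the backup as xbstream and compress it with pigz.
# The result is a gzip-compressed xbstream, not a tar archive,
# so name it .xbstream.gz instead of .tar.gz.
xtrabackup --backup --galera-info --stream=xbstream \
  --target-dir=/mnt/db/backup 2> /mnt/db/backup/stream_backup.log \
  | pigz > /mnt/db/backup/$(date +%F-%H-%M-%S).xbstream.gz

To restore, you would decompress with pigz -d (or gunzip) and unpack the stream with xbstream -x into an empty directory before running --prepare.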
Hi.
Unfortunately, it does not work; we get the same error message as before.
Error: failed to execute query 'SELECT server_uuid, local, replication, storage_engines FROM performance_schema.log_status': 1205 (HY000) Lock wait timeout exceeded; try restarting transaction
We also gave the --safe-slave-backup option a go overnight, but no luck either. It caused considerable downtime and still resulted in an error.
211008 03:04:39 [01] Streaming test/db.opt to
211008 03:04:39 [01] ...done
211008 03:04:39 Finished backing up non-InnoDB tables and files
211008 03:04:39 Executing FLUSH NO_WRITE_TO_BINLOG BINARY LOGS
211008 03:04:39 Selecting LSN and binary log position from p_s.log_status
211008 03:04:40 >> log scanned up to (267586782003)
211008 03:04:41 >> log scanned up to (267586782003)
211008 03:04:42 >> log scanned up to (267586782003)
211008 03:04:43 >> log scanned up to (267586782003)
211008 03:04:44 >> log scanned up to (267586782003)
211008 03:04:45 >> log scanned up to (267586782003)
211008 03:04:46 >> log scanned up to (267586782003)
211008 03:04:47 >> log scanned up to (267586782003)
211008 03:04:48 >> log scanned up to (267586782003)
211008 03:04:49 >> log scanned up to (267586782003)
211008 03:04:50 >> log scanned up to (267586782003)
211008 03:04:51 >> log scanned up to (267586782003)
211008 03:04:52 >> log scanned up to (267586782003)
211008 03:04:53 >> log scanned up to (267586782003)
211008 03:04:54 >> log scanned up to (267586782003)
211008 03:04:55 >> log scanned up to (267586782003)
211008 03:04:56 >> log scanned up to (267586782003)
211008 03:04:57 >> log scanned up to (267586782003)
211008 03:04:58 >> log scanned up to (267586782003)
211008 03:04:59 >> log scanned up to (267586782003)
211008 03:05:00 >> log scanned up to (267586782003)
211008 03:05:01 >> log scanned up to (267586782003)
211008 03:05:02 >> log scanned up to (267586782003)
211008 03:05:03 >> log scanned up to (267586782003)
211008 03:05:04 >> log scanned up to (267586782003)
211008 03:05:05 >> log scanned up to (267586782003)
211008 03:05:06 >> log scanned up to (267586782003)
211008 03:05:07 >> log scanned up to (267586782003)
211008 03:05:08 >> log scanned up to (267586782003)
211008 03:05:09 >> log scanned up to (267586782003)
Error: failed to execute query 'SELECT server_uuid, local, replication, storage_engines FROM performance_schema.log_status': 1205 (HY000) Lock wait timeout exceeded; try restarting transaction
Your issue is happening while xtrabackup queries performance_schema.log_status - this table provides a consistent snapshot of all storage engines plus the GTID/binlog positions.
For InnoDB (the part that seems to be timing out for you), taking that snapshot requires the log buffer and checkpoint lock/mutex. However, something else is holding it during this time and has not released it within 30 seconds.
Would it be possible for you to capture the output of the statements below after xtrabackup has issued the SELECT against log_status and before it times out?
SHOW ENGINE INNODB STATUS\G
SHOW FULL PROCESSLIST;
-- Run before starting the backup:
UPDATE performance_schema.setup_instruments SET ENABLED = 'YES', TIMED = 'YES' WHERE NAME LIKE 'wait/synch/mutex/innodb/log_sn_mutex';
UPDATE performance_schema.setup_instruments SET ENABLED = 'YES', TIMED = 'YES' WHERE NAME LIKE 'wait/synch/mutex/innodb/log_checkpointer_mutex';
UPDATE performance_schema.setup_consumers
SET ENABLED = 'YES'
WHERE NAME LIKE 'events_waits%';
-- Once the backup is waiting on log_status:
SELECT * FROM performance_schema.events_waits_current w
  JOIN performance_schema.mutex_instances m USING (OBJECT_INSTANCE_BEGIN)
  JOIN performance_schema.threads t1 ON m.LOCKED_BY_THREAD_ID = t1.THREAD_ID
  JOIN performance_schema.threads t2 ON w.THREAD_ID = t2.THREAD_ID \G
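If it is easier to grab, a look at metadata locks at the same moment can also reveal whether some DDL is holding or waiting on the backup lock. This is just an optional extra and assumes the default metadata lock instrumentation is enabled:

-- Optional: while xtrabackup is stuck on log_status, list sessions that
-- hold or are waiting for backup/global/table metadata locks.
SELECT OBJECT_TYPE, OBJECT_SCHEMA, OBJECT_NAME, LOCK_TYPE, LOCK_STATUS, OWNER_THREAD_ID
FROM performance_schema.metadata_locks
WHERE LOCK_STATUS = 'PENDING'
   OR OBJECT_TYPE IN ('BACKUP LOCK', 'GLOBAL');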
Thanks in advance.
We have finally found a solution to this problem.
Following your advice, we checked what exactly was running on our cluster while the backup was being made and found out that at a very specific moment our app launches a few queries like the one below.
ALTER TABLE table AUTO_INCREMENT = 1
And it just so happens that the moment these queries are launched, xtrabackup is in the middle of creating the backup.
When we disabled those queries, all the issues went away.
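For anyone else debugging the same thing, a simple processlist filter like the one below (purely illustrative) can help spot conflicting DDL while the backup is waiting on log_status:

-- Illustrative: show any ALTER TABLE running while the backup hangs.
SELECT ID, USER, DB, COMMAND, TIME, STATE, INFO
FROM information_schema.PROCESSLIST
WHERE INFO LIKE 'ALTER TABLE%';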
Thank you for your help.