XtraBackup locks dual-node cluster?

sindarina · March 15, 2016, 10:56am

We run a dual-node cluster for a customer: the primary node serves production, and the secondary node is for manual failover, backups, data exports, reporting and such. They had a problem today where their pool would fill up every hour with LOCK WAIT, because commits were taking too long to execute. So far we’ve tracked this down to the XtraBackup snapshots being created on the secondary node every hour; disabling those removes the recurring problem.

The MySQL error log shows the following type of warning when this happens:

==
2016-03-15 14:33:47 11632 [Warning] WSREP: galera/src/galera_service_thd.cpp:thd_func():60: Failed to report last committed 17922712, -4 (Interrupted system call)
2016-03-15 14:35:58 11632 [Warning] WSREP: gcs/src/gcs_sm.hpp:_gcs_sm_enqueue_common():210: send monitor wait timed out, waited for PT12.377S
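
To see how often and how long these stalls occur, and whether they line up with the hourly backup window, the wait durations can be pulled out of the error log. The following is a minimal Python sketch, not an official tool: the function name `send_monitor_waits` and the regular expression are assumptions inferred from the sample line above, and it only handles durations written in the seconds-only form (`PT12.377S`).

```python
import re

# Pattern inferred from the sample warning above (an assumption, not taken
# from Galera documentation). Captures the timestamp and the wait duration,
# which is written in an ISO 8601 style seconds-only form such as PT12.377S.
SEND_MONITOR_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) .*"
    r"send monitor wait timed out, waited for PT(?P<secs>\d+(?:\.\d+)?)S"
)


def send_monitor_waits(lines):
    """Return (timestamp, seconds) tuples for each send monitor timeout line."""
    waits = []
    for line in lines:
        match = SEND_MONITOR_RE.search(line)
        if match:
            waits.append((match.group("ts"), float(match.group("secs"))))
    return waits


if __name__ == "__main__":
    sample = [
        "2016-03-15 14:33:47 11632 [Warning] WSREP: galera/src/galera_service_thd.cpp:thd_func():60: "
        "Failed to report last committed 17922712, -4 (Interrupted system call)",
        "2016-03-15 14:35:58 11632 [Warning] WSREP: gcs/src/gcs_sm.hpp:_gcs_sm_enqueue_common():210: "
        "send monitor wait timed out, waited for PT12.377S",
    ]
    for ts, secs in send_monitor_waits(sample):
        print(ts, secs)
```

Feeding it the lines of the real error log (for example via `open(path)`) gives a list of timestamps that can be compared against the backup schedule.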

I am confused as to where this is coming from, however. innobackupex is supposedly non-blocking, and we’ve not changed anything in this backup procedure for at least six months or so. This is the first time we’ve noticed it as a big enough problem that it actually disrupts production. I suspect we’re hitting some kind of threshold in terms of database size and concurrent traffic, but the backups are very deliberately not run on the primary node for this exact reason.

innobackupex runs with basic options: a defaults-extra-file for username and password, daily full, hourly incremental, nothing else customised.

Are we overlooking something obvious, maybe? Do we need to specify additional options above a certain size?

sindarina · March 19, 2016, 5:00am

After poking at this for several days, the most logical explanation seems to be AWS I/O: increased iowait during XtraBackup snapshots on the secondary cluster node drags down the performance of the primary node, leading to commit times that are way too high. The pool fills up and the application starts failing.
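
One way to check the iowait theory is to sample the aggregate `cpu` line of `/proc/stat` on the secondary node before and during a backup and compare the iowait share. Below is a minimal sketch, assuming a Linux host; the helper names `iowait_percent` and `read_cpu_line` are hypothetical, and tools such as `iostat` or `vmstat` report the same figure.

```python
import time


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two /proc/stat 'cpu' lines."""

    def fields(line):
        parts = line.split()
        if not parts or parts[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
        # Order: user nice system idle iowait irq softirq steal.
        # The trailing guest fields are skipped; they are already part of user/nice.
        return [int(value) for value in parts[1:9]]

    deltas = [b - a for a, b in zip(fields(before), fields(after))]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is iowait


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which is the aggregate 'cpu' line."""
    with open(path) as handle:
        return handle.readline()


if __name__ == "__main__":
    first = read_cpu_line()
    time.sleep(5)  # sampling interval; the length is an arbitrary choice
    second = read_cpu_line()
    print("iowait: %.1f%%" % iowait_percent(first, second))
```

Running it once outside the backup window and once during a snapshot gives two numbers to compare, which is easier to reason about than a single reading.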

We already moved from a ‘t2.large’ to an ‘m4.large’ instance for the secondary cluster node, since AWS seems to supply M-type instances with minimum guaranteed I/O, but the iowait actually seems to have gone UP since then. Does anyone else have this experience? What does one do for a Galera cluster if your neighbours on AWS pull heavy I/O as well?
