We run a two-node cluster for a customer: the primary node serves production traffic, and the secondary node is used for manual failover, backups, data exports, reporting and the like. Today they hit a problem where their connection pool would fill up every hour with transactions stuck in LOCK WAIT, because commits were taking too long to complete. So far we've tracked this down to the hourly XtraBackup snapshots created on the secondary node; disabling those makes the recurring problem go away.
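To illustrate what we're seeing, this is roughly how we look at what the pooled connections are stuck on during an episode (the credentials file path below is a placeholder, not our actual setup):

==
# Connections piling up in commit, and the InnoDB transactions sitting in LOCK WAIT:
mysql --defaults-extra-file=/etc/mysql/backup.cnf -e "SHOW FULL PROCESSLIST;"
mysql --defaults-extra-file=/etc/mysql/backup.cnf -e \
  "SELECT trx_id, trx_state, trx_started, trx_query
     FROM information_schema.INNODB_TRX
    WHERE trx_state = 'LOCK WAIT';"
==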
The MySQL error log shows warnings like the following when this happens:
==
2016-03-15 14:33:47 11632 [Warning] WSREP: galera/src/galera_service_thd.cpp:thd_func():60: Failed to report last committed 17922712, -4 (Interrupted system call)
2016-03-15 14:35:58 11632 [Warning] WSREP: gcs/src/gcs_sm.hpp:_gcs_sm_enqueue_common():210: send monitor wait timed out, waited for PT12.377S
==
I'm confused about where this is coming from, though: innobackupex is supposedly non-blocking, and we haven't changed anything in this backup procedure for at least six months. This is the first time it has been a big enough problem to actually disrupt production. I suspect we're hitting some kind of threshold in database size and concurrent traffic, but the backups are very deliberately not run on the primary node for exactly this reason.
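If it's relevant, this is the kind of check I can run on both nodes while the hourly backup is in progress; the host names and credentials file are placeholders, but the variable names are the standard wsrep flow-control status counters:

==
# Sketch: compare Galera flow-control counters on both nodes during a backup run.
for host in db-primary db-secondary; do
  echo "--- $host ---"
  mysql --defaults-extra-file=/etc/mysql/backup.cnf -h "$host" -e \
    "SHOW GLOBAL STATUS WHERE Variable_name IN
       ('wsrep_flow_control_paused',
        'wsrep_flow_control_sent',
        'wsrep_flow_control_recv',
        'wsrep_local_recv_queue_avg');"
done
==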
innobackupex runs with basic options: --defaults-extra-file for the username and password, a daily full backup, hourly incrementals, and nothing else customised.
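For what it's worth, the invocation is roughly like this; the credentials file path and backup directories are placeholders, not our exact layout:

==
# Daily full backup, taken on the secondary node:
innobackupex --defaults-extra-file=/etc/mysql/backup.cnf /backups/full

# Hourly incremental, based on the most recent full (or previous incremental):
innobackupex --defaults-extra-file=/etc/mysql/backup.cnf \
  --incremental --incremental-basedir=/backups/full/latest \
  /backups/incremental
==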
Are we overlooking something obvious here? Do we need to specify additional options once the database grows past a certain size?