XtraBackup locks dual-node cluster?

We run a dual-node cluster for a customer: the primary node serves production, and the secondary node is used for manual failover, backups, data exports and reporting. Today they had a problem where their pool would fill up every hour (LOCK WAIT) because commits were taking too long to execute. So far we’ve tracked this down to the XtraBackup snapshots being created on the secondary node every hour; disabling those removes the recurring problem.

The MySQL error log shows the following type of warning when this happens:

2016-03-15 14:33:47 11632 [Warning] WSREP: galera/src/galera_service_thd.cpp:thd_func():60: Failed to report last committed 17922712, -4 (Interrupted system call)
2016-03-15 14:35:58 11632 [Warning] WSREP: gcs/src/gcs_sm.hpp:_gcs_sm_enqueue_common():210: send monitor wait timed out, waited for PT12.377S
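
To line these warnings up against the hourly backup schedule, something like the following minimal Python sketch could pull the timestamp and wait duration out of each "send monitor wait timed out" line. The function name is made up for illustration, and it assumes the duration is always printed as plain seconds (PT12.377S), which is all the log above shows.

```python
import re
from datetime import datetime

# Matches the Galera "send monitor wait timed out" warning and captures the
# line's timestamp plus the seconds value of the PT<seconds>S duration.
# Assumption: durations are always plain seconds, as in the sample log.
SEND_MONITOR_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) .*"
    r"send monitor wait timed out, waited for PT(?P<secs>\d+(?:\.\d+)?)S"
)


def send_monitor_waits(lines):
    """Return a list of (timestamp, seconds_waited) for matching log lines."""
    waits = []
    for line in lines:
        match = SEND_MONITOR_RE.search(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
            waits.append((ts, float(match.group("secs"))))
    return waits
```

Feeding it the lines of the error log and printing `ts.minute` for each hit should show quickly whether the timeouts cluster around the minute at which the incremental backup starts.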

I am confused as to where this is coming from, however: innobackupex is supposedly non-blocking, and we’ve not changed anything in this backup procedure for at least six months. This is the first time it has been a big enough problem to actually disrupt production. I suspect we’re hitting some kind of threshold in terms of database size and concurrent traffic, but the backups are very deliberately not run on the primary node for this exact reason.

innobackupex runs with basic options: a defaults-extra-file for username and password, a daily full backup, hourly incrementals, nothing else customised.
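
For reference, the invocations look roughly like what this Python sketch builds. It is an illustration only: the helper name and all paths are hypothetical, and the only options used are --defaults-extra-file, --incremental and --incremental-basedir.

```python
def innobackupex_argv(extra_file, target_dir, incremental_basedir=None):
    """Build an innobackupex command line for a full or incremental backup.

    All paths are placeholders. incremental_basedir should point at the
    directory of the backup the increment is based on.
    """
    # --defaults-extra-file is passed first, ahead of the other options.
    argv = ["innobackupex", f"--defaults-extra-file={extra_file}"]
    if incremental_basedir is not None:
        argv += ["--incremental", f"--incremental-basedir={incremental_basedir}"]
    argv.append(target_dir)
    return argv


# Daily full backup (hypothetical paths).
full_cmd = innobackupex_argv("/etc/mysql/backup.cnf", "/backups/full")

# Hourly incremental against that full backup (hypothetical paths).
incr_cmd = innobackupex_argv(
    "/etc/mysql/backup.cnf", "/backups/incr", incremental_basedir="/backups/full"
)
```

The resulting lists can be handed to `subprocess.run(...)` from a cron-driven wrapper; nothing beyond these options is set.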

Are we overlooking something obvious, maybe? Do we need to specify additional options above a certain size?

After poking at this for several days, the most logical explanation seems to be AWS I/O: rising iowait on the secondary cluster node during XtraBackup snapshots drags down the performance of the primary node, presumably because Galera replicates synchronously and the cluster can only commit as fast as its slowest node. Commit times become far too high, the pool fills up, and the application starts failing.
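
A rough way to put numbers on the iowait during a backup window is to sample /proc/stat on the secondary node before and after an interval. The sketch below is a minimal Python version of that idea. The function names are made up, and it assumes a Linux kernel that exposes the usual user/nice/system/idle/iowait/irq/softirq/steal counters on the aggregate "cpu" line.

```python
import time


def parse_cpu_line(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat content as ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two counter samples."""
    # Field order: user nice system idle iowait irq softirq steal.
    # Guest counters are skipped because they are already counted in user/nice.
    deltas = [a - b for a, b in zip(after[:8], before[:8])]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0


def sample_iowait(interval=5.0, path="/proc/stat"):
    """Sample the counters twice, `interval` seconds apart, on a live system."""
    with open(path) as stat_file:
        before = parse_cpu_line(stat_file.read())
    time.sleep(interval)
    with open(path) as stat_file:
        after = parse_cpu_line(stat_file.read())
    return iowait_percent(before, after)
```

Logging `sample_iowait()` once a minute on the secondary node and comparing the values inside and outside the backup window should show whether the iowait spike really coincides with the slow commits on the primary.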

We already moved from a ‘t2.large’ to an ‘m4.large’ instance for the secondary cluster node, since AWS seems to supply M-type instances with minimum guaranteed I/O, but the iowait actually seems to have gone UP since then. Does anyone else have this experience? What does one do for a Galera cluster if your neighbours on AWS pull heavy I/O as well?