Recent issues with "Waiting for backup lock"

Hi, some help please. Over the last couple of days, we’ve been having issues with our cluster. When a node goes down for some reason and starts coming back via SST, we’ve been getting a status of “Waiting for backup lock” whenever a “CREATE” or “DROP” table query is executed. Both are executed on InnoDB tables.


 
What this does is lock until the node is completely back up, which can take up to an hour. This renders the whole cluster unavailable during this time. Any ideas what is causing this? I’ve been running this cluster for 4+ years and this is the first time I’m seeing this. Thanks!

Hi @tucj7
It would be great if you could share the version details of xtrabackup and the cluster.

Hi,
No problem:

Server version: 5.7.26-29-57 Percona XtraDB Cluster (GPL), Release rel29, Revision 03540a3, WSREP version 31.37, wsrep_31.37
xtrabackup version 2.4.15 based on MySQL server 5.7.19 Linux (x86_64) (revision id: 544842a)



This is what’s running during the SST


It looks like it could be related to this: https://jira.percona.com/browse/PXC-2365
Can someone confirm that this is the same issue? Is there a resolution to this? Or, at least, a workaround?

Hi @tucj7
Yes, you are right.
As per my understanding, I was expecting this issue only on older versions, but given the versions you have shared, that is not the case here.
Looking at the above Jira, it appears to be a bug. Maybe someone can recommend an interim solution, if any.





@“lorraine.pocklington” is this something you could check out?

Hi
Let me start by saying that the behaviour you are experiencing is expected by design. On 5.7, the xtrabackup command used for SST includes the parameter --lock-ddl (https://github.com/percona/percona-xtradb-cluster/blob/5.7/scripts/wsrep_sst_xtrabackup-v2.sh#L1582), which will execute the LOCK TABLES FOR BACKUP command (https://www.percona.com/doc/percona-xtrabackup/LATEST/xtrabackup_bin/xbk_option_reference.html#cmdoption-lock-ddl).
All this is to guarantee consistency of the backup.
The fix is to avoid running DDLs on the cluster during the SST process (or at least at the beginning of it, where the backup locking happens).
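
As a rough illustration of avoiding DDL during SST, a scheduled job could check the cluster state before issuing any CREATE or DROP. The sketch below is only an assumption about how such a guard might look: the function name `ddl_is_safe`, the node names, and the `expected_nodes` parameter are hypothetical, and the state strings are assumed to come from running SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment' on each reachable node. It narrows the window rather than closing it, since an SST can still start right after the check.

```python
from typing import Mapping

# Value of wsrep_local_state_comment on a node that is fully in sync.
SYNCED = "Synced"


def ddl_is_safe(node_states: Mapping[str, str], expected_nodes: int) -> bool:
    """Return True only if every expected node reported in and is Synced.

    node_states maps a (hypothetical) node name to the value of
    wsrep_local_state_comment collected from that node.
    """
    if len(node_states) < expected_nodes:
        # A node that could not be queried may be down or rejoining via SST.
        return False
    # A donor serving SST reports "Donor/Desynced", so it fails this check.
    return all(state == SYNCED for state in node_states.values())


if __name__ == "__main__":
    # Example: node3 is unreachable and node2 is acting as SST donor.
    states = {"node1": "Synced", "node2": "Donor/Desynced"}
    if ddl_is_safe(states, expected_nodes=3):
        print("cluster healthy, running DDL")
    else:
        print("SST may be in progress, skipping DDL")
```

A job that gets a negative answer could retry later instead of queueing a statement that would sit in “Waiting for backup lock”.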

Thanks for the feedback - that makes sense. The trouble I have is that when this happens it locks up the DB for the entire SST process, which currently runs around 1.5 hours.
We have a bunch of jobs (and client-initiated jobs) that run regularly and create/drop temp tables, and it’s impossible to know when a node will go down and cause an SST. It turns out this was likely happening due to a memory issue on one node.
Do you have any recommendations on how to anticipate an SST and then push a change to crons/etc. to prevent DDL during this process? Or is there a better way to get around this? My constant fear is that this happens at critical times or after hours and causes considerable problems.