I’m running a 3-node cluster (5.7.23), and every time a node rejoins the cluster, the donor node becomes unresponsive to write queries. SHOW PROCESSLIST shows that these write queries are stuck in the state
wsrep: initiating pre-commit for write set.
According to the State Snapshot Transfer documentation, xtrabackup_v2 should not lock the donor node for writes.
I think my issue is related to Donor replication queue overflow during SST - #10 by tucj7, but unfortunately that issue was never resolved.
FYI: I think my issue is related to Percona JIRA, but the JIRA ticket is filed only as an improvement, when it should be a high-priority bug.
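For anyone trying to confirm the same symptom, here is a minimal sketch of how the stuck writes could be picked out of SHOW PROCESSLIST output. It assumes the rows have already been fetched as dictionaries keyed by the standard column names (Id, Time, State, Info); the function name `stuck_writes` and the 30-second threshold are hypothetical choices for illustration, not anything from Percona.

```python
from typing import Iterable, List, Mapping

# State string reported in this thread for the blocked write queries.
STUCK_STATE = "wsrep: initiating pre-commit for write set"


def stuck_writes(rows: Iterable[Mapping], min_seconds: int = 30) -> List[Mapping]:
    """Return processlist rows that have sat in the pre-commit state too long.

    `rows` is assumed to be SHOW PROCESSLIST output as dict-like rows.
    `min_seconds` is an arbitrary threshold (an assumption) used to skip
    queries that only pass through the state briefly.
    """
    stuck = []
    for row in rows:
        state = row.get("State") or ""   # State can be NULL for idle threads
        seconds = row.get("Time") or 0   # Time is seconds spent in the current state
        if state.startswith(STUCK_STATE) and seconds >= min_seconds:
            stuck.append(row)
    return stuck
```

Running this against the donor while a node is joining should show whether the blocked queries match the state above or are waiting on something else.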
Do you have DDLs running while the node is joining?
Yes, there were some DDLs, as far as I remember. If you need, I can try to reproduce it. I’m also sure that there were some DCL statements granting access that were stuck in the checking permissions state.
So that’s essentially your issue. You need to prevent DDLs/DCLs while nodes are joining. Constant DDLs are not best-practice and should only be happening during maintenance windows.
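One way to approach this is a pre-flight check before a migration runs. The sketch below assumes the value of the `wsrep_local_state_comment` status variable has already been collected from every node (for example with SHOW STATUS LIKE 'wsrep_local_state_comment'); the function name `safe_to_run_ddl` and the `expected_nodes` parameter are hypothetical. It only narrows the window, since a node could still start joining right after the check passes.

```python
from typing import Mapping

# Value of wsrep_local_state_comment on a node that is fully in sync.
# A donor reports "Donor/Desynced"; a joiner reports "Joining" or "Joined".
SYNCED = "Synced"


def safe_to_run_ddl(node_states: Mapping[str, str], expected_nodes: int) -> bool:
    """Return True only if every expected node reported in and is Synced.

    `node_states` maps a node name to its wsrep_local_state_comment value.
    `expected_nodes` is the cluster size you expect (an assumption here);
    a node that is mid-SST may not answer at all, so a missing entry is
    treated as unsafe.
    """
    if len(node_states) < expected_nodes:
        return False
    return all(state == SYNCED for state in node_states.values())
```

A deployment script could call this and refuse to start the DDL, or retry later, whenever it returns False.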
Hi Matt, sorry for the late reply. Even if this is how Percona was designed, don’t you think Percona should raise an error in this case when a user starts a DDL, rather than leaving everything to users and letting the whole system hang?
Is there any way to kill the DDL to temporarily resolve the issue? If it happens again (maybe by mistake rather than by design), it will take us a day for the replication to complete, which is a significant and unacceptable amount of downtime. Besides, sometimes the replication is started automatically by an autoscaling group, and it’s not easy for us to disable everything related before doing DDL maintenance.
Secondly, nowadays migration frameworks such as Django’s can make DDL run more frequently and transparently to the developer. Just defining a new model can cause this.
You would need to upgrade to PXC 8.0 which would allow the use of Xtrabackup 8.0 during the SST process. There is a feature in XB8 which blocks DDLs during the backup process.
Secondly, nowadays migration frameworks such as Django’s can make DDL run more frequently and transparently to the developer.
It is the responsibility of developers to understand what their code is doing and understand the underlying frameworks. If they are not aware of the consequences of their actions, they shouldn’t be allowed to operate on a production environment.