We have a Percona XtraDB Cluster (Percona-XtraDB-Cluster-server-57-5.7.31-31.45.2.el7.x86_64) with 3 nodes. A ProxySQL instance (proxysql2-2.0.13-1.1.el7.x86_64) is configured to monitor the nodes and switch traffic between them if there is a node outage.
We have the monitoring account disabled (mysql-monitor_enabled = false) so that we can move traffic manually whenever we want to.
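For reference, this is roughly how we disable it on the ProxySQL admin interface (the exact port and credentials are our local setup):

```sql
-- ProxySQL admin interface (typically port 6032)
UPDATE global_variables SET variable_value = 'false'
WHERE variable_name = 'mysql-monitor_enabled';
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;
```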
When we want to do a maintenance break on the active writer node, we use the following procedure to move traffic from the current writer to another node (one of the readers):
```sql
UPDATE mysql_servers SET status = 'OFFLINE_HARD' WHERE hostname = '<CURRENT_WRITER>' AND hostgroup_id = 10;
UPDATE mysql_servers SET status = 'ONLINE' WHERE hostname = '<NEW_WRITER>' AND hostgroup_id = 10;
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
```
While running these commands, we had max_writers = 1 set in the mysql_galera_hostgroups table.
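After loading to runtime, we confirm the switch took effect with a check like this against the ProxySQL admin interface (assuming hostgroup 10 is the writer hostgroup, as above):

```sql
-- Inspect the runtime (not just configured) server state
-- for the writer hostgroup.
SELECT hostgroup_id, hostname, status
FROM runtime_mysql_servers
WHERE hostgroup_id = 10;
```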
The whole procedure went well and we switched traffic successfully. Once maintenance on the old writer was done, we used the same procedure to bring it back as the writer, setting the current one to OFFLINE_HARD. We have used this procedure more than once and never (until now) had a problem with it. But last time, the Percona cluster went down with the following message in the logs:
```
2021-03-17T07:51:06.877604Z 8 [Note] InnoDB: BF-BF X lock conflict, mode: 1027 supremum: 0
2021-03-17T07:51:06.889688Z 8 [Note] InnoDB: conflicts states: my: 0 locked: 0
RECORD LOCKS space id 483161 page no 83453 n bits 120 index PRIMARY of table `DatabaseName`.`TableName` trx id 89290309687 lock_mode X locks rec but not gap
```
When we debugged this further, we found that the conflict was caused by transactions that were still ongoing on the new writer node when we switched back to the original writer. The cluster stopped working because it could not resolve the conflict.
We can see that there is a documented procedure for switching traffic to another node for maintenance in Percona's documentation pages → proxysql-switch-traffic, and it suggests stopping queries (set global wsrep_reject_queries=all;) on the node we want to remove from writing. However, when we tried it, we experienced some 500 errors in the application (SQLSTATE[HY000]: General error: 2013 Lost connection to MySQL server during query). The behavior is the same whether the monitoring account is enabled or we do it manually.
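For completeness, the sequence we tried based on that documentation page looks roughly like this (node names are placeholders from our setup):

```sql
-- On the current writer node itself (direct MySQL connection,
-- not the ProxySQL admin): reject all running and future queries.
SET GLOBAL wsrep_reject_queries = ALL;

-- Then, on the ProxySQL admin interface, promote another node to writer:
UPDATE mysql_servers SET status = 'ONLINE' WHERE hostname = '<NEW_WRITER>' AND hostgroup_id = 10;
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
```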
If we understood correctly, the safest way to keep the cluster healthy is to stop all running (and future) queries (with wsrep_reject_queries = ALL) on the active writer and only then switch to another one, but this brings a small downtime (a second or two) to our application. Is this correct, or are we missing something? Any suggestions or recommendations would be helpful. Thank you.