Cluster fails replication and partitions node

I’m hoping someone can help out with this. We run each database for our AppSuite setup on a 3-node Percona XtraDB Galera cluster. This makes 12 nodes in groups of 3: the first cluster runs the configdb, and the other 3 clusters run 3 sets of user databases. In front of each cluster we have 2 HAProxy load balancers: a ‘read’ one which balances across all 3 nodes of the cluster, and a ‘write’ one which uses the first node exclusively and fails over to the 2nd and 3rd. We have an identical (but smaller) setup in another environment as well.
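For reference, the two balancer definitions look roughly like this (a simplified sketch with placeholder host names, ports and check settings, not our exact config):

    # 'read' balancer: round-robins across all three nodes of a cluster
    listen galera-read
        bind *:3307
        mode tcp
        balance roundrobin
        option httpchk                        # paired with a clustercheck agent on each node
        server db01 db01:3306 check port 9200
        server db02 db02:3306 check port 9200
        server db03 db03:3306 check port 9200

    # 'write' balancer: node 1 only, nodes 2 and 3 as failover backups
    listen galera-write
        bind *:3306
        mode tcp
        option httpchk
        server db01 db01:3306 check port 9200
        server db02 db02:3306 check port 9200 backup
        server db03 db03:3306 check port 9200 backup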

During periods of heavy write usage, one of the read nodes will fail to replicate, be partitioned from the cluster, and shut down. When we subsequently try to bring it back up, it performs an SST, but when it then attempts the final IST to rejoin the cluster, that fails, startup aborts, and mysqld exits. This means we cannot restart the node.

The only way I have found to recover the node is to shut down the entire cluster, bootstrap from the known-good node, and then have the other nodes join it. This clears the partitioning and lets them rejoin successfully, though the initial SST takes a long time.
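For reference, the recovery procedure we end up using is roughly this (a sketch, assuming the stock PXC units on CentOS 7 and that node 1 is the known-good node):

    # On all three nodes: stop MySQL completely
    systemctl stop mysql

    # On the known-good node only: bootstrap a new primary component
    systemctl start mysql@bootstrap.service

    # On the other two nodes: start normally; each rejoins and runs a full SST
    systemctl start mysql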

The questions I have are -

  • Why does the node get partitioned and exit in the first place? Is there an issue with the replication or with the load balancing, or is it something to do with how AppSuite writes to the cluster?
  • Why does the node not come back up correctly without fully reinitialising the cluster? Is there a way to clear the partitioning without a restart?
  • Is there anything we have configured incorrectly in the Galera setup that could have caused this problem?

The startup issue seems to be related to using xtrabackup rather than rsync as the WSREP SST method, but we need xtrabackup for availability, since an rsync SST blocks the donor for the duration of the copy.
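The relevant SST settings in my.cnf are along these lines (an excerpt from memory; the auth credentials are placeholders):

    [mysqld]
    wsrep_sst_method = xtrabackup-v2        # non-blocking donor; rsync would lock the donor during the copy
    wsrep_sst_auth   = "sstuser:placeholder-password"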

We do not have this problem on our other clusters, though this one has by far the highest load. The error messages seem to indicate data inconsistency, but I had thought that was not possible on a Galera cluster, since you can write to any node; and in any case, we only write to the first node.

Any information about this would be appreciated.

Steve

Server OS:
CentOS 7 64-bit; 4 vCPU, 36 GB RAM

Installed packages:
Percona-XtraDB-Cluster-galera-3-3.17-1.rhel7.x86_64
Percona-XtraDB-Cluster-shared-56-5.6.32-25.17.1.el7.x86_64
Percona-XtraDB-Cluster-server-56-5.6.32-25.17.1.el7.x86_64
Percona-XtraDB-Cluster-client-56-5.6.32-25.17.1.el7.x86_64
Percona-XtraDB-Cluster-56-5.6.32-25.17.1.el7.x86_64
percona-xtrabackup-2.3.5-1.el7.x86_64

Logs on failed node:

2017-01-17T09:28:34.408822+00:00 xm1adb05 mysqld: 2017-01-17 09:28:34 1786 [ERROR] Slave SQL: Could not execute Write_rows event on table cluster02-write_29.task_folder; Cannot add or update a child row: a foreign key constraint fails (cluster02-write_29.task_folder, CONSTRAINT task_folder_ibfk_2 FOREIGN KEY (cid, folder) REFERENCES oxfolder_tree (cid, fuid)), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event’s master log FIRST, end_log_pos 560, Error_code: 1452
2017-01-17T09:28:34.408829+00:00 xm1adb05 mysqld: 2017-01-17 09:28:34 1786 [Warning] WSREP: RBR event 7 Write_rows apply warning: 151, 21578565
2017-01-17T09:28:34.409672+00:00 xm1adb05 mysqld: 2017-01-17 09:28:34 1786 [Warning] WSREP: Failed to apply app buffer: seqno: 21578565, status: 1
2017-01-17T09:28:34.409678+00:00 xm1adb05 mysqld: #011 at galera/src/trx_handle.cpp:apply():351
2017-01-17T09:28:34.409680+00:00 xm1adb05 mysqld: Retrying 2th time …
2017-01-17T09:28:34.479661+00:00 xm1adb05 mysqld: 2017-01-17 09:28:34 1786 [ERROR] WSREP: Failed to apply trx: source: 5ef3f45a-db60-11e6-9045-96a1ffb3c027 version: 3 local: 0 state: APPLYING flags: 1 conn_id: 408736 trx_id: 93728670 seqnos (l: 984067, g: 21578565, s: 21578564, d: 21578448, ts: 9191936555185038)
2017-01-17T09:28:34.479664+00:00 xm1adb05 mysqld: 2017-01-17 09:28:34 1786 [ERROR] WSREP: Failed to apply trx 21578565 4 times
2017-01-17T09:28:34.479667+00:00 xm1adb05 mysqld: 2017-01-17 09:28:34 1786 [ERROR] WSREP: Node consistency compromized, aborting…
2017-01-17T09:28:34.479669+00:00 xm1adb05 mysqld: 2017-01-17 09:28:34 1786 [Note] WSREP: Closing send monitor…
2017-01-17T09:28:34.479671+00:00 xm1adb05 mysqld: 2017-01-17 09:28:34 1786 [Note] WSREP: Closed send monitor.
2017-01-17T09:28:34.479673+00:00 xm1adb05 mysqld: 2017-01-17 09:28:34 1786 [Note] WSREP: gcomm: terminating thread
2017-01-17T09:28:34.479675+00:00 xm1adb05 mysqld: 2017-01-17 09:28:34 1786 [Note] WSREP: gcomm: joining thread
2017-01-17T09:28:34.480798+00:00 xm1adb05 mysqld: 2017-01-17 09:28:34 1786 [Note] WSREP: gcomm: closing backend
2017-01-17T09:28:35.481116+00:00 xm1adb05 mysqld: 2017-01-17 09:28:35 1786 [Note] WSREP: view(view_id(NON_PRIM,5ef3f45a,5) memb {
2017-01-17T09:28:35.481122+00:00 xm1adb05 mysqld: #011c359ce6e,0
2017-01-17T09:28:35.481124+00:00 xm1adb05 mysqld: } joined {
2017-01-17T09:28:35.481126+00:00 xm1adb05 mysqld: } left {
2017-01-17T09:28:35.481128+00:00 xm1adb05 mysqld: } partitioned {
2017-01-17T09:28:35.481129+00:00 xm1adb05 mysqld: #0115ef3f45a,0
2017-01-17T09:28:35.481131+00:00 xm1adb05 mysqld: #011affb7638,0
2017-01-17T09:28:35.481133+00:00 xm1adb05 mysqld: })
2017-01-17T09:28:35.481136+00:00 xm1adb05 mysqld: 2017-01-17 09:28:35 1786 [Note] WSREP: view((empty))
2017-01-17T09:28:35.481287+00:00 xm1adb05 mysqld: 2017-01-17 09:28:35 1786 [Note] WSREP: gcomm: closed
2017-01-17T09:28:35.481401+00:00 xm1adb05 mysqld: 2017-01-17 09:28:35 1786 [Warning] WSREP: 0x1a83e28 down context(s) not set
2017-01-17T09:28:35.481405+00:00 xm1adb05 mysqld: 2017-01-17 09:28:35 1786 [Warning] WSREP: gcs_caused() returned -107 (Transport endpoint is not connected) …
2017-01-17T09:28:35.481675+00:00 xm1adb05 mysqld: 2017-01-17 09:28:35 1786 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2017-01-17T09:28:35.481679+00:00 xm1adb05 mysqld: 2017-01-17 09:28:35 1786 [Note] WSREP: Flow-control interval: [16, 16]
2017-01-17T09:28:35.481681+00:00 xm1adb05 mysqld: 2017-01-17 09:28:35 1786 [Note] WSREP: Received NON-PRIMARY.
2017-01-17T09:28:35.481683+00:00 xm1adb05 mysqld: 2017-01-17 09:28:35 1786 [Note] WSREP: Shifting SYNCED → OPEN (TO: 21578565)
2017-01-17T09:28:35.481685+00:00 xm1adb05 mysqld: 2017-01-17 09:28:35 1786 [Warning] WSREP: gcs_caused() returned -107 (Transport endpoint is not connected)
2017-01-17T09:28:35.481687+00:00 xm1adb05 mysqld: 2017-01-17 09:28:35 1786 [Note] WSREP: Received self-leave message.
2017-01-17T09:28:35.481689+00:00 xm1adb05 mysqld: 2017-01-17 09:28:35 1786 [Note] WSREP: Flow-control interval: [0, 0]
2017-01-17T09:28:35.481690+00:00 xm1adb05 mysqld: 2017-01-17 09:28:35 1786 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2017-01-17T09:28:35.481692+00:00 xm1adb05 mysqld: 2017-01-17 09:28:35 1786 [Note] WSREP: Shifting OPEN → CLOSED (TO: 21578565)
2017-01-17T09:28:35.481694+00:00 xm1adb05 mysqld: 2017-01-17 09:28:35 1786 [Note] WSREP: RECV thread exiting 0: Success
2017-01-17T09:28:35.481696+00:00 xm1adb05 mysqld: 2017-01-17 09:28:35 1786 [Warning] WSREP: gcs_caused() returned -103 (Software caused connection abort)
2017-01-17T09:28:35.481727+00:00 xm1adb05 mysqld: 2017-01-17 09:28:35 1786 [Warning] WSREP: gcs_caused() returned -103 (Software caused connection abort)
2017-01-17T09:28:35.481942+00:00 xm1adb05 mysqld: 2017-01-17 09:28:35 1786 [Warning] WSREP: gcs_caused() returned -103 (Software caused connection abort)
2017-01-17T09:28:35.482717+00:00 xm1adb05 mysqld: 2017-01-17 09:28:35 1786 [Note] WSREP: recv_thread() joined.
2017-01-17T09:28:35.482721+00:00 xm1adb05 mysqld: 2017-01-17 09:28:35 1786 [Note] WSREP: Closing replication queue.
2017-01-17T09:28:35.482724+00:00 xm1adb05 mysqld: 2017-01-17 09:28:35 1786 [Note] WSREP: Closing slave action queue.
2017-01-17T09:28:35.482726+00:00 xm1adb05 mysqld: 2017-01-17 09:28:35 1786 [Note] WSREP: /usr/sbin/mysqld: Terminated.
2017-01-17T09:28:35.706513+00:00 xm1adb05 mysqld_safe: 170117 09:28:35 mysqld_safe Number of processes running now: 0
2017-01-17T09:28:35.706801+00:00 xm1adb05 mysqld_safe: Number of processes running now: 0
2017-01-17T09:28:35.707944+00:00 xm1adb05 mysqld_safe: 170117 09:28:35 mysqld_safe WSREP: not restarting wsrep node automatically
2017-01-17T09:28:35.708403+00:00 xm1adb05 mysqld_safe: WSREP: not restarting wsrep node automatically
2017-01-17T09:28:35.709663+00:00 xm1adb05 mysqld_safe: 170117 09:28:35 mysqld_safe mysqld from pid file /run/mysqld/mysql.pid ended
2017-01-17T09:28:35.709974+00:00 xm1adb05 mysqld_safe: mysqld from pid file /run/mysqld/mysql.pid ended
2017-01-17T09:28:35.727648+00:00 xm1adb05 mysql-systemd: /usr/bin/mysql-systemd: line 283: kill: (1786) - No such process
2017-01-17T09:28:35.727781+00:00 xm1adb05 mysql-systemd: ERROR! Failed to stop the server with PID: 1786 with 1 status
2017-01-17T09:28:35.744431+00:00 xm1adb05 mysql-systemd: /usr/bin/mysql-systemd: line 226: kill: (1786) - No such process
2017-01-17T09:28:35.744542+00:00 xm1adb05 mysql-systemd: WARNING: mysql already dead
2017-01-17T09:28:35.744584+00:00 xm1adb05 mysql-systemd: ERROR! Stale PID file: /run/mysqld/mysql.pid
2017-01-18T00:49:17.673700+00:00 xm1adb05 mysql-systemd: /usr/bin/mysql-systemd: line 102: kill: (1786) - No such process

I assume that this is the log for the initial failure. Can you upload the logs for the SST/IST failure?

(this is a guess)
From the logs above, the applier hit a foreign key constraint violation. It may be that Galera applied the transactions out of order, which would lead to this failure (and might explain why a full cluster reboot works where an IST does not). But I would expect all the read nodes to fail rather than just one of them.
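One thing worth checking on the surviving nodes (just a suggestion) is how much out-of-order parallel applying is actually happening; the wsrep counters below are read-only and safe to query:

    # High oooe/oool values mean many write-sets are applied out of commit order,
    # which is where a parallel-apply race would show up
    mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN
              ('wsrep_apply_oooe','wsrep_apply_oool','wsrep_cert_deps_distance');
              SHOW GLOBAL VARIABLES LIKE 'wsrep_slave_threads';"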

As a workaround, rather than restarting the whole cluster, it may help to force a full SST instead of an SST/IST on the failed node: on that node, delete grastate.dat from the data directory and then let it rejoin the cluster.
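Roughly (a sketch, assuming the default datadir of /var/lib/mysql):

    # On the failed node only: discard the saved Galera state so the next
    # start requests a full SST from a donor instead of attempting an IST
    systemctl stop mysql
    rm /var/lib/mysql/grastate.dat
    systemctl start mysql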

Deleting the grastate.dat and restarting the failed node did not help; the only way to get it back into the cluster was to shut the cluster down completely, restart the undamaged nodes, then delete the grastate.dat and force a full SST on the damaged node.

The root of the problem seems to be running wsrep_slave_threads=16 (on a 4-vCPU host) while the application using the database was VERY write-heavy and the schema has foreign key constraints. The parallel apply resulted in a race condition and some changes were applied out of order, corrupting the recipient’s database and breaking things. Setting wsrep_slave_threads=1 prevented the problem from recurring (though it then caused replication bottlenecks…)
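For anyone hitting the same thing, the change itself was just this (my.cnf excerpt; wsrep_slave_threads is dynamic, so it can also be adjusted at runtime with SET GLOBAL before updating the config file):

    [mysqld]
    wsrep_slave_threads = 1      # was 16; serialises the applier at the cost of replication throughput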