Restarting servers causes cluster to die with duplicate keys

I’ve noticed this a couple of times, but caught a good example today.

I wanted to apply patches to the 3 machines in my cluster. So I then stopped mysql. I watched the log to see that it terminated gracefully, then rebooted the machines and watched it re-enter the cluster. Then repeated on the 2 other machines. The restart was quick, so the node caught up with an IST.

A few hours later, I saw a termination with logs like this.

[HTML]2019-12-02T19:49:11.916587Z 19 [ERROR] Slave SQL: Could not execute Write_rows event on table iznik.users_emails; Duplicate entry ‘redacted’ for key ‘email’, Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event’s master log FIRST, end_log_pos 327, Error_code: 1062
2019-12-02T19:49:11.916607Z 19 [Warning] WSREP: RBR event 3 Write_rows apply warning: 121, 3114871165
2019-12-02T19:49:11.916707Z 19 [Note] WSREP: Applier statement rollback needed
2019-12-02T19:49:11.916749Z 19 [Warning] WSREP: Failed to apply app buffer: seqno: 3114871165, status: 1
at galera/src/trx_handle.cpp:apply():353
Retrying 2th time
2019-12-02T19:49:11.917039Z 19 [ERROR] Slave SQL: Could not execute Write_rows event on table iznik.users_emails; Duplicate entry ‘redacted’ for key ‘email’, Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event’s master log FIRST, end_log_pos 327, Error_code: 1062
2019-12-02T19:49:11.917054Z 19 [Warning] WSREP: RBR event 3 Write_rows apply warning: 121, 3114871165
2019-12-02T19:49:11.917175Z 19 [Note] WSREP: Applier statement rollback needed
2019-12-02T19:49:11.917222Z 19 [Warning] WSREP: Failed to apply app buffer: seqno: 3114871165, status: 1
at galera/src/trx_handle.cpp:apply():353
Retrying 3th time
2019-12-02T19:49:11.917488Z 19 [ERROR] Slave SQL: Could not execute Write_rows event on table iznik.users_emails; Duplicate entry ‘redacted’ for key ‘email’, Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event’s master log FIRST, end_log_pos 327, Error_code: 1062
2019-12-02T19:49:11.917503Z 19 [Warning] WSREP: RBR event 3 Write_rows apply warning: 121, 3114871165
2019-12-02T19:49:11.917611Z 19 [Note] WSREP: Applier statement rollback needed
2019-12-02T19:49:11.917656Z 19 [Warning] WSREP: Failed to apply app buffer: seqno: 3114871165, status: 1
at galera/src/trx_handle.cpp:apply():353
Retrying 4th time
2019-12-02T19:49:11.917843Z 19 [ERROR] Slave SQL: Could not execute Write_rows event on table iznik.users_emails; Duplicate entry ‘redacted’ for key ‘email’, Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event’s master log FIRST, end_log_pos 327, Error_code: 1062
2019-12-02T19:49:11.917857Z 19 [Warning] WSREP: RBR event 3 Write_rows apply warning: 121, 3114871165
2019-12-02T19:49:11.917940Z 19 [Note] WSREP: Applier statement rollback needed
2019-12-02T19:49:11.930674Z 0 [Note] WSREP: (67df1a46, ‘tcp://0.0.0.0:4567’) turning message relay requesting on, nonlive peers: tcp://89.16.163.155:4567
2019-12-02T19:49:11.941964Z 19 [ERROR] WSREP: Failed to apply trx: source: 69719331-151a-11ea-883e-5aafee641567 version: 4 local: 0 state: APPLYING flags: 1 conn_id: 1623989 trx_id: 21950552 seqnos (l: 352675, g: 3114871165, s: 3114871164, d: 3114871134, ts: 14780336992255)
2019-12-02T19:49:11.941989Z 19 [ERROR] WSREP: Failed to apply trx 3114871165 4 times
2019-12-02T19:49:11.941995Z 19 [ERROR] WSREP: Node consistency compromised, aborting…
[/HTML]

All three nodes terminated and I had to bootstrap the cluster again and force an SST on each of the other two nodes to ensure I had databases that were in sync.

Has anyone seen anything like that?

I came across this old reference to problems with multiple wsrep_slave_threads.

https://bugs.launchpad.net/codership-mysql/+bug/1280896

And this more recent one which suggests that it may not yet be safe to set wsrep_slave_threads > 1.

https://bugs.launchpad.net/ubuntu/+source/percona-xtradb-cluster-5.6/+bug/1823850