Percona XtraDB Cluster 8.0.36 [Galera] Failed to apply write set: taking wsrep_ready to OFF

We have recently seen this type of error in our various Percona clusters on different nodes. These are the logs from the most recent occurrence. Essentially it sets wsrep_ready to OFF and then ejects the node from the cluster. Is this a known issue/bug with the software? Is anyone else encountering it?

2024-10-14T13:49:48.289577Z 23 [ERROR] [MY-000000] [Galera] Failed to apply write set: gtid: bf0fd310-ddc1-11ee-b08a-56f7c4b63281:672824661 server_id: d6636fa0-7543-11ef-98e6-333d82fcb296 client_id: 18446744073709551615 trx_id: 601552680 flags: 20 (rollback | pa_unsafe)
2024-10-14T13:49:48.291474Z 23 [Note] [MY-000000] [Galera] Closing send monitor…
2024-10-14T13:49:48.291489Z 23 [Note] [MY-000000] [Galera] Closed send monitor.
2024-10-14T13:49:48.292032Z 23 [Note] [MY-000000] [Galera] gcomm: terminating thread
2024-10-14T13:49:48.292529Z 23 [Note] [MY-000000] [Galera] gcomm: joining thread
2024-10-14T13:49:48.293518Z 23 [Note] [MY-000000] [Galera] gcomm: closing backend
2024-10-14T13:49:48.797958Z 23 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(NON_PRIM,013128a8-9978,126)
memb {
    d6636fa0-98e6,2
    }
joined {
    }
left {
    }
partitioned {
    013128a8-9978,0
    1b4b870c-bc73,2
    317889e9-b66a,1
    8aa87f00-8ba4,1
    }
)
2024-10-14T13:49:48.798030Z 23 [Note] [MY-000000] [Galera] PC protocol downgrade 1 → 0
2024-10-14T13:49:48.798041Z 23 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view ((empty))
2024-10-14T13:49:48.802080Z 23 [Note] [MY-000000] [Galera] gcomm: closed
2024-10-14T13:49:48.802191Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2024-10-14T13:49:48.802259Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [64, 64]
2024-10-14T13:49:48.802265Z 0 [Note] [MY-000000] [Galera] Received NON-PRIMARY.
2024-10-14T13:49:48.802269Z 0 [Note] [MY-000000] [Galera] Shifting SYNCED → OPEN (TO: 672824661)
2024-10-14T13:49:48.802285Z 0 [Note] [MY-000000] [Galera] New SELF-LEAVE.
2024-10-14T13:49:48.802363Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [0, 0]
2024-10-14T13:49:48.802376Z 0 [Note] [MY-000000] [Galera] Received SELF-LEAVE. Closing connection.
2024-10-14T13:49:48.802395Z 0 [Note] [MY-000000] [Galera] Shifting OPEN → CLOSED (TO: 672824661)
2024-10-14T13:49:48.802436Z 0 [Note] [MY-000000] [Galera] RECV thread exiting 0: Success
2024-10-14T13:49:48.802464Z 13 [Note] [MY-000000] [Galera] ================================================
View:
  id: bf0fd310-ddc1-11ee-b08a-56f7c4b63281:672824661
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 0
  members(1):
    0: d6636fa0-7543-11ef-98e6-333d82fcb296, node1.db3.cluster
=================================================
2024-10-14T13:49:48.802491Z 13 [Note] [MY-000000] [Galera] Non-primary view
2024-10-14T13:49:48.802498Z 13 [Note] [MY-000000] [WSREP] Server status change synced → connected
2024-10-14T13:49:48.802815Z 23 [Note] [MY-000000] [Galera] recv_thread() joined.
2024-10-14T13:49:48.802834Z 23 [Note] [MY-000000] [Galera] Closing replication queue.
2024-10-14T13:49:48.802841Z 23 [Note] [MY-000000] [Galera] Closing slave action queue.
2024-10-14T13:49:48.803461Z 13 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2024-10-14T13:49:48.804351Z 13 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2024-10-14T13:49:48.804394Z 13 [Note] [MY-000000] [Galera] ================================================
View:
  id: bf0fd310-ddc1-11ee-b08a-56f7c4b63281:672824661
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: yes
  own_index: -1
  members(0):
=================================================
2024-10-14T13:49:48.804401Z 13 [Note] [MY-000000] [Galera] Non-primary view
2024-10-14T13:49:48.804407Z 13 [Note] [MY-000000] [WSREP] Server status change connected → disconnected
2024-10-14T13:49:48.804411Z 13 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2024-10-14T13:49:48.804417Z 13 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2024-10-14T13:49:48.804427Z 13 [Note] [MY-000000] [Galera] Waiting 600 seconds for 16 receivers to finish
2024-10-14T13:49:48.812469Z 12 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812499Z 15 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812465Z 19 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812504Z 10 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812609Z 19 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 19
2024-10-14T13:49:48.812507Z 12 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 12
2024-10-14T13:49:48.812508Z 22 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812471Z 18 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812648Z 1 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812655Z 18 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 18
2024-10-14T13:49:48.812544Z 20 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812668Z 1 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 1
2024-10-14T13:49:48.812550Z 16 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812571Z 21 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812693Z 16 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 16
2024-10-14T13:49:48.812701Z 21 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 21
2024-10-14T13:49:48.812593Z 24 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812608Z 14 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812730Z 24 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 24
2024-10-14T13:49:48.812518Z 11 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812786Z 11 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 11
2024-10-14T13:49:48.812634Z 22 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 22
2024-10-14T13:49:48.812537Z 17 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812679Z 20 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 20
2024-10-14T13:49:48.812889Z 17 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 17
2024-10-14T13:49:48.812591Z 15 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 15
2024-10-14T13:49:48.812740Z 14 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 14
2024-10-14T13:49:48.812925Z 23 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812622Z 10 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 10
2024-10-14T13:49:48.812958Z 23 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 23
2024-10-14T13:49:48.815867Z 0 [Note] [MY-000000] [Galera] Service thread queue flushed.
2024-10-14T13:49:48.815902Z 13 [Note] [MY-000000] [Galera] ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:-1, protocol version: 6
2024-10-14T13:49:48.815915Z 13 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 0
2024-10-14T13:49:48.815923Z 13 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 0 thd: 13

“failed to apply writeset” usually means the data on that node is not consistent with the rest of the cluster. I would stop this node, erase its $datadir contents, then start it back up. This will force a fresh SST from another node that did successfully apply that transaction.
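
For anyone finding this later, a minimal sketch of that procedure, assuming a stock PXC install (systemd service `mysql`, datadir `/var/lib/mysql`); double-check the paths and service name for your own layout before deleting anything:

```bash
# Force a full SST on the inconsistent node (service name and datadir are
# assumptions for a default PXC install; verify them first).
sudo systemctl stop mysql

# Remove the node's local data so it cannot do a local recovery or IST.
sudo rm -rf /var/lib/mysql/*

# On startup the node joins the cluster with no state and requests a full SST
# from a donor that did apply the transaction successfully.
sudo systemctl start mysql

# Watch the SST and confirm the node ends up Synced.
sudo journalctl -fu mysql
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"
```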

Thank you Matthew, we did that yesterday on a node in our db2.cluster that had the issue. Today we have a node in db3.cluster (which hosts a different set of databases) that ran into the same problem, hence the logs above. I will have it SST after hours. I want to find a root cause for why this is happening; do you have any ideas on how to go about that?

We did note a similar issue with MariaDB, so we are not sure if it is an older issue in the code from before the MariaDB fork: MDEV-33509 Failed to apply write set with flags=(rollback|pa_unsafe) · MariaDB/server@e0c8165 · GitHub

Just to confirm, db2.cluster and db3.cluster are separate, independent 3-node clusters (6 servers in total)? Have you searched our bug tracker at https://jira.percona.com/ for a similar issue?

db2.cluster and db3.cluster are separate clusters, each with 4 data nodes plus an arbitrator (cluster size of 5 each). Both are on 8.0.36.

Yes, I searched through the Jira board last evening and again earlier this morning and could not find a good match for the issue. I do not have a crash dump (core file) from this incident, but I could certainly capture one the next time it happens if that would be useful.
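
For reference, this is roughly how we plan to enable core dumps on the data nodes so we can grab one next time; the service name and core path below are assumptions for our environment:

```bash
# Enable core dumps for mysqld (service name and paths are assumptions; adjust
# to your environment).

# 1. In my.cnf under [mysqld], add:
#      core-file

# 2. Lift the core size limit for the systemd unit:
sudo systemctl edit mysql   # add under [Service]:  LimitCORE=infinity

# 3. Let the kernel write cores from a setuid process (mysqld drops privileges)
#    to a known location.
sudo sysctl -w fs.suid_dumpable=2
sudo sysctl -w kernel.core_pattern=/var/crash/core.%e.%p.%t
sudo mkdir -p /var/crash

sudo systemctl restart mysql
```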

We are testing 8.0.37 on yet another cluster that is less critical. We have hammered it with MDL-heavy workloads and have seen MDL conflict entries in the logs that did not crash any nodes, so I believe the MDL issue is resolved. After a few more weeks of testing, we plan to upgrade one of the two primary production clusters to 8.0.37.
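
Not our exact test harness, but a minimal sketch of the kind of MDL conflict we provoke (the host and table names here are made up): an open transaction holding a metadata lock on one node while another node runs DDL, which replicates via TOI and has to break that lock.

```bash
# Session 1 (node A): open a transaction that takes and holds a shared metadata
# lock on the table for ~60 seconds.
mysql -h node1.db-test.cluster -e \
  "BEGIN; SELECT * FROM app.orders LIMIT 1; SELECT SLEEP(60);" &

# Session 2 (node B): run DDL on the same table. With the default TOI method it
# is applied on every node and must break the MDL held by session 1, so that
# transaction should be BF-aborted and an MDL conflict logged -- not a crash.
mysql -h node2.db-test.cluster -e \
  "ALTER TABLE app.orders ADD COLUMN stress_col INT;"
```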

Absolutely, that would be useful; a core dump is one of the best things you can provide to our developers.

Great! Upgrading to the latest is usually the best thing to do for bug fixes.