Percona XtraDB Cluster 8.0.36 [Galera] Failed to apply write set: taking wsrep_ready to OFF

Scott_Hooper · October 14, 2024, 2:42pm

We have recently seen this type of error on different nodes in several of our Percona clusters. These are the logs from the most recent one. Essentially it sets wsrep_ready to OFF and then ejects the node from the cluster. Is this a known issue or bug in the software? Is anyone else encountering it?

2024-10-14T13:49:48.289577Z 23 [ERROR] [MY-000000] [Galera] Failed to apply write set: gtid: bf0fd310-ddc1-11ee-b08a-56f7c4b63281:672824661 server_id: d6636fa0-7543-11ef-98e6-333d82fcb296 client_id: 18446744073709551615 trx_id: 601552680 flags: 20 (rollback | pa_unsafe)
2024-10-14T13:49:48.291474Z 23 [Note] [MY-000000] [Galera] Closing send monitor...
2024-10-14T13:49:48.291489Z 23 [Note] [MY-000000] [Galera] Closed send monitor.
2024-10-14T13:49:48.292032Z 23 [Note] [MY-000000] [Galera] gcomm: terminating thread
2024-10-14T13:49:48.292529Z 23 [Note] [MY-000000] [Galera] gcomm: joining thread
2024-10-14T13:49:48.293518Z 23 [Note] [MY-000000] [Galera] gcomm: closing backend
2024-10-14T13:49:48.797958Z 23 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(NON_PRIM,013128a8-9978,126)
memb {
    d6636fa0-98e6,2
    }
joined {
    }
left {
    }
partitioned {
    013128a8-9978,0
    1b4b870c-bc73,2
    317889e9-b66a,1
    8aa87f00-8ba4,1
    }
)
2024-10-14T13:49:48.798030Z 23 [Note] [MY-000000] [Galera] PC protocol downgrade 1 -> 0
2024-10-14T13:49:48.798041Z 23 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view ((empty))
2024-10-14T13:49:48.802080Z 23 [Note] [MY-000000] [Galera] gcomm: closed
2024-10-14T13:49:48.802191Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2024-10-14T13:49:48.802259Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [64, 64]
2024-10-14T13:49:48.802265Z 0 [Note] [MY-000000] [Galera] Received NON-PRIMARY.
2024-10-14T13:49:48.802269Z 0 [Note] [MY-000000] [Galera] Shifting SYNCED -> OPEN (TO: 672824661)
2024-10-14T13:49:48.802285Z 0 [Note] [MY-000000] [Galera] New SELF-LEAVE.
2024-10-14T13:49:48.802363Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [0, 0]
2024-10-14T13:49:48.802376Z 0 [Note] [MY-000000] [Galera] Received SELF-LEAVE. Closing connection.
2024-10-14T13:49:48.802395Z 0 [Note] [MY-000000] [Galera] Shifting OPEN -> CLOSED (TO: 672824661)
2024-10-14T13:49:48.802436Z 0 [Note] [MY-000000] [Galera] RECV thread exiting 0: Success
2024-10-14T13:49:48.802464Z 13 [Note] [MY-000000] [Galera] ================================================
View:
  id: bf0fd310-ddc1-11ee-b08a-56f7c4b63281:672824661
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 0
  members(1):
    0: d6636fa0-7543-11ef-98e6-333d82fcb296, node1.db3.cluster
=================================================
2024-10-14T13:49:48.802491Z 13 [Note] [MY-000000] [Galera] Non-primary view
2024-10-14T13:49:48.802498Z 13 [Note] [MY-000000] [WSREP] Server status change synced -> connected
2024-10-14T13:49:48.802815Z 23 [Note] [MY-000000] [Galera] recv_thread() joined.
2024-10-14T13:49:48.802834Z 23 [Note] [MY-000000] [Galera] Closing replication queue.
2024-10-14T13:49:48.802841Z 23 [Note] [MY-000000] [Galera] Closing slave action queue.
2024-10-14T13:49:48.803461Z 13 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2024-10-14T13:49:48.804351Z 13 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2024-10-14T13:49:48.804394Z 13 [Note] [MY-000000] [Galera] ================================================
View:
  id: bf0fd310-ddc1-11ee-b08a-56f7c4b63281:672824661
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: yes
  own_index: -1
  members(0):
=================================================
2024-10-14T13:49:48.804401Z 13 [Note] [MY-000000] [Galera] Non-primary view
2024-10-14T13:49:48.804407Z 13 [Note] [MY-000000] [WSREP] Server status change connected -> disconnected
2024-10-14T13:49:48.804411Z 13 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2024-10-14T13:49:48.804417Z 13 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2024-10-14T13:49:48.804427Z 13 [Note] [MY-000000] [Galera] Waiting 600 seconds for 16 receivers to finish
2024-10-14T13:49:48.812469Z 12 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812499Z 15 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812465Z 19 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812504Z 10 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812609Z 19 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 19
2024-10-14T13:49:48.812507Z 12 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 12
2024-10-14T13:49:48.812508Z 22 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812471Z 18 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812648Z 1 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812655Z 18 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 18
2024-10-14T13:49:48.812544Z 20 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812668Z 1 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 1
2024-10-14T13:49:48.812550Z 16 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812571Z 21 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812693Z 16 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 16
2024-10-14T13:49:48.812701Z 21 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 21
2024-10-14T13:49:48.812593Z 24 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812608Z 14 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812730Z 24 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 24
2024-10-14T13:49:48.812518Z 11 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812786Z 11 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 11
2024-10-14T13:49:48.812634Z 22 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 22
2024-10-14T13:49:48.812537Z 17 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812679Z 20 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 20
2024-10-14T13:49:48.812889Z 17 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 17
2024-10-14T13:49:48.812591Z 15 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 15
2024-10-14T13:49:48.812740Z 14 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 14
2024-10-14T13:49:48.812925Z 23 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 6
2024-10-14T13:49:48.812622Z 10 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 10
2024-10-14T13:49:48.812958Z 23 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 23
2024-10-14T13:49:48.815867Z 0 [Note] [MY-000000] [Galera] Service thread queue flushed.
2024-10-14T13:49:48.815902Z 13 [Note] [MY-000000] [Galera] ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:-1, protocol version: 6
2024-10-14T13:49:48.815915Z 13 [Note] [MY-000000] [Galera] Slave thread exit. Return code: 0
2024-10-14T13:49:48.815923Z 13 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 0 thd: 13

matthewb · October 14, 2024, 3:28pm

“failed to apply writeset” usually means the data on that node is not consistent with the rest of the cluster. I would stop this node, erase its $datadir contents, then start it back up. This will force a fresh SST from another node that did successfully apply that transaction.
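Before wiping the node, it can help to record exactly which write set failed, so the same sequence number can be looked up on the nodes that applied it. The following is only a minimal sketch in Python: the function name `parse_failed_apply` and the dictionary keys are made up for illustration, the pattern is written against the single log line posted above, and the first part of the gtid is assumed to be the cluster state UUID.

```python
import re

# Pattern written against the one "Failed to apply write set" line in this
# thread; other Galera versions may format the line differently (assumption).
FAILED_APPLY = re.compile(
    r"Failed to apply write set: "
    r"gtid: (?P<uuid>[0-9a-f-]+):(?P<seqno>\d+) "
    r"server_id: (?P<server_id>[0-9a-f-]+) "
    r"client_id: (?P<client_id>\d+) "
    r"trx_id: (?P<trx_id>\d+) "
    r"flags: (?P<flag_bits>\d+) \((?P<flag_names>[^)]*)\)"
)


def parse_failed_apply(line):
    """Return the write-set fields from a failed-apply log line, or None."""
    m = FAILED_APPLY.search(line)
    if m is None:
        return None
    return {
        "state_uuid": m["uuid"],        # assumed to be the cluster state UUID
        "seqno": int(m["seqno"]),       # sequence number of the failed write set
        "server_id": m["server_id"],
        "trx_id": int(m["trx_id"]),
        "flags": [name.strip() for name in m["flag_names"].split("|")],
    }


# The error line from the log above, split only to keep the source readable.
sample = (
    "2024-10-14T13:49:48.289577Z 23 [ERROR] [MY-000000] [Galera] "
    "Failed to apply write set: "
    "gtid: bf0fd310-ddc1-11ee-b08a-56f7c4b63281:672824661 "
    "server_id: d6636fa0-7543-11ef-98e6-333d82fcb296 "
    "client_id: 18446744073709551615 trx_id: 601552680 "
    "flags: 20 (rollback | pa_unsafe)"
)
info = parse_failed_apply(sample)
print(info)
```

In the log above, the parsed server_id is the same UUID that the view lists for node1.db3.cluster, which may be worth noting in a bug report.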

Scott_Hooper · October 14, 2024, 4:05pm

Thank you, Matthew. We did that yesterday on a node in our db2.cluster that had the issue. Today a node in db3.cluster (which hosts a different set of databases) ran into the same problem, hence the logs above. I will have it SST after hours. I want to find the root cause of why this is happening. Do you have any ideas on how to go about that?

We did note a similar issue reported against MariaDB, so we are not sure whether this is an older issue in code that predates the MariaDB fork. MDEV-33509 Failed to apply write set with flags=(rollback|pa_unsafe) · MariaDB/server@e0c8165 · GitHub

matthewb · October 14, 2024, 5:27pm

Just to confirm, db2.cluster and db3.cluster are separate, independent 3-node clusters? (6 servers in total) Have you searched through our https://jira.percona.com/ for a similar issue?

Scott_Hooper · October 14, 2024, 7:17pm

db2.cluster and db3.cluster are separate clusters, each with 4 data nodes and an arbitrator (cluster size of 5 each). Both run 8.0.36.

Yes, I searched through the Jira board last evening and earlier this morning and could not find a good match for the issue. I do not have a memory crash dump, but I could certainly get one when it happens again if it would be useful.

We are testing 8.0.37 on yet another cluster, a less critical one. We have slammed it with MDL and have seen some entries in the logs for MDL conflicts that did not crash nodes, so I believe the MDL issue is resolved. After a few more weeks of testing, we plan to upgrade one of the two primary production clusters to 8.0.37.

matthewb · October 14, 2024, 8:08pm

Absolutely useful; one of the best things you can provide to our developers.

Great! Upgrading to the latest is usually the best thing to do for bug fixes.

Scott_Hooper · May 6, 2025, 11:52am

It’s been a while since the last issue. We are currently on 8.0.39 and working to upgrade to 8.0.41, which is still being tested on our development and testing clusters. That said, we have had two more instances of this specific issue: one on 4/29 and one on 5/5. The setup is the same as described above, and both happened on db2.cluster, though on a newer Percona version, i.e. 8.0.39. So it has been about 7 months, but the concern and risk around this behavior remain. Any thoughts?

matthewb · May 6, 2025, 4:39pm

What SQL is running when this happens?
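One way to start narrowing that down is to turn each failure in the error log into a short time window, then look at the slow query log or application logs for the same period. This is a rough sketch in Python: `failure_windows` and the 5-second default are hypothetical choices for illustration, and the timestamp format is assumed to match the UTC format of the log lines posted earlier in this thread.

```python
from datetime import datetime, timedelta, timezone

MARKER = "Failed to apply write set"
# Timestamp layout as seen in the log posted in this thread (UTC, "Z" suffix).
STAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def failure_windows(lines, before_seconds=5):
    """Return (start, end) UTC datetimes, one pair per failed-apply log line.

    `end` is the moment the error was logged; `start` is `before_seconds`
    earlier. The window size is an arbitrary starting point, not a rule.
    """
    windows = []
    for line in lines:
        if MARKER not in line:
            continue
        stamp = line.split(" ", 1)[0]  # first field of the line is the timestamp
        end = datetime.strptime(stamp, STAMP_FORMAT).replace(tzinfo=timezone.utc)
        windows.append((end - timedelta(seconds=before_seconds), end))
    return windows


log_lines = [
    "2024-10-14T13:49:48.289577Z 23 [ERROR] [MY-000000] [Galera] Failed to apply write set: gtid: ...",
    "2024-10-14T13:49:48.291489Z 23 [Note] [MY-000000] [Galera] Closed send monitor.",
]
windows = failure_windows(log_lines)
for start, end in windows:
    print(start.isoformat(), "to", end.isoformat())
```

Running this over the full error log of each affected node would give one window per incident, which can then be compared across the incidents to see whether the same statements or jobs show up each time.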
