Cluster crash without auto recovery after SST transfer

Setup:

3-node Percona XtraDB Cluster (PXC) on Kubernetes with the following Percona images:
Operator: 1.19.0
Cluster: 8.0.42-33.1
XtraBackup: 8.0.35-34.1

I performed the following test:

  • Bring down pxc-2 long enough that an SST is triggered once it’s back (DB size is around 2 TB).
  • Once pxc-2 started receiving the SST from pxc-1, I noticed that wsrep_local_recv_queue on pxc-1 kept growing without flow control kicking in; it went up to 111K (see the monitoring sketch after this list).
  • Then I killed pxc-0 and cordoned its node, because I wanted to test the following:
    • Whether the cluster accepts writes when it has only 2 nodes, one in Donor and one in Joiner state.
    • If writes are accepted, what happens if I bring back pxc-0 and kill pxc-1, since both pxc-0 and pxc-2 would then have incomplete data.
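
For reference, this is roughly how the receive queue and flow-control counters can be watched on the donor while the SST runs. A minimal sketch; MYSQL_ROOT_PASSWORD is a placeholder for however you supply credentials, not something the image guarantees:

# Poll the donor's Galera queue and flow-control status every 5s.
# All four are standard Galera status variables.
while true; do
    kubectl -n percona-operator exec mysqlcluster-pxc-1 -c pxc -- \
        mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e \
        "SHOW GLOBAL STATUS WHERE Variable_name IN
         ('wsrep_local_recv_queue',
          'wsrep_flow_control_paused',
          'wsrep_flow_control_sent',
          'wsrep_local_state_comment');"
    sleep 5
done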

But after the SST started, all writes on pxc-1 were stuck behind the large wsrep_local_recv_queue, so I did not bring pxc-0 back up and instead let the SST complete first. As soon as it completed, the cluster crashed, and auto recovery is not kicking in either; the operator logs show:
2026-05-06T11:27:56.744Z INFO reconcile replication error {"controller": "pxc-controller", "controllerGroup": "pxc.percona.com", "controllerKind": "PerconaXtraDBCluster", "PerconaXtraDBCluster": {"name":"mysqlcluster","namespace":"percona-operator"}, "namespace": "percona-operator", "name": "mysqlcluster", "reconcileID": "08f72492-4a3e-48a1-8368-e3678cd4d698", "err": "get primary pxc pod: failed to get proxy connection: invalid connection"}

pxc-2 (Joiner) logs:

2026-05-06T11:21:05.594417Z 2 [Warning] [MY-000000] [Galera] certification interval 111989 for trx source: 0c66cd6f-4926-11f1-b315-633843a22e4a version: 6 local: 0 flags: 65 conn_id: 257587 trx_id: 1260641 tstamp: 1189661466920572; state:  seqnos (l: 165440, g: 204941, s: 92952, d: 27417) WS pa_range: 65535; state history: REPLICATING:0->CERTIFYING:3540 exceeds the limit of 16384
2026-05-06T11:21:05.594437Z 14 [Warning] [MY-000000] [Galera] certification interval 111990 for trx source: 0c66cd6f-4926-11f1-b315-633843a22e4a version: 6 local: 0 flags: 65 conn_id: 269089 trx_id: 1291988 tstamp: 1189785710198300; state:  seqnos (l: 165441, g: 204942, s: 92952, d: 27417) WS pa_range: 65535; state history: REPLICATING:0->CERTIFYING:3540 exceeds the limit of 16384
2026-05-06T11:21:05.594459Z 15 [Warning] [MY-000000] [Galera] certification interval 111991 for trx source: 0c66cd6f-4926-11f1-b315-633843a22e4a version: 6 local: 0 flags: 65 conn_id: 269090 trx_id: 1291989 tstamp: 1189785710854306; state:  seqnos (l: 165442, g: 204943, s: 92952, d: 27417) WS pa_range: 65535; state history: REPLICATING:0->CERTIFYING:3540 exceeds the limit of 16384
2026-05-06T11:21:05.594505Z 16 [Note] [MY-000000] [Galera] Server mysqlcluster-pxc-2 synced with group
2026-05-06T11:21:05.594516Z 16 [Note] [MY-000000] [WSREP] Server status change joined -> synced
2026-05-06T11:21:05.594521Z 16 [Note] [MY-000000] [WSREP] Synchronized with group, ready for connections
2026-05-06T11:25:19.542053Z 0 [Note] [MY-000000] [Galera] (e1f44f59-a6e2, 'ssl://0.0.0.0:4567') connection to peer 0c66cd6f-b315 with addr ssl://100.100.23.171:4567 timed out, no messages seen in PT3S, socket stats: rtt: 7966 rttvar: 13293 rto: 208000 lost: 0 last_data_recv: 3003 cwnd: 10 last_queued_since: 3411068 last_delivered_since: 3003304157 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0 (gmcast.peer_timeout)
2026-05-06T11:25:19.542201Z 0 [Note] [MY-000000] [Galera] (e1f44f59-a6e2, 'ssl://0.0.0.0:4567') turning message relay requesting on, nonlive peers: ssl://100.100.23.171:4567
2026-05-06T11:25:21.042107Z 0 [Note] [MY-000000] [Galera] (e1f44f59-a6e2, 'ssl://0.0.0.0:4567') reconnecting to 0c66cd6f-b315 (ssl://100.100.23.171:4567), attempt 0
2026-05-06T11:25:21.661167Z 0 [Note] [MY-000000] [Galera] declaring node with index 0 suspected, timeout PT5S (evs.suspect_timeout)
2026-05-06T11:25:21.661208Z 0 [Note] [MY-000000] [Galera] evs::proto(e1f44f59-a6e2, OPERATIONAL, view_id(REG,0c66cd6f-b315,4)) suspecting node: 0c66cd6f-b315
2026-05-06T11:25:21.661217Z 0 [Note] [MY-000000] [Galera] evs::proto(e1f44f59-a6e2, OPERATIONAL, view_id(REG,0c66cd6f-b315,4)) suspected node without join message, declaring inactive
2026-05-06T11:25:22.161296Z 0 [Note] [MY-000000] [Galera] declaring node with index 0 inactive (evs.inactive_timeout)
2026-05-06T11:25:22.661342Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(NON_PRIM,0c66cd6f-b315,4)
memb {
	e1f44f59-a6e2,0
	}
joined {
	}
left {
	}
partitioned {
	0c66cd6f-b315,0
	}
)
2026-05-06T11:25:22.661419Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(NON_PRIM,e1f44f59-a6e2,5)
memb {
	e1f44f59-a6e2,0
	}
joined {
	}
left {
	}
partitioned {
	0c66cd6f-b315,0
	}
)
2026-05-06T11:25:22.661425Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2026-05-06T11:25:22.661513Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [100, 100]
2026-05-06T11:25:22.661522Z 0 [Note] [MY-000000] [Galera] Received NON-PRIMARY.
2026-05-06T11:25:22.661529Z 0 [Note] [MY-000000] [Galera] Shifting SYNCED -> OPEN (TO: 205084)
2026-05-06T11:25:22.661566Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2026-05-06T11:25:22.661587Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [100, 100]
2026-05-06T11:25:22.661594Z 0 [Note] [MY-000000] [Galera] Received NON-PRIMARY.
2026-05-06T11:25:22.661628Z 13 [Note] [MY-000000] [Galera] Maybe drain monitors from 205084 upto current CC event 205084 upto:205084
2026-05-06T11:25:22.661647Z 13 [Note] [MY-000000] [Galera] Drain monitors from 205084 up to 205084
2026-05-06T11:25:22.661669Z 13 [Note] [MY-000000] [Galera] ================================================
View:
  id: 0c66f78d-4926-11f1-97e4-b247b7ef5d59:205084
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 0
  members(1):
	0: e1f44f59-4930-11f1-a6e2-bb0b6dcabbfb, mysqlcluster-pxc-2
=================================================
2026-05-06T11:25:22.661679Z 13 [Note] [MY-000000] [Galera] Non-primary view
2026-05-06T11:25:22.661688Z 13 [Note] [MY-000000] [WSREP] Server status change synced -> connected
2026-05-06T11:25:22.795792Z 13 [Note] [MY-000000] [Galera] Maybe drain monitors from 205084 upto current CC event 205084 upto:205084
2026-05-06T11:25:22.795822Z 13 [Note] [MY-000000] [Galera] Drain monitors from 205084 up to 205084
2026-05-06T11:25:22.795843Z 13 [Note] [MY-000000] [Galera] ================================================
View:
  id: 0c66f78d-4926-11f1-97e4-b247b7ef5d59:205084
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 0
  members(1):
	0: e1f44f59-4930-11f1-a6e2-bb0b6dcabbfb, mysqlcluster-pxc-2
=================================================
2026-05-06T11:25:22.795851Z 13 [Note] [MY-000000] [Galera] Non-primary view
2026-05-06T11:25:22.795859Z 13 [Note] [MY-000000] [WSREP] Server status change connected -> connected
2026-05-06T11:25:24.042361Z 0 [Note] [MY-000000] [Galera] (e1f44f59-a6e2, 'ssl://0.0.0.0:4567') connection to peer 00000000-0000 with addr ssl://100.100.23.171:4567 timed out, no messages seen in PT3S, socket stats: rtt: 124 rttvar: 68 rto: 201000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 1190281576854190 last_delivered_since: 1190281576854190 send_queue_length: 0 send_queue_bytes: 0 (gmcast.peer_timeout)
2026-05-06T11:25:24.042605Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: Operation aborted.
2026-05-06T11:25:28.542631Z 0 [Note] [MY-000000] [Galera] (e1f44f59-a6e2, 'ssl://0.0.0.0:4567') connection to peer 00000000-0000 with addr ssl://100.100.23.171:4567 timed out, no messages seen in PT3S, socket stats: rtt: 110 rttvar: 57 rto: 201000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 1190286077115466 last_delivered_since: 1190286077115466 send_queue_length: 0 send_queue_bytes: 0 (gmcast.peer_timeout)
2026-05-06T11:25:28.542761Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: Operation aborted.
2026-05-06T11:25:31.556842Z 0 [Note] [MY-000000] [Galera] (e1f44f59-a6e2, 'ssl://0.0.0.0:4567') connection established to 0c66cd6f-b315 ssl://100.100.23.171:4567
2026-05-06T11:25:33.544083Z 0 [Note] [MY-000000] [Galera] (e1f44f59-a6e2, 'ssl://0.0.0.0:4567') reconnecting to 0c66cd6f-b315 (ssl://100.100.23.171:4567), attempt 0
2026-05-06T11:25:33.544410Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: Connection refused
2026-05-06T11:25:35.044377Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: Connection refused
2026-05-06T11:25:36.544497Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: Connection refused

pxc-1 (Donor) logs:

2026-05-06T11:26:13.312508Z 0 [Note] [MY-010733] [Server] Shutting down plugin 'caching_sha2_password'
2026-05-06T11:26:13.312514Z 0 [Note] [MY-010733] [Server] Shutting down plugin 'sha256_password'
2026-05-06T11:26:13.312518Z 0 [Note] [MY-010733] [Server] Shutting down plugin 'mysql_native_password'
2026-05-06T11:26:13.312521Z 0 [Note] [MY-010733] [Server] Shutting down plugin 'daemon_keyring_proxy_plugin'
2026-05-06T11:26:13.312534Z 0 [Note] [MY-010733] [Server] Shutting down plugin 'keyring_vault'
2026-05-06T11:26:13.313637Z 0 [Note] [MY-010733] [Server] Shutting down plugin 'binlog'
2026-05-06T11:26:13.314130Z 0 [System] [MY-010910] [Server] /usr/sbin/mysqld: Shutdown complete (mysqld 8.0.42-33.1)  Percona XtraDB Cluster (GPL), Release rel33, Revision 6673f8e, WSREP version 26.1.4.3.
2026-05-06T11:26:10.877534Z 0 [Note] [MY-000000] [WSREP] Recovered position: 0c66f78d-4926-11f1-97e4-b247b7ef5d59:92952
+ [[ -z '' ]]
+ [[ -f /var/lib/mysql//grastate.dat ]]
+ [[ 0 != 1 ]]
+ trap '{ node_recovery "$@" ; }' USR1
+ touch /tmp/recovery-case
+ [[ -z 92952 ]]
+ set +o xtrace
#####################################################FULL_PXC_CLUSTER_CRASH:mysqlcluster-pxc-1.mysqlcluster-pxc.percona-operator.svc.cluster.local#####################################################
You have the situation of a full PXC cluster crash. In order to restore your PXC cluster, please check the log
from all pods/nodes to find the node with the most recent data (the one with the highest sequence number (seqno).
It is mysqlcluster-pxc-1.mysqlcluster-pxc.percona-operator.svc.cluster.local node with sequence number (seqno): 92952
Cluster will recover automatically from the crash now.
If you have set spec.pxc.autoRecovery to false, run the following command to recover manually from this node:
kubectl -n percona-operator exec mysqlcluster-pxc-1 -c pxc -- sh -c 'kill -s USR1 1'
#####################################################LAST_LINE:mysqlcluster-pxc-1.mysqlcluster-pxc.percona-operator.svc.cluster.local:92952:#####################################################

pxc-2 logs after container restart:

Skipping wsrep-recover for 0c66f78d-4926-11f1-97e4-b247b7ef5d59:205084 pair
Assigning 0c66f78d-4926-11f1-97e4-b247b7ef5d59:205084 to wsrep_start_position
+ [[ -z '' ]]
+ [[ -f /var/lib/mysql//grastate.dat ]]
+ [[ 0 != 1 ]]
+ trap '{ node_recovery "$@" ; }' USR1
+ touch /tmp/recovery-case
+ [[ -z 205084 ]]
+ set +o xtrace
#####################################################FULL_PXC_CLUSTER_CRASH:mysqlcluster-pxc-2.mysqlcluster-pxc.percona-operator.svc.cluster.local#####################################################
You have the situation of a full PXC cluster crash. In order to restore your PXC cluster, please check the log
from all pods/nodes to find the node with the most recent data (the one with the highest sequence number (seqno).
It is mysqlcluster-pxc-2.mysqlcluster-pxc.percona-operator.svc.cluster.local node with sequence number (seqno): 205084
Cluster will recover automatically from the crash now.
If you have set spec.pxc.autoRecovery to false, run the following command to recover manually from this node:
kubectl -n percona-operator exec mysqlcluster-pxc-2 -c pxc -- sh -c 'kill -s USR1 1'
#####################################################LAST_LINE:mysqlcluster-pxc-2.mysqlcluster-pxc.percona-operator.svc.cluster.local:205084:#####################################################

Galera State:

sudo kubectl exec mysqlcluster-pxc-1 -n percona-operator -c pxc -- \
    cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    0c66f78d-4926-11f1-97e4-b247b7ef5d59
seqno:   -1
safe_to_bootstrap: 0

sudo kubectl exec mysqlcluster-pxc-2 -n percona-operator -c pxc -- \
    cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    0c66f78d-4926-11f1-97e4-b247b7ef5d59
seqno:   205084
safe_to_bootstrap: 0
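
For completeness, the same check can be scripted across all three pods. Note that seqno: -1 (as on pxc-1 above) means the node stopped uncleanly, so its real position is the "Recovered position" printed in its pod log (92952 here, from wsrep-recover). A rough sketch:

# Compare the saved Galera state across all pods; a seqno of -1 means
# an unclean shutdown, so check that pod's log for "Recovered position".
for i in 0 1 2; do
    echo "--- mysqlcluster-pxc-$i ---"
    kubectl -n percona-operator exec mysqlcluster-pxc-$i -c pxc -- \
        cat /var/lib/mysql/grastate.dat
done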

I wanted to ask:
1) Did the queue on pxc-1 grow so much because writes were paused during SST donation?
2) How did flow control not kick in if pxc-1 had such a long queue?
3) Why did the operator not trigger auto recovery?

@Yash_Daga For 1 and 2, I’ll raise this internally.

For 3, I’ll try to simulate your scenario to see if anything is wrong with the operator’s recovery logic.

Thanks a lot for such a quick response!
Attaching more operator logs as well:

2026-05-06T11:27:34.644Z	INFO	reconcile replication error	{"controller": "pxc-controller", "controllerGroup": "pxc.percona.com", "controllerKind": "PerconaXtraDBCluster", "PerconaXtraDBCluster": {"name":"mysqlcluster","namespace":"percona-operator"}, "namespace": "percona-operator", "name": "mysqlcluster", "reconcileID": "fb396b6e-94d7-48cf-b6c9-23611133530f", "err": "get primary pxc pod: failed to get proxy connection: invalid connection"}
[mysql] 2026/05/06 11:27:42 packets.go:58 unexpected EOF
[mysql] 2026/05/06 11:27:43 packets.go:58 unexpected EOF
[mysql] 2026/05/06 11:27:44 packets.go:58 unexpected EOF
[mysql] 2026/05/06 11:27:45 packets.go:58 unexpected EOF
2026-05-06T11:27:45.696Z	INFO	reconcile replication error	{"controller": "pxc-controller", "controllerGroup": "pxc.percona.com", "controllerKind": "PerconaXtraDBCluster", "PerconaXtraDBCluster": {"name":"mysqlcluster","namespace":"percona-operator"}, "namespace": "percona-operator", "name": "mysqlcluster", "reconcileID": "1dbcc8f1-f2da-4394-9d6d-c732953ed649", "err": "get primary pxc pod: failed to get proxy connection: invalid connection"}
[mysql] 2026/05/06 11:27:53 packets.go:58 unexpected EOF
[mysql] 2026/05/06 11:27:54 packets.go:58 unexpected EOF
[mysql] 2026/05/06 11:27:55 packets.go:58 unexpected EOF
[mysql] 2026/05/06 11:27:56 packets.go:58 unexpected EOF
2026-05-06T11:27:56.744Z	INFO	reconcile replication error	{"controller": "pxc-controller", "controllerGroup": "pxc.percona.com", "controllerKind": "PerconaXtraDBCluster", "PerconaXtraDBCluster": {"name":"mysqlcluster","namespace":"percona-operator"}, "namespace": "percona-operator", "name": "mysqlcluster", "reconcileID": "08f72492-4a3e-48a1-8368-e3678cd4d698", "err": "get primary pxc pod: failed to get proxy connection: invalid connection"}
2026-05-06T12:00:06.001Z	INFO	cluster is not ready	{"controller": "pxc-controller", "controllerGroup": "pxc.percona.com", "controllerKind": "PerconaXtraDBCluster", "PerconaXtraDBCluster": {"name":"mysqlcluster","namespace":"percona-operator"}, "namespace": "percona-operator", "name": "mysqlcluster", "reconcileID": "c75d7fe1-82d1-4c9a-a228-9fed8f904588"}
2026-05-06T13:00:06.000Z	INFO	cluster is not ready	{"controller": "pxc-controller", "controllerGroup": "pxc.percona.com", "controllerKind": "PerconaXtraDBCluster", "PerconaXtraDBCluster": {"name":"mysqlcluster","namespace":"percona-operator"}, "namespace": "percona-operator", "name": "mysqlcluster", "reconcileID": "c75d7fe1-82d1-4c9a-a228-9fed8f904588"}
2026-05-06T14:00:06.000Z	INFO	cluster is not ready	{"controller": "pxc-controller", "controllerGroup": "pxc.percona.com", "controllerKind": "PerconaXtraDBCluster", "PerconaXtraDBCluster": {"name":"mysqlcluster","namespace":"percona-operator"}, "namespace": "percona-operator", "name": "mysqlcluster", "reconcileID": "c75d7fe1-82d1-4c9a-a228-9fed8f904588"}

@Ege_Gunes I think I found the reason why recovery is not happening:

Here, if all pods are not running, the operator silently skips auto recovery; in my case I had deleted pxc-0 and cordoned its node, so that pod is not ready.

Additionally, even if that check had passed, I see the next check is hardcoded to pxc-0:

sWaiting, _, err := r.isPodWaitingForRecovery(cr.Namespace, cr.Name+"-pxc-0")

I see, yes, that’s probably the issue.

I don’t know if we can do anything about it though, because we need all pods to be running and reporting their seqno so we can select the pod with the highest sequence number to re-bootstrap the cluster. If only some pods are running, the operator won’t be able to do this safely, and the only remedy is to recover manually by running kill -s USR1 1 in the selected pod, as sketched below.
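
To spell out that manual path (a sketch based on the LAST_LINE markers the entrypoint prints, visible in the pod logs above; verify the seqnos yourself before signalling anything):

# 1) Find the highest seqno among the pods stuck in the recovery wait;
#    each one prints a LAST_LINE:<pod>:<seqno>: marker in its log.
for i in 0 1 2; do
    kubectl -n percona-operator logs mysqlcluster-pxc-$i -c pxc | grep LAST_LINE | tail -1
done

# 2) Signal PID 1 of the pod with the highest seqno (pxc-2 here, 205084)
#    so it re-bootstraps the cluster from its data.
kubectl -n percona-operator exec mysqlcluster-pxc-2 -c pxc -- sh -c 'kill -s USR1 1'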

@Ege_Gunes Understood, that makes sense. The only remaining issue is a cluster crash while a node has been down for a long time, but maybe it’s safer to do manual recovery in that case anyway.

In this case, after uncordoning pxc-0’s node, the operator successfully performed auto recovery.

It would be very helpful if you could help me understand 1 and 2, along with why the cluster crashed in the first place.
From the logs I see that pxc-1 died because the monitor user exceeded max_user_connections, and that pxc-2’s liveness probe failed because it became non-Primary, but I cannot tell which is the cause and which is the consequence.

sudo kubectl get events -n percona-operator \
    --field-selector involvedObject.name=mysqlcluster-pxc-1 \
    -o custom-columns=TS:.lastTimestamp,REASON:.reason,MSG:.message | \
    grep "max_user_connections"
2026-05-06T11:25:31Z   Unhealthy          Liveness probe failed: ERROR 1226 (42000): User 'monitor' has exceeded the 'max_user_connections' resource (current value: 100)
2026-05-06T11:25:31Z   Unhealthy          Readiness probe failed: ERROR 1226 (42000): User 'monitor' has exceeded the 'max_user_connections' resource (current value: 100)
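
As a side note, the monitor user’s connection cap can be inspected (and raised) directly; a sketch, assuming root access and that the user is defined as monitor@'%' (verify the host in your deployment first). MYSQL_ROOT_PASSWORD is again a placeholder:

# Check the per-user connection cap that the probes tripped over.
kubectl -n percona-operator exec mysqlcluster-pxc-1 -c pxc -- \
    mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e \
    "SELECT user, host, max_user_connections FROM mysql.user WHERE user = 'monitor';"
# If needed, raise it (value here is hypothetical; size it to your probes):
#   ALTER USER 'monitor'@'%' WITH MAX_USER_CONNECTIONS 200;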


sudo kubectl get events -n percona-operator \
    --field-selector involvedObject.name=mysqlcluster-pxc-2 \
    -o custom-columns=TS:.lastTimestamp,TYPE:.type,REASON:.reason,MSG:.message | \
    grep -E "Killing|Unhealthy|Liveness"
2026-05-06T02:36:19Z   Warning   Unhealthy          Readiness probe failed: command timed out: "/var/lib/mysql/readiness-check.sh" timed out after 15s
2026-05-06T02:36:19Z   Warning   Unhealthy          Liveness probe failed: command timed out: "/var/lib/mysql/liveness-check.sh" timed out after 10s
2026-05-06T03:18:38Z   Normal    Killing            Stopping container linkerd-proxy
2026-05-06T03:18:38Z   Normal    Killing            Stopping container pxc
2026-05-06T03:19:00Z   Warning   Unhealthy          Readiness probe errored and resulted in unknown state: rpc error: code = Canceled desc = context canceled
2026-05-06T14:59:20Z   Warning   Unhealthy          Readiness probe failed: ERROR 2003 (HY000): Can't connect to MySQL server on '100.100.13.217:33062' (111)
2026-05-06T15:34:16Z   Warning   Unhealthy          Readiness probe failed: command timed out: "/var/lib/mysql/readiness-check.sh" timed out after 15s
2026-05-06T11:20:29Z   Warning   Unhealthy          Readiness probe failed: + [[ Primary == \P\r\i\m\a\r\y ]]
2026-05-06T11:35:29Z   Warning   Unhealthy          Liveness probe failed: + [[ -n non-Primary ]]
2026-05-06T11:40:24Z   Warning   Unhealthy          Readiness probe failed: + [[ non-Primary == \P\r\i\m\a\r\y ]]
2026-05-06T11:44:59Z   Normal    Killing            Container pxc failed liveness probe, will be restarted
2026-05-06T15:34:16Z   Warning   Unhealthy          Liveness probe failed: command timed out: "/var/lib/mysql/liveness-check.sh" timed out after 10s