PXC Operator wsrep_notify_cmd handler hangs indefinitely on joiner state, curl to PITR has no timeout, blocks SST/IST

Hello,

I’m was facing an issue in PXC Operator v1.19.0 where a missing curl timeout in the Galera notify handler permanently blocks nodes from rejoining the cluster.

The Problem:

When a PXC node enters joiner state, the wsrep_cmd_notify_handler.sh script makes a synchronous curl call to the PITR service’s /invalidate-cache/ endpoint. This curl has no --connect-timeout, no --max-time, and no || true fallback. If the
PITR service is slow or unresponsive, the curl hangs indefinitely, Galera never proceeds to SST/IST, and the node never comes up. The script also runs under set -o errexit, which is problematic for what should be a best-effort notification
handler.

Evidence From Our Cluster

We run a 4-node on-prem PXC cluster. Two nodes were stuck in joiner state for over 35 minutes, with only one node serving traffic. The cluster had been in a crash loop for 7+ days — every restart triggered the same hang, the liveness probe
killed the container, and the cycle repeated.

The notify handler and curl were running for 35+ minutes inside the PXC containers:
mysql 380 S 12:58 /bin/bash /var/lib/mysql/wsrep_cmd_notify_handler.sh --status joiner
mysql 389 S 12:58 curl -d hostname=mysqlcluster-pxc-1… ``http://mysqlcluster-pitr``…:8080/invalidate-cache/
No xtrabackup or socat processes existed — Galera had not started state transfer at all.

PITR was itself blocked on a MinIO call that failed after ~16 minutes without ever sending an HTTP response:
2026/03/24 12:59:13 invalidating cache for mysqlcluster-pxc-0…
2026/03/24 13:15:46 ERROR: invalidate cache: load cache: get cache from storage:
read object gtid-binlog-cache.json: context canceled

Direct curl test from a healthy PXC node confirmed the endpoint hangs:
$ curl --max-time 5 -d “hostname=test” ``http://mysqlcluster-pitr:8080/invalidate-cache/
curl: (28) Operation timed out after 5001 milliseconds with 0 bytes received

Killing the notify handler shell process immediately unblocked everything — IST completed in seconds:
2026-03-24T13:33:28 Receiving IST… 100.0% (27490/27490 events) complete.
2026-03-24T13:33:28 Shifting JOINER → JOINED

Wanted to ask if there’s a quick fix for this to prevent this scenario, and wanted to see if we could make this curl request async or add retries?

@Ege_Gunes