PXC cluster node crashed and cannot recover

Description:

This happens frequently enough to be a real problem: of the two database instances we run, nodes regularly crash with no easy-to-diagnose reason and cannot recover themselves, leaving the pod in a CrashLoopBackOff.

This looks like:

mysql-pxc-db-pxc-0                                      4/4     Running            13 (5d17h ago)    59d
mysql-pxc-db-pxc-1                                      4/4     Running            0                 59d
mysql-pxc-db-pxc-2                                      3/4     CrashLoopBackOff   46 (75s ago)      4h9m

And a `kubectl describe` of the pod in question includes the following segment:

  pxc:
    Container ID:  containerd://7894352f5a174ba0a5201aeda313a26c7494d8ec3c200fd96fece476e5e8834c
    Image:         percona/percona-xtradb-cluster:8.0.41-32.1
    Image ID:      docker.io/percona/percona-xtradb-cluster@sha256:168ffb252d533b856a74820dea51c155bf5a8cb6a806a4d8a2e387ed7417a733
    Ports:         3306/TCP, 4444/TCP, 4567/TCP, 4568/TCP, 33062/TCP, 33060/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    Command:
      /var/lib/mysql/pxc-entrypoint.sh
    Args:
      mysqld
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 19 Aug 2025 01:08:51 +1200
      Finished:     Tue, 19 Aug 2025 01:09:38 +1200
    Ready:          False
    Restart Count:  45
    Limits:
      cpu:     2
      memory:  12G
    Requests:
      cpu:      600m
      memory:   8G
    Liveness:   exec [/var/lib/mysql/liveness-check.sh] delay=300s timeout=5s period=10s #success=1 #failure=3
    Readiness:  exec [/var/lib/mysql/readiness-check.sh] delay=15s timeout=15s period=30s #success=1 #failure=5
    Environment Variables from:
      mysql-pxc-db-env-vars-pxc  Secret  Optional: true
    Environment:
      PXC_SERVICE:                    mysql-pxc-db-pxc-unready
      MONITOR_HOST:                   %
      MYSQL_ROOT_PASSWORD:            <set to the key 'root' in secret 'internal-mysql-pxc-db'>        Optional: false
      XTRABACKUP_PASSWORD:            <set to the key 'xtrabackup' in secret 'internal-mysql-pxc-db'>  Optional: false
      MONITOR_PASSWORD:               <set to the key 'monitor' in secret 'internal-mysql-pxc-db'>     Optional: false
      LOG_DATA_DIR:                   /var/lib/mysql
      IS_LOGCOLLECTOR:                yes
      CLUSTER_HASH:                   4212372
      OPERATOR_ADMIN_PASSWORD:        <set to the key 'operator' in secret 'internal-mysql-pxc-db'>  Optional: false
      LIVENESS_CHECK_TIMEOUT:         5
      READINESS_CHECK_TIMEOUT:        15
      DEFAULT_AUTHENTICATION_PLUGIN:  caching_sha2_password
      MYSQL_NOTIFY_SOCKET:            /var/lib/mysql/notify.sock
      MYSQL_STATE_FILE:               /var/lib/mysql/mysql.state

And the pod shows the following events:

Events:
  Type     Reason     Age    From               Message
  ----     ------     ----   ----               -------
  Normal   Scheduled  2m25s  default-scheduler  Successfully assigned default/mysql-pxc-db-pxc-2 to ip-10-1-0-193.ap-southeast-2.compute.internal
  Normal   Pulling    2m22s  kubelet            Pulling image "percona/percona-xtradb-cluster-operator:1.17.0"
  Normal   Pulled     2m20s  kubelet            Successfully pulled image "percona/percona-xtradb-cluster-operator:1.17.0" in 1.535s (1.535s including waiting). Image size: 87993197 bytes.
  Normal   Created    2m20s  kubelet            Created container pxc-init
  Normal   Started    2m20s  kubelet            Started container pxc-init
  Normal   Pulling    2m17s  kubelet            Pulling image "percona/percona-xtradb-cluster-operator:1.17.0-logcollector-fluentbit4.0.0"
  Normal   Pulled     2m16s  kubelet            Successfully pulled image "percona/percona-xtradb-cluster-operator:1.17.0-logcollector-fluentbit4.0.0" in 1.544s (1.544s including waiting). Image size: 136426815 bytes.
  Normal   Created    2m16s  kubelet            Created container logs
  Normal   Started    2m16s  kubelet            Started container logs
  Normal   Pulling    2m16s  kubelet            Pulling image "percona/percona-xtradb-cluster-operator:1.17.0-logcollector-fluentbit4.0.0"
  Normal   Pulled     2m14s  kubelet            Successfully pulled image "percona/percona-xtradb-cluster-operator:1.17.0-logcollector-fluentbit4.0.0" in 1.487s (1.487s including waiting). Image size: 136426815 bytes.
  Normal   Created    2m14s  kubelet            Created container logrotate
  Normal   Started    2m14s  kubelet            Started container logrotate
  Normal   Pulled     2m13s  kubelet            Successfully pulled image "percona/percona-xtradb-cluster:8.0.41-32.1" in 1.559s (1.559s including waiting). Image size: 211194772 bytes.
  Normal   Pulling    2m13s  kubelet            Pulling image "prom/mysqld-exporter"
  Normal   Pulled     2m11s  kubelet            Successfully pulled image "prom/mysqld-exporter" in 1.494s (1.495s including waiting). Image size: 10954979 bytes.
  Normal   Created    2m11s  kubelet            Created container mysqld-exporter
  Normal   Started    2m11s  kubelet            Started container mysqld-exporter
  Warning  Unhealthy  112s   kubelet            Readiness probe failed: ERROR 2003 (HY000): Can't connect to MySQL server on '10.1.0.78:33062' (111)
+ [[ '' == \P\r\i\m\a\r\y ]]
+ exit 1
  Normal  Pulling  85s (x2 over 2m14s)  kubelet  Pulling image "percona/percona-xtradb-cluster:8.0.41-32.1"
  Normal  Created  84s (x2 over 2m13s)  kubelet  Created container pxc
  Normal  Started  84s (x2 over 2m13s)  kubelet  Started container pxc
  Normal  Pulled   84s                  kubelet  Successfully pulled image "percona/percona-xtradb-cluster:8.0.41-32.1" in 1.516s (1.516s including waiting). Image size: 211194772 bytes.

A similar issue was posted previously but never resolved: Pxc-db cluster unable to recover after crash - #5 by Michael_Coburn

Steps to Reproduce:

It’s intermittent and not easy to reproduce.

Let a workload run for 60+ days receiving 50-100 rps against a very simple schema (there aren’t even foreign key relationships between tables), and it might crash or it might not; it could crash early, or it could never crash at all.

Version:

Operator: 1.17.0
Container: percona/percona-xtradb-cluster:8.0.41-32.1

Logs:

From the log collector container in the Pod that is stuck in the reboot loop:

{"log":"2025-08-18T13:52:15.678791Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: certificate verify failed: self-signed certificate\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:16.677708Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: certificate verify failed: self-signed certificate\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:17.177883Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: certificate verify failed: self-signed certificate\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:18.177639Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: certificate verify failed: self-signed certificate\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:18.677776Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: certificate verify failed: self-signed certificate\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:19.677904Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: certificate verify failed: self-signed certificate\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:20.178119Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: certificate verify failed: self-signed certificate\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:21.177908Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: certificate verify failed: self-signed certificate\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:21.180756Z 0 [Note] [MY-000000] [Galera] PC protocol downgrade 1 -> 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:21.180790Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node\nview ((empty))\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:21.180931Z 0 [ERROR] [MY-000000] [Galera] failed to open gcomm backend connection: 110: failed to reach primary view (pc.wait_prim_timeout)\n\t at ../../../../percona-xtradb-cluster-galera/gcomm/src/pc.cpp:connect():176\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:21.181000Z 0 [ERROR] [MY-000000] [Galera] ../../../../percona-xtradb-cluster-galera/gcs/src/gcs_core.cpp:gcs_core_open():256: Failed to open backend connection: -110 (Connection timed out)\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.181151Z 0 [Note] [MY-000000] [Galera] gcomm: terminating thread\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.181219Z 0 [Note] [MY-000000] [Galera] gcomm: joining thread\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.181364Z 0 [ERROR] [MY-000000] [Galera] ../../../../percona-xtradb-cluster-galera/gcs/src/gcs.cpp:gcs_open():1952: Failed to open channel 'mysql-pxc-db-pxc' at 'gcomm://mysql-pxc-db-pxc-0.mysql-pxc-db-pxc,mysql-pxc-db-pxc-1.mysql-pxc-db-pxc': -110 (Connection timed out)\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.181382Z 0 [ERROR] [MY-000000] [Galera] gcs connect failed: Operation timed out\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.181395Z 0 [ERROR] [MY-000000] [WSREP] Provider/Node (gcomm://mysql-pxc-db-pxc-0.mysql-pxc-db-pxc,mysql-pxc-db-pxc-1.mysql-pxc-db-pxc) failed to establish connection with cluster (reason: 7)\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.181418Z 0 [ERROR] [MY-010119] [Server] Aborting\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.181771Z 0 [System] [MY-010910] [Server] /usr/sbin/mysqld: Shutdown complete (mysqld 8.0.41-32.1)  Percona XtraDB Cluster (GPL), Release rel32, Revision 9cd31bf, WSREP version 26.1.4.3.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.182743Z 0 [ERROR] [MY-010065] [Server] Failed to shutdown components infrastructure.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.182939Z 0 [Note] [MY-000000] [Galera] dtor state: CLOSED\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.182971Z 0 [Note] [MY-000000] [Galera] MemPool(TrxHandleSlave): hit ratio: 0, misses: 0, in use: 0, in pool: 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.186036Z 0 [Note] [MY-000000] [Galera] apply mon: entered 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.189065Z 0 [Note] [MY-000000] [Galera] apply mon: entered 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192222Z 0 [Note] [MY-000000] [Galera] apply mon: entered 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192260Z 0 [Note] [MY-000000] [Galera] cert index usage at exit 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192298Z 0 [Note] [MY-000000] [Galera] cert trx map usage at exit 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192308Z 0 [Note] [MY-000000] [Galera] deps set usage at exit 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192323Z 0 [Note] [MY-000000] [Galera] avg deps dist 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192333Z 0 [Note] [MY-000000] [Galera] avg cert interval 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192342Z 0 [Note] [MY-000000] [Galera] cert index size 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192407Z 0 [Note] [MY-000000] [Galera] Service thread queue flushed.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192481Z 0 [Note] [MY-000000] [Galera] wsdb trx map usage 0 conn query map usage 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192496Z 0 [Note] [MY-000000] [Galera] MemPool(LocalTrxHandle): hit ratio: 0, misses: 0, in use: 0, in pool: 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192636Z 0 [Note] [MY-000000] [Galera] Shifting CLOSED -> DESTROYED (TO: 0)\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.193683Z 0 [Note] [MY-000000] [Galera] Flushing memory map to disk...\n","file":"/var/lib/mysql/mysqld-error.log"}

From the PXC container in the Pod that is stuck in the reboot loop:

Cluster address set to: mysql-pxc-db-pxc-0.mysql-pxc-db-pxc,mysql-pxc-db-pxc-1.mysql-pxc-db-pxc
8.0.41-32.1
[mysqld]
pxc-encrypt-cluster-traffic=ON
ssl-ca=/etc/mysql/ssl-internal/ca.crt
ssl-key=/etc/mysql/ssl-internal/tls.key
ssl-cert=/etc/mysql/ssl-internal/tls.crt
wsrep_provider_options="pc.weight=10"

wsrep_sst_donor=mysql-pxc-db-pxc-1,

log-error=/var/lib/mysql/mysqld-error.log

log_error_suppression_list="MY-010055"

admin-address=10.1.0.144

authentication_policy=caching_sha2_password,,
skip_replica_start=ON
wsrep_notify_cmd=/var/lib/mysql/wsrep_cmd_notify_handler.sh
enforce-gtid-consistency
gtid-mode=ON
plugin_load="binlog_utils_udf=binlog_utils_udf.so"

datadir=/var/lib/mysql
socket=/tmp/mysql.sock
skip-host-cache

coredumper
server_id=42123722
binlog_format=ROW
default_storage_engine=InnoDB

innodb_flush_log_at_trx_commit  = 2
innodb_flush_method             = O_DIRECT
innodb_file_per_table           = 1
innodb_autoinc_lock_mode=2

bind_address = 0.0.0.0

wsrep_slave_threads=2
wsrep_cluster_address=gcomm://mysql-pxc-db-pxc-0.mysql-pxc-db-pxc,mysql-pxc-db-pxc-1.mysql-pxc-db-pxc
wsrep_provider=/usr/lib64/galera4/libgalera_smm.so

wsrep_cluster_name=mysql-pxc-db-pxc
wsrep_node_address=10.1.0.144
wsrep_node_incoming_address=mysql-pxc-db-pxc-2.mysql-pxc-db-pxc.default.svc.cluster.local:3306

wsrep_sst_method=xtrabackup-v2

[client]
socket=/tmp/mysql.sock

[sst]
cpat=.*\.pem$\|.*init\.ok$\|.*galera\.cache$\|.*wsrep_recovery_verbose\.log$\|.*readiness-check\.sh$\|.*liveness-check\.sh$\|.*get-pxc-state$\|.*sst_in_progress$\|.*sleep-forever$\|.*pmm-prerun\.sh$\|.*sst-xb-tmpdir$\|.*\.sst$\|.*gvwstate\.dat$\|.*grastate\.dat$\|.*\.err$\|.*\.log$\|.*RPM_UPGRADE_MARKER$\|.*RPM_UPGRADE_HISTORY$\|.*pxc-entrypoint\.sh$\|.*unsafe-bootstrap\.sh$\|.*pxc-configure-pxc\.sh\|.*peer-list$\|.*auth_plugin$\|.*version_info$\|.*mysql-state-monitor$\|.*mysql-state-monitor\.log$\|.*notify\.sock$\|.*mysql\.state$\|.*wsrep_cmd_notify_handler\.sh$
progress=1

+ [[ -z node:10-1-0-242.mysql-pxc-db-pxc-unready.default.svc.cluster.local:wsrep_ready:ON:wsrep_connected:ON:wsrep_local_state_comment:Synced:wsrep_cluster_status:Primary ]]
+ [[ -z node:10-1-0-242.mysql-pxc-db-pxc-unready.default.svc.cluster.local:wsrep_ready:ON:wsrep_connected:ON:wsrep_local_state_comment:Synced:wsrep_cluster_status:Primary ]]
+ [[ -z node:10-1-0-242.mysql-pxc-db-pxc-unready.default.svc.cluster.local:wsrep_ready:ON:wsrep_connected:ON:wsrep_local_state_comment:Synced:wsrep_cluster_status:Primary ]]
+ test -e /opt/percona/hookscript/hook.sh
+ init_opt=
+ [[ -f /etc/mysql/init-file/init.sql ]]

There are a lot more logs related to this, but the first problem to solve is why the node won’t reconnect. Deleting the pod and the underlying PVC doesn’t help. Once the reconnection issue is resolved, the source of the original crash should be identified.
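For reference, the pod/PVC recreation that was attempted looks roughly like the sketch below; the PVC name follows the operator’s usual `datadir-<pod>` volumeClaimTemplate naming and is an assumption, so confirm it with `kubectl get pvc` first.

```
# Delete the claim and the pod so both get recreated from scratch.
kubectl delete pvc datadir-mysql-pxc-db-pxc-2 --wait=false
kubectl delete pod mysql-pxc-db-pxc-2
# The StatefulSet recreates the pod, the claim is re-provisioned, and the node
# attempts a fresh SST from the donor -- which is where it gets stuck again.
```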

Expected Result:

The pod comes back and rejoins the cluster without any fuss.

Actual Result:

The pod goes into a reboot loop.

Additional Information:

It seems like the reboot loop happens because the pod needs to replicate data to catch up to where it was, but this takes longer than the time the pod is given to rejoin; the liveness and readiness checks then fail, the pod is restarted before it has finished, and the cycle repeats.
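One way to confirm whether a state transfer is still in flight when the probes start failing is to look for the SST marker file and the Galera state file in the datadir (both paths are referenced in the node’s own configuration above); a rough check:

```
# Check whether a state transfer is still running when the probe failures begin.
kubectl exec mysql-pxc-db-pxc-2 -c pxc -- ls -l /var/lib/mysql/sst_in_progress  # exists only while an SST is running
kubectl exec mysql-pxc-db-pxc-2 -c pxc -- cat /var/lib/mysql/grastate.dat       # seqno: -1 means the local state is incomplete
```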

This issue appears to be related to encrypted traffic between the PXC nodes using SSL certificates. Please ensure that all PXC cluster nodes have valid and consistent certificates in place to establish connectivity with other nodes when default cluster traffic encryption is enabled (pxc-encrypt-cluster-traffic=ON).
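A quick way to check that could be to compare the certificate each node presents on the Galera replication port; a sketch using the pod names from the output above (the exec into the crash-looping pod may fail, but the two healthy nodes are the interesting ones):

```
# Subject, issuer and validity should be consistent across all three nodes.
for p in mysql-pxc-db-pxc-0 mysql-pxc-db-pxc-1 mysql-pxc-db-pxc-2; do
  echo "== $p"
  kubectl exec "$p" -c pxc -- sh -c \
    'echo | openssl s_client -connect 127.0.0.1:4567 2>/dev/null | openssl x509 -noout -subject -issuer -dates'
done
```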

Hi @Gerwin_van_de_Steeg,

The liveness probe should be disabled while an SST is in progress. From the logs it looks like the database becomes unresponsive for a while and is then killed by the liveness probe. You can try setting a higher timeout for the liveness probe via the LIVENESS_CHECK_TIMEOUT env var in the pxc pods. Also, I wonder how big your dataset is.
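Since the pxc container already sources its environment from the `mysql-pxc-db-env-vars-pxc` Secret (see the describe output above), one way to raise that timeout could be to set the variable there; a sketch, assuming the entrypoint picks the new value up on the next pod restart:

```
# Raise LIVENESS_CHECK_TIMEOUT (seconds); Secret values must be base64-encoded.
kubectl patch secret mysql-pxc-db-env-vars-pxc --type merge \
  -p '{"data":{"LIVENESS_CHECK_TIMEOUT":"'"$(echo -n 450 | base64)"'"}}'
# Restart the affected pod so the new value is injected:
kubectl delete pod mysql-pxc-db-pxc-2
```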

TLS is enabled as per the settings of the chart used to deploy the clusters.

Checking the TLS certificates shows they are managed by cert-manager and are being renewed accordingly. However, I can’t tell whether the database is being instructed to reload the certificates when cert-manager renews them.

bash-5.1$ openssl x509 -noout -serial -dates -subject -in /etc/mysql/ssl-internal/tls.crt
serial=F0B167498F8C94A2BB66A05038E1F75B
notBefore=Aug 18 09:06:47 2025 GMT
notAfter=Nov 16 09:06:47 2025 GMT
subject=CN=mysql-pxc-db-pxc

bash-5.1$ openssl x509 -noout -serial -dates -subject -in /etc/mysql/ssl/tls.crt
serial=4A2A88E44CD9D7419F2A67E487B360A9
notBefore=Aug 18 09:06:47 2025 GMT
notAfter=Nov 16 09:06:47 2025 GMT
subject=CN=mysql-pxc-db-proxysql
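To check whether the running server has actually reloaded what cert-manager wrote to disk, comparing the serial number of the on-disk certificate with the one presented on 3306 should work; a sketch, run inside the pxc container and reusing the paths above:

```
# Serial of the certificate cert-manager maintains on disk:
openssl x509 -noout -serial -in /etc/mysql/ssl-internal/tls.crt
# Serial of the certificate mysqld is actually presenting on 3306:
openssl s_client -starttls mysql -connect localhost:3306 </dev/null 2>/dev/null \
  | openssl x509 -noout -serial
# If the serials differ, the files were rotated but the server has not reloaded them.
```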

However, since the pod with the issue is `mysql-pxc-db-pxc-2`, which has been deleted and is in a reboot loop, we know it is guaranteed to have the latest version of the certificates injected into it.

I’ve run a manual ALTER INSTANCE RELOAD TLS on both working instances, but that is no guarantee that the updated certificates are actually loaded.

Running a packet capture of the connection setup from the mysql client, I can see that the certificate in the TLS handshake has the updated expiration time and the subject mysql-pxc-db-pxc, which appears to be that internal certificate.

So, from the information available, the issue doesn’t appear to be the client-facing TLS certificate. Both still-functional pods seem to present the right TLS details when connections are made to the mysql client port 3306.

The pod does have a variety of ports open

    Ports:         3306/TCP, 4444/TCP, 4567/TCP, 4568/TCP, 33062/TCP, 33060/TCP

Looking at the gcomm port (4567), we can see the problem I would expect:

$ kubectl exec -it mysql-pxc-db-pxc-1 -c pxc -- bash
bash-5.1$ openssl s_client 127.0.0.1:4567
Connecting to 127.0.0.1
CONNECTED(00000003)
Can't use SSL_get_servername
depth=0 CN=mysql-pxc-db-pxc
verify error:num=18:self-signed certificate
verify return:1
depth=0 CN=mysql-pxc-db-pxc
verify return:1
---
Certificate chain
 0 s:CN=mysql-pxc-db-pxc
   i:CN=mysql-pxc-db-pxc
   a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
   v:NotBefore: Jun 19 09:06:47 2025 GMT; NotAfter: Sep 17 09:06:47 2025 GMT

We can see that this listener is still serving the old TLS certificate. How can it be restarted/reloaded with the new TLS certificate, preferably without taking down the database?

Edit: It also seems that the entire cluster eventually crashes hard within 48 hours of that certificate expiring.

Cheers,

The database is about 20G in size.

And that’s the small one; the other MySQL instance is 24G in size.

And here is the problem again, ~59 days after the most recent crash, and the same node has died again. If we look at the gcomm port we can see that it is serving an out-of-date TLS certificate, even after we tell MySQL to reload TLS (shown further down).

In the running pod status we can see:

mysql-pxc-db-pxc-0                                   4/4     Running            2 (14d ago)        59d
mysql-pxc-db-pxc-1                                   4/4     Running            0                  14d
mysql-pxc-db-pxc-2                                   3/4     CrashLoopBackOff   11 (11s ago)       41m

And if we check the TLS certificate served by the gcomm port:

$  kubectl exec -it mysql-pxc-db-pxc-1 -c pxc -- bash
bash-5.1$ openssl s_client 127.0.0.1:4567
Connecting to 127.0.0.1
CONNECTED(00000003)
Can't use SSL_get_servername
depth=0 CN=mysql-pxc-db-pxc
verify error:num=18:self-signed certificate
verify return:1
depth=0 CN=mysql-pxc-db-pxc
verify return:1
---
Certificate chain
 0 s:CN=mysql-pxc-db-pxc
   i:CN=mysql-pxc-db-pxc
   a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
   v:NotBefore: Aug 18 09:06:47 2025 GMT; NotAfter: Nov 16 09:06:47 2025 GMT
---
Server certificate
-----BEGIN CERTIFICATE-----

And the TLS certificate served by the mysql port:

bash-5.1$ openssl s_client -starttls mysql -connect localhost:3306 </dev/null 2>/dev/null | openssl x509 -dates -noout
notBefore=Aug 18 09:06:47 2025 GMT
notAfter=Nov 16 09:06:47 2025 GMT

And the date on the pod is..

bash-5.1$ date -u
Fri Oct 17 09:55:05 UTC 2025

And the logs on the failing node tell us, over and over again:

$ kubectl logs mysql-pxc-db-pxc-2 -c logs
{"log":"2025-10-17T09:53:57.839756Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: certificate verify failed: self-signed certificate\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-10-17T09:53:58.838950Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: certificate verify failed: self-signed certificate\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-10-17T09:53:59.339052Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: certificate verify failed: self-signed certificate\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-10-17T09:54:00.338931Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: certificate verify failed: self-signed certificate\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-10-17T09:54:00.839071Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: certificate verify failed: self-signed certificate\n","file":"/var/lib/mysql/mysqld-error.log"}

So we need to safely restart whatever component is listening on that port, preferably without breaking the database.

If we use ALTER INSTANCE RELOAD TLS on the running database instances, we can see that the TLS certificate used to serve the DB connection is correctly updated, but the gcomm one is not.

$ kubectl exec -it mysql-pxc-db-pxc-0 -c pxc -- bash
bash-5.1$ openssl s_client -starttls mysql -connect localhost:3306 </dev/null 2>/dev/null | openssl x509 -dates -noout
notBefore=Aug 18 09:06:47 2025 GMT
notAfter=Nov 16 09:06:47 2025 GMT
bash-5.1$ echo 'ALTER INSTANCE RELOAD TLS;' | mysql -u root -p
Enter password:
bash-5.1$ openssl s_client -starttls mysql -connect localhost:3306 </dev/null 2>/dev/null | openssl x509 -dates -noout
notBefore=Oct 17 09:06:47 2025 GMT
notAfter=Jan 15 09:06:47 2026 GMT
bash-5.1$ openssl s_client 127.0.0.1:4567
Connecting to 127.0.0.1
CONNECTED(00000003)
Can't use SSL_get_servername
depth=0 CN=mysql-pxc-db-pxc
verify error:num=18:self-signed certificate
verify return:1
depth=0 CN=mysql-pxc-db-pxc
verify return:1
---
Certificate chain
 0 s:CN=mysql-pxc-db-pxc
   i:CN=mysql-pxc-db-pxc
   a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
   v:NotBefore: Aug 18 09:06:47 2025 GMT; NotAfter: Nov 16 09:06:47 2025 GMT

Any thoughts on that? I’ve had a look to see whether a SIGHUP or SIGUSR1 could be used, but nothing like that works.
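The only approach I can think of that avoids full downtime is to restart mysqld node by node, so each instance re-reads the certificate files at startup; an untested sketch follows. It assumes the remaining two nodes keep quorum while one restarts, and whether a node holding the new certificate can still handshake with peers presenting the old one is exactly the open question above.

```
# Restart one node at a time, waiting for it to rejoin (Ready) before moving on.
for p in mysql-pxc-db-pxc-2 mysql-pxc-db-pxc-1 mysql-pxc-db-pxc-0; do
  kubectl delete pod "$p"
  kubectl wait --for=condition=Ready "pod/$p" --timeout=15m
done
```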

There’s nothing quite like a database workload reliably failing after 60 days, every 60 days.

Cheers

While scaling the pxc database to size: 0 (with unsafe pxcSize: true), and back up to the right size once all nodes have shut down, does resolve the problem, it takes the entire database down for a period, which is not really a sustainable option in a 24/7 environment.
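For reference, that workaround looks roughly like this against the PerconaXtraDBCluster custom resource; the CR object name and the exact field path for the unsafe flag (`spec.unsafeFlags.pxcSize` here) are assumptions that depend on the chart and CR version, so check with `kubectl get pxc` first.

```
# Allow scaling below 3 nodes, then scale PXC to 0...
kubectl patch pxc mysql-pxc-db --type merge \
  -p '{"spec":{"unsafeFlags":{"pxcSize":true},"pxc":{"size":0}}}'
# ...wait for all mysql-pxc-db-pxc-* pods to terminate, then scale back up:
kubectl patch pxc mysql-pxc-db --type merge \
  -p '{"spec":{"pxc":{"size":3}}}'
```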

@Gerwin_van_de_Steeg I agree that this sounds problematic. I will try to reproduce your problem next week and create tickets if necessary.

I don’t see a failure when I renew any of the certificates, including the CA.

One thing I realized is that I see a different certificate chain than yours:

```
bash-5.1$ openssl s_client 127.0.0.1:4567
Connecting to 127.0.0.1
CONNECTED(00000003)
Can't use SSL_get_servername
depth=1 CN=cluster1-ca
verify error:num=19:self-signed certificate in certificate chain
verify return:1
depth=1 CN=cluster1-ca
verify return:1
depth=0 CN=cluster1-pxc
verify return:1
---
Certificate chain
 0 s:CN=cluster1-pxc
   i:CN=cluster1-ca
   a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
   v:NotBefore: Oct 31 15:47:08 2025 GMT; NotAfter: Jan 29 15:47:08 2026 GMT
 1 s:CN=cluster1-ca
   i:CN=cluster1-ca
   a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
   v:NotBefore: Oct 31 15:31:44 2025 GMT; NotAfter: Oct 30 15:31:44 2028 GMT
```

while yours looks like:

```
bash-5.1$ openssl s_client 127.0.0.1:4567
Connecting to 127.0.0.1
CONNECTED(00000003)
Can't use SSL_get_servername
depth=0 CN=mysql-pxc-db-pxc
verify error:num=18:self-signed certificate
verify return:1
depth=0 CN=mysql-pxc-db-pxc
verify return:1
---
Certificate chain
 0 s:CN=mysql-pxc-db-pxc
   i:CN=mysql-pxc-db-pxc
   a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
   v:NotBefore: Aug 18 09:06:47 2025 GMT; NotAfter: Nov 16 09:06:47 2025 GMT
```

Do you have cluster1-ca-cert Certificate and Secret objects? Do you use a custom cert-manager Issuer configuration in cr.yaml?
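In case it helps answer that, something like the following should show whether those objects exist; the object names are guesses based on the naming used elsewhere in this thread.

```
# List the cert-manager objects and TLS secrets for this cluster.
kubectl get certificate,issuer | grep mysql-pxc-db
kubectl get secret | grep -E 'mysql-pxc-db-(ssl|ssl-internal|ca-cert)'
# Inspect the CA stored in the ca-cert secret, if present:
kubectl get secret mysql-pxc-db-ca-cert -o jsonpath='{.data.ca\.crt}' \
  | base64 -d | openssl x509 -noout -subject -issuer -dates
```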