SmartUpdate breaks pxc pod when "applying changes"

Description:

I’m running the operator and a cluster deployed via Helm. Initially the cluster runs without problems, but after a few moments the operator decides to “apply changes to secondary pod” according to its logs. This restarts the targeted pod, which then never reaches a working state: the mysqld --wsrep_start_position=... command fails.
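
For reference, this is how I’m following the operator logs when it happens (the namespace and deployment name below are placeholders for my release):

# Follow the operator logs; this is where the "apply changes to secondary pod" message shows up
kubectl -n <operator-namespace> logs deploy/<release-name>-pxc-operator -f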

In the pod logs, I see these errors:

pxc pod logs
{"log":"2024-08-23T10:48:09.639919Z 0 [Note] [MY-000000] [Galera] PC protocol downgrade 1 -> 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-08-23T10:48:09.640023Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node\nview ((empty))\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-08-23T10:48:09.647832Z 0 [ERROR] [MY-000000] [Galera] failed to open gcomm backend connection: 110: failed to reach primary view (pc.wait_prim_timeout): 110 (Connection timed out)\n\t at /mnt/jenkins/workspace/pxc80-autobuild-RELEASE/test/rpmbuild/BUILD/Percona-XtraDB-Cluster-8.0.36/percona-xtradb-cluster-galera/gcomm/src/pc.cpp:connect():176\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-08-23T10:48:09.653509Z 0 [ERROR] [MY-000000] [Galera] /mnt/jenkins/workspace/pxc80-autobuild-RELEASE/test/rpmbuild/BUILD/Percona-XtraDB-Cluster-8.0.36/percona-xtradb-cluster-galera/gcs/src/gcs_core.cpp:gcs_core_open():219: Failed to open backend connection: -110 (Connection timed out)\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-08-23T10:48:10.658832Z 0 [Note] [MY-000000] [Galera] gcomm: terminating thread\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-08-23T10:48:10.658894Z 0 [Note] [MY-000000] [Galera] gcomm: joining thread\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-08-23T10:48:10.658996Z 0 [ERROR] [MY-000000] [Galera] /mnt/jenkins/workspace/pxc80-autobuild-RELEASE/test/rpmbuild/BUILD/Percona-XtraDB-Cluster-8.0.36/percona-xtradb-cluster-galera/gcs/src/gcs.cpp:gcs_open():1880: Failed to open channel 'testapp-db-pxc' at 'gcomm://testapp-db-pxc-0.testapp-db-pxc,testapp-db-pxc-1.testapp-db-pxc': -110 (Connection timed out)\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-08-23T10:48:10.665159Z 0 [ERROR] [MY-000000] [Galera] gcs connect failed: Connection timed out\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-08-23T10:48:10.665196Z 0 [ERROR] [MY-000000] [WSREP] Provider/Node (gcomm://testapp-db-pxc-0.testapp-db-pxc,testapp-db-pxc-1.testapp-db-pxc) failed to establish connection with cluster (reason: 7)\n","file":"/var/lib/mysql/mysqld-error.log"}

I don’t understand what kind of update the operator is trying to apply here. Any ideas what it might be, and why it causes MySQL to fail to start?

Version:

Helm charts “pxc-operator” and “pxc-db” 1.15.0

Helm values:

pxc-operator:

watchAllNamespaces: true

pxc-db:

pxc:
  persistence:
    storageClass: rook-ceph-block
    size: 2Gi
  disableTLS: true

Hi @dkorbginski,

I don’t understand what kind of update the operator is trying to apply here. Any ideas what it might be, and why it causes MySQL to fail to start?

The cluster starts the MySQL processes with the wsrep_start_position variable, which tells mysqld the Galera position (cluster UUID and last committed sequence number) the node should recover from. That part is normal; the failure in your logs happens later, when the node tries to join the other cluster members.
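
If you want to see which position the failing pod will try to start from, you can look at the Galera state file in its data directory. A minimal check, using the container name from your cluster (the namespace and failing pod name are placeholders you need to fill in):

# grastate.dat holds the last known cluster uuid:seqno; this is (roughly) what ends up in --wsrep_start_position
kubectl -n <namespace> exec <failing-pxc-pod> -c pxc -- cat /var/lib/mysql/grastate.dat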

Regarding this line in the pod logs:
{"log":"2024-08-23T10:48:10.665159Z 0 [ERROR] [MY-000000] [Galera] gcs connect failed: Connection timed out\n","file":"/var/lib/mysql/mysqld-error.log"}
This means the node timed out while opening the gcomm connection to the other cluster members, so it never reached the primary component (pc.wait_prim_timeout). You should ensure all nodes can communicate with each other over the Galera ports.
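
One quick way to check this is to verify that the Galera group-communication port (4567) of the other members is reachable from the failing pod. A rough sketch using the pod/service names from your logs (namespace and failing pod name are placeholders; bash’s /dev/tcp is used because nc may not be present in the image):

# Try to open a TCP connection from the failing pod to the Galera port of another member
kubectl -n <namespace> exec <failing-pxc-pod> -c pxc -- \
  bash -c 'timeout 3 bash -c "</dev/tcp/testapp-db-pxc-0.testapp-db-pxc/4567" && echo reachable || echo not reachable'

Also make sure the headless service names resolve from inside the pod, and that no NetworkPolicy blocks ports 4567 (group communication), 4568 (IST) and 4444 (SST).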