Operator fails to rejoin crashed nodes to the cluster unless the pods are deleted manually

Description:

We have a few Percona XtraDB clusters running with 3 replicas. Whenever a PXC pod crashes, the operator seems to have trouble self-healing the broken pod. So far we have been unable to find the reason for the crash in the first place, but we would first like to understand why the operator is not able to rejoin a broken PXC node.

Interestingly, if we delete the broken PXC pod, the operator has no trouble redeploying it and joining it to the cluster. We are not sure whether this is intended, but the deletion also has to be done in a specific order. In our example we have pxc0, pxc1 and pxc2, of which pxc0 and pxc2 were broken. You have to delete pxc0 first before the operator actually redeploys the pod. If you only delete pxc2, the operator just ignores it.

Here is a log excerpt from the broken pod trying to rejoin. It seems to have something to do with file permissions. We are running version 1.13 of the operator; an update to 1.15 is on the way:

{"log":"[Note] [Galera] Shifting PRIMARY -> JOINER (TO: 875776)\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [Galera] Requesting state transfer: success, donor: 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [Galera] Resetting GCache seqno map due to different histories.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [Galera] GCache history reset: bcf31623-5faa-11ef-bf1e-423110059338:0 -> bcf31623-5faa-11ef-bf1e-423110059338:875772\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [Galera] (fa90e3f8-97f9, 'tcp://0.0.0.0:4567') turning message relay requesting off\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [WSREP-SST]    joiner: => Rate:[52.3 B/s] Avg:[52.3 B/s] Elapsed:0:00:03  Bytes:  171 B \n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [WSREP-SST] Proceeding with SST.........\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [WSREP-SST] rm: cannot remove '/var/lib/mysql/.nfs000000000000014100000010': Device or resource busy\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [WSREP-SST] rm: cannot remove '/var/lib/mysql/.nfs00000000000001bd00000011': Device or resource busy\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [WSREP-SST] rm: cannot remove '/var/lib/mysql/.nfs00000000000001c200000013': Device or resource busy\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [WSREP-SST] rm: cannot remove '/var/lib/mysql/.nfs000000000000009800000014': Device or resource busy\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [WSREP-SST] rm: cannot remove '/var/lib/mysql/.nfs00000000000000a500000015': Device or resource busy\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[ERROR] [WSREP-SST] Cleanup after exit with status:1\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [WSREP-SST]    joiner: => Rate:[0.00 B/s] Avg:[0.00 B/s] Elapsed:0:00:10  Bytes: 0.00 B \r\u0007xbstream: Can't create directory './ssg/' (OS errno 116 - Stale file handle)\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [WSREP-SST] /usr/bin/pxc_extra/pxb-8.0/bin/xbstream: failed to create file.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [WSREP-SST] exit code: 1\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [WSREP-SST]    joiner: => Rate:[21.7KiB/s] Avg:[21.7KiB/s] Elapsed:0:00:12  Bytes:  268KiB \n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[ERROR] [WSREP] Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.42.196.5' --datadir '/var/lib/mysql/' --basedir '/usr/' --plugindir '/usr/lib64/mysql/plugin/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '1' --mysqld-version '8.0.33-25.1'  --binlog 'binlog' : 1 (Operation not permitted)\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [WSREP-SST] 2024/08/23 11:37:06 socat[902] E write(1, 0x559784a27000, 8192): Broken pipe\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[ERROR] [WSREP-SST] ******************* FATAL ERROR ********************** \n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[ERROR] [WSREP-SST] Error while getting data from donor node:  exit codes: 1 0 1\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[ERROR] [WSREP-SST] Line 1391\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[ERROR] [WSREP-SST] ****************************************************** \n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[ERROR] [WSREP] Failed to read uuid:seqno from joiner script.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[ERROR] [WSREP] SST script aborted with error 1 (Operation not permitted)\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [Galera] Processing SST received\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [Galera] SST received: 00000000-0000-0000-0000-000000000000:-1\n","file":"/var/lib/mysql/mysqld-error.log"}

Here is the YAML we use for deploying the Percona cluster:

apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBCluster
metadata:
  name: {{ .Env.DEPLOY_NAMESPACE }}
spec:
  crVersion: 1.13.0
  allowUnsafeConfigurations: true
  updateStrategy: SmartUpdate
  upgradeOptions:
    apply: Disabled
  # upgradeOptions:
  #   versionServiceEndpoint: https://check.percona.com
  #   apply: Recommended
  #   schedule: "0 4 * * *"
  pxc:
    size: 3
    image: percona/percona-xtradb-cluster:8.0.33
    autoRecovery: true
    configuration: |
      [mysql-server]
      log-bin-trust-function-creators = 1
      character-set-server = utf8mb3
      innodb_log_buffer_size = 32M
      innodb_log_file_size = 80M
      max_allowed_packet = 20M
      default-authentication-plugin = mysql_native_password
      sql_mode = "STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION"
      sql-generate-invisible-primary-key = ON
      [mysqld]
      max_connections=5250
      log-bin-trust-function-creators = 1
      binlog_space_limit = 8G
    containerSecurityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
          - ALL
      privileged: false
      readOnlyRootFilesystem: false
      runAsNonRoot: true
    resources:
      requests:
        memory: 3.5Gi ## CHANGE THIS: fill in the desired RAM capacity for the MySQL containers
        cpu: 200m ## CHANGE THIS: fill in the desired CPU capacity for the MySQL containers
      limits:
        memory: 4Gi ## CHANGE THIS: fill in the desired RAM capacity for the MySQL containers
        cpu: 400m ## CHANGE THIS: fill in the desired CPU capacity for the MySQL containers
    affinity:
      antiAffinityTopologyKey: "kubernetes.io/hostname"
    podDisruptionBudget:
      maxUnavailable: 1
    volumeSpec:
      persistentVolumeClaim:
        resources:
          requests:
            storage: 10Gi
    gracePeriod: 600
  haproxy:
    enabled: true
    size: 3
    image: percona/percona-xtradb-cluster-operator:1.13.0-haproxy
    containerSecurityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
          - ALL
      readOnlyRootFilesystem: false
      runAsNonRoot: true
    resources:
      requests:
        memory: 128M
        cpu: 200m
      limits:
        memory: 512M
        cpu: 400m
    affinity:
      antiAffinityTopologyKey: "kubernetes.io/hostname"
    podDisruptionBudget:
      maxUnavailable: 1
    gracePeriod: 30
  logcollector:
    enabled: true
    image: percona/percona-xtradb-cluster-operator:1.13.0-logcollector
    resources:
      requests:
        memory: 128M
        cpu: 100m
      limits:
        memory: 256M
        cpu: 200m
    containerSecurityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
          - ALL
      readOnlyRootFilesystem: false
      runAsNonRoot: true

Steps to Reproduce:

Not sure. We seem to hit this issue intermittently.

Version:

Percona XtraDB Cluster operator 1.13

Logs:

See above

Hello @Sebastiaan_Villerius ,

Somehow your post slipped past my filters. Since it was posted quite some time ago, could you please share whether it was resolved or whether you still need help?

I’m open to jumping on a call with you to learn more about your use cases: Zoom Scheduler

Hi Sergey,

I’m a colleague of Sebastiaan. Thanks for replying.
It seems that we’ve managed to solve the issue!

Apparently we were running a MySQL PXC image (v8.0.33) that is not supported by the v1.15 PXC operator.
After upgrading to the supported PXC image, percona-xtradb-cluster:8.0.35-27.1, our issues were resolved.
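
For anyone landing here with the same symptoms, the change looks roughly like the fragment below. This is only a sketch: it assumes the rest of the spec stays as in the manifest posted above. The image tag is the one named in this reply. The `crVersion` value is an assumption (bumped to match the 1.15 operator) and is not copied from our actual manifest. The haproxy and logcollector images in the original manifest are pinned to 1.13.0 tags and would presumably need a matching bump as well. They are left out here because their new values are not stated in this thread.

```yaml
apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBCluster
spec:
  # Assumption: crVersion bumped to match the 1.15 operator; verify against your operator release.
  crVersion: 1.15.0
  pxc:
    size: 3
    # Was percona/percona-xtradb-cluster:8.0.33 in the manifest posted above.
    image: percona/percona-xtradb-cluster:8.0.35-27.1
```

Only the changed fields are shown; everything else under `spec` is omitted for brevity, not removed.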

You can go ahead and mark this issue as resolved :)

Kind regards, Azam