Description:
We have a few Percona XtraDB Cluster (PXC) clusters running with 3 replicas. Whenever a PXC pod crashes, the operator seems to have trouble self-healing the broken pod. So far we have been unable to find the reason for the crash in the first place, but we would first like to understand why the operator cannot rejoin a broken PXC node.
Interestingly, if we delete the broken PXC pod, the operator has no trouble redeploying it and joining it to the cluster. We are not sure whether this is intended, but the pods also have to be deleted in a specific order. In our example we have pxc0, pxc1 and pxc2, of which pxc0 and pxc2 were broken. You have to delete pxc0 first before the operator actually redeploys the pod. If you only delete pxc2, the operator just ignores it.
We are running version 1.13 of the operator, although an update to 1.15 is on the way. Here is a log excerpt from the broken pod trying to rejoin; it seems to have something to do with file permissions:
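For reference, the manual workaround described above can be sketched as follows. This is only a sketch of what we do by hand: it sorts the broken pods by their trailing ordinal and builds one `kubectl delete pod` command per pod, lowest ordinal first, without running anything. The pod names and the namespace are hypothetical placeholders, and the helper names are made up for this example.

```python
import re


def deletion_order(broken_pods):
    """Return the broken pod names sorted by trailing ordinal, lowest first."""
    def ordinal(name):
        match = re.search(r"(\d+)$", name)
        if match is None:
            raise ValueError(f"pod name has no trailing ordinal: {name!r}")
        return int(match.group(1))

    return sorted(broken_pods, key=ordinal)


def kubectl_delete_commands(broken_pods, namespace):
    """Build (but do not run) one `kubectl delete pod` command per broken pod."""
    return [
        ["kubectl", "delete", "pod", pod, "--namespace", namespace]
        for pod in deletion_order(broken_pods)
    ]


if __name__ == "__main__":
    # Hypothetical names: substitute the real pod names and namespace.
    broken = ["cluster1-pxc-2", "cluster1-pxc-0"]
    for command in kubectl_delete_commands(broken, "my-namespace"):
        print(" ".join(command))
```

The commands are only printed so they can be reviewed before being run by hand.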
{"log":"[Note] [Galera] Shifting PRIMARY -> JOINER (TO: 875776)\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [Galera] Requesting state transfer: success, donor: 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [Galera] Resetting GCache seqno map due to different histories.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [Galera] GCache history reset: bcf31623-5faa-11ef-bf1e-423110059338:0 -> bcf31623-5faa-11ef-bf1e-423110059338:875772\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [Galera] (fa90e3f8-97f9, 'tcp://0.0.0.0:4567') turning message relay requesting off\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [WSREP-SST] joiner: => Rate:[52.3 B/s] Avg:[52.3 B/s] Elapsed:0:00:03 Bytes: 171 B \n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [WSREP-SST] Proceeding with SST.........\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [WSREP-SST] rm: cannot remove '/var/lib/mysql/.nfs000000000000014100000010': Device or resource busy\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [WSREP-SST] rm: cannot remove '/var/lib/mysql/.nfs00000000000001bd00000011': Device or resource busy\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [WSREP-SST] rm: cannot remove '/var/lib/mysql/.nfs00000000000001c200000013': Device or resource busy\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [WSREP-SST] rm: cannot remove '/var/lib/mysql/.nfs000000000000009800000014': Device or resource busy\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [WSREP-SST] rm: cannot remove '/var/lib/mysql/.nfs00000000000000a500000015': Device or resource busy\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[ERROR] [WSREP-SST] Cleanup after exit with status:1\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [WSREP-SST] joiner: => Rate:[0.00 B/s] Avg:[0.00 B/s] Elapsed:0:00:10 Bytes: 0.00 B \r\u0007xbstream: Can't create directory './ssg/' (OS errno 116 - Stale file handle)\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [WSREP-SST] /usr/bin/pxc_extra/pxb-8.0/bin/xbstream: failed to create file.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [WSREP-SST] exit code: 1\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [WSREP-SST] joiner: => Rate:[21.7KiB/s] Avg:[21.7KiB/s] Elapsed:0:00:12 Bytes: 268KiB \n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[ERROR] [WSREP] Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.42.196.5' --datadir '/var/lib/mysql/' --basedir '/usr/' --plugindir '/usr/lib64/mysql/plugin/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '1' --mysqld-version '8.0.33-25.1' --binlog 'binlog' : 1 (Operation not permitted)\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [WSREP-SST] 2024/08/23 11:37:06 socat[902] E write(1, 0x559784a27000, 8192): Broken pipe\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[ERROR] [WSREP-SST] ******************* FATAL ERROR ********************** \n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[ERROR] [WSREP-SST] Error while getting data from donor node: exit codes: 1 0 1\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[ERROR] [WSREP-SST] Line 1391\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[ERROR] [WSREP-SST] ****************************************************** \n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[ERROR] [WSREP] Failed to read uuid:seqno from joiner script.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[ERROR] [WSREP] SST script aborted with error 1 (Operation not permitted)\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [Galera] Processing SST received\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[Note] [Galera] SST received: 00000000-0000-0000-0000-000000000000:-1\n","file":"/var/lib/mysql/mysqld-error.log"}
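To make the excerpt above easier to scan, here is a minimal sketch that splits such JSON log lines into `[ERROR]` messages and messages that point at the storage layer. It assumes one JSON object per line with a `log` field, as in the excerpt. The list of hint substrings is our own choice, taken from the messages above, and is not anything the operator defines.

```python
import json

# Substrings from the excerpt above that point at the storage layer.
STORAGE_HINTS = ("Device or resource busy", "Stale file handle", ".nfs")


def summarize(lines):
    """Split JSON log lines into [ERROR] messages and storage-related messages."""
    errors, storage = [], []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            message = json.loads(raw)["log"].strip()
        except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
            continue  # skip lines that are not in the expected JSON shape
        if message.startswith("[ERROR]"):
            errors.append(message)
        if any(hint in message for hint in STORAGE_HINTS):
            storage.append(message)
    return errors, storage


# Two lines copied from the excerpt above (raw string keeps the JSON escapes).
SAMPLE = r'''
{"log":"[Note] [WSREP-SST] rm: cannot remove '/var/lib/mysql/.nfs000000000000014100000010': Device or resource busy\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"[ERROR] [WSREP] SST script aborted with error 1 (Operation not permitted)\n","file":"/var/lib/mysql/mysqld-error.log"}
'''

if __name__ == "__main__":
    found_errors, found_storage = summarize(SAMPLE.splitlines())
    print("errors:", found_errors)
    print("storage-related:", found_storage)
```

In our excerpt the storage-related messages (the `rm` failures on `.nfs…` files and the stale file handle from xbstream) appear before the first `[ERROR]` line.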
Here is the YAML we use to deploy the Percona cluster:
apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBCluster
metadata:
  name: {{ .Env.DEPLOY_NAMESPACE }}
spec:
  crVersion: 1.13.0
  allowUnsafeConfigurations: true
  updateStrategy: SmartUpdate
  upgradeOptions:
    apply: Disabled
  # upgradeOptions:
  #   versionServiceEndpoint: https://check.percona.com
  #   apply: Recommended
  #   schedule: "0 4 * * *"
  pxc:
    size: 3
    image: percona/percona-xtradb-cluster:8.0.33
    autoRecovery: true
    configuration: |
      [mysql-server]
      log-bin-trust-function-creators = 1
      character-set-server = utf8mb3
      innodb_log_buffer_size = 32M
      innodb_log_file_size = 80M
      max_allowed_packet = 20M
      default-authentication-plugin = mysql_native_password
      sql_mode = "STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION"
      sql-generate-invisible-primary-key = ON
      [mysqld]
      max_connections=5250
      log-bin-trust-function-creators = 1
      binlog_space_limit = 8G
    containerSecurityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      privileged: false
      readOnlyRootFilesystem: false
      runAsNonRoot: true
    resources:
      requests:
        memory: 3.5Gi ## CHANGE THIS: fill in the desired RAM capacity for the MySQL containers
        cpu: 200m ## CHANGE THIS: fill in the desired CPU capacity for the MySQL containers
      limits:
        memory: 4Gi ## CHANGE THIS: fill in the desired RAM capacity for the MySQL containers
        cpu: 400m ## CHANGE THIS: fill in the desired CPU capacity for the MySQL containers
    affinity:
      antiAffinityTopologyKey: "kubernetes.io/hostname"
    podDisruptionBudget:
      maxUnavailable: 1
    volumeSpec:
      persistentVolumeClaim:
        resources:
          requests:
            storage: 10Gi
    gracePeriod: 600
  haproxy:
    enabled: true
    size: 3
    image: percona/percona-xtradb-cluster-operator:1.13.0-haproxy
    containerSecurityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: false
      runAsNonRoot: true
    resources:
      requests:
        memory: 128M
        cpu: 200m
      limits:
        memory: 512M
        cpu: 400m
    affinity:
      antiAffinityTopologyKey: "kubernetes.io/hostname"
    podDisruptionBudget:
      maxUnavailable: 1
    gracePeriod: 30
  logcollector:
    enabled: true
    image: percona/percona-xtradb-cluster-operator:1.13.0-logcollector
    resources:
      requests:
        memory: 128M
        cpu: 100m
      limits:
        memory: 256M
        cpu: 200m
    containerSecurityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: false
      runAsNonRoot: true
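One thing we still want to check: the `.nfs…` file names in the log are the kind of placeholder an NFS client typically leaves behind when a file is removed while it is still open, and the volumeSpec above does not name a storage class, so the data volume may be NFS-backed. We have not confirmed this. A minimal sketch for checking which filesystem backs the data directory, given the contents of `/proc/mounts` from the pod, could look like this. The function name is made up for this example, and it ignores the octal escapes that `/proc/mounts` uses for spaces in paths.

```python
import os


def filesystem_for(path, mounts_text):
    """Return (mount_point, fs_type) of the longest mount point containing path."""
    path = os.path.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # not a "device mount_point fs_type ..." line
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix):
            if best is None or len(mount_point) > len(best[0]):
                best = (mount_point, fs_type)
    return best


if __name__ == "__main__":
    # Reads the local mount table when available; for a pod, paste the
    # output of `cat /proc/mounts` from the pxc container instead.
    if os.path.exists("/proc/mounts"):
        with open("/proc/mounts") as handle:
            print(filesystem_for("/var/lib/mysql", handle.read()))
```

If the reported type is `nfs` or `nfs4`, that would at least be consistent with the "Device or resource busy" and "Stale file handle" messages in the log.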
Steps to Reproduce:
Not sure. We seem to have this issue intermittently.
Version:
percona 1.13
Logs:
See above