Percona Helm Deployment failed (liveness and readiness probe issues)

Description:

Percona XtraDB Cluster installation using Helm failed.

Environment: K8s 1.27.3, locally hosted, 4 nodes, CNI: Calico.

CR version 1.12.0
PXC 8.29

None of the cluster pods are working. They are continuously restarting, and further pods are unable to initialize.

Issues in Installation

kubectl describe output for the HAProxy pods

kubectl describe output for the PXC pods

Logs of the HAProxy and PXC pods:
halogs.log (16.6 KB)
pxc.log (2.5 KB)

Can someone help me with this?

Hello @mzayn1 ,

I’ve seen such an issue before, and it was the result of poor networking or heavily utilized nodes.

Do you think this might be the issue?
Also a couple of asks:

  1. Please share your cr.yaml manifest
  2. Please try out Operator version 1.13 (we just released it and it has 60+ improvements); a rough Helm upgrade sketch is below
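
If the cluster was deployed with the Percona Helm charts, the upgrade looks roughly like this (the release and namespace names are placeholders and the exact chart versions are an assumption, so adjust to your install):

helm repo add percona https://percona.github.io/percona-helm-charts/
helm repo update
# upgrade the operator chart first, then the database chart
helm upgrade my-operator percona/pxc-operator --version 1.13.0 -n percona
helm upgrade my-cluster percona/pxc-db --version 1.13.0 -n percona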

UPD: another interesting thing is here: K8SPXC-1264 quotation syntax fix when sql_mode's ANSI_QUOTES enabled by yambottle · Pull Request #1442 · percona/percona-xtradb-cluster-operator · GitHub

It seems ANSI_QUOTES can affect the health check. Could you please check that too?
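
A quick way to check is to look at sql_mode on one of the PXC nodes, for example (the pod and container names here are just an example, adjust them to your cluster):

kubectl -n <namespace> exec -it <cluster>-pxc-0 -c pxc -- \
  mysql -uroot -p -e "SELECT @@GLOBAL.sql_mode;"

If ANSI_QUOTES shows up in the output, that PR is relevant to you.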

Hi @Sergey_Pronin

The issue has been resolved by updating the CNI Calico from v3.25 to v3.26 on Kubernetes 1.27.3.
Calico v3.25 is supported only up to K8s v1.26, and I was using Calico v3.25 with the latest K8s v1.27.
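
For anyone hitting this later: the mismatch is easy to confirm by checking which calico-node image the cluster is actually running (the namespace is calico-system for operator-based installs and kube-system for manifest installs):

kubectl get daemonset -A | grep calico-node
kubectl -n calico-system get daemonset calico-node \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'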

Thanks for your precious time and support.

I’ve run into the same problem, but:

  1. CNI is already Calico 3.26.1 on Kubernetes 1.27.2
  2. Operator is 1.13
  3. ANSI_QUOTES are not enabled

The differences are that the pxc/mysql pods are all working fine with no errors, and I can access them from the HAProxy pod directly using DNS names and various ports without a problem.
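
For example, a manual check like this from inside the haproxy container works without issue (names match my cluster shown below; the mysql client is available in the image since check_pxc.sh itself uses it, and any valid MySQL user works for this test, I used root here):

kubectl -n percona exec -it dev01-1-haproxy-0 -c haproxy -- \
  mysql -h dev01-1-pxc-0.dev01-1-pxc.percona.svc.cluster.local -P 3306 \
        -uroot -p -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state';"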

kk -n percona get pods
NAME                                              READY   STATUS    RESTARTS       AGE
dev01-1-haproxy-0                                 2/3     Running   3 (3m7s ago)   27m
dev01-1-pxc-0                                     2/2     Running   0              28m
dev01-1-pxc-1                                     2/2     Running   0              28m
dev01-1-pxc-2                                     2/2     Running   0              29m
percona-xtradb-cluster-operator-f879dfdf4-f2nzs   1/1     Running   0              58m
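
The 2/3 READY on the haproxy pod is one container failing its readiness probe; which container it is, and why it restarts, shows up with:

kubectl -n percona describe pod dev01-1-haproxy-0
kubectl -n percona get pod dev01-1-haproxy-0 \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{": ready="}{.ready}{", restarts="}{.restartCount}{"\n"}{end}'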

Failure on the HAProxy pod:

+ exec haproxy -W -db -f /etc/haproxy-custom/haproxy-global.cfg -f /etc/haproxy/pxc/haproxy.cfg -p /etc/haproxy/pxc/haproxy.pid -S /etc/haproxy/pxc/haproxy-main.sock
[NOTICE]   (1) : New worker (10) forked
[NOTICE]   (1) : Loading success.
[WARNING]  (10) : kill 27
[WARNING]  (10) : Server galera-nodes/dev01-1-pxc-0 is DOWN, reason: External check timeout, code: 0, check duration: 10003ms. 0 active and 2 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING]  (10) : kill 28
[WARNING]  (10) : Backup Server galera-nodes/dev01-1-pxc-2 is DOWN, reason: External check timeout, code: 0, check duration: 10003ms. 0 active and 1 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING]  (10) : kill 29
[WARNING]  (10) : Backup Server galera-nodes/dev01-1-pxc-1 is DOWN, reason: External check timeout, code: 0, check duration: 10001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT]    (10) : backend 'galera-nodes' has no server available!
[WARNING]  (10) : kill 30
[WARNING]  (10) : Server galera-admin-nodes/dev01-1-pxc-0 is DOWN, reason: External check timeout, code: 0, check duration: 10001ms. 0 active and 2 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING]  (10) : kill 31
[WARNING]  (10) : Backup Server galera-admin-nodes/dev01-1-pxc-2 is DOWN, reason: External check timeout, code: 0, check duration: 10001ms. 0 active and 1 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING]  (10) : kill 32
[WARNING]  (10) : Backup Server galera-admin-nodes/dev01-1-pxc-1 is DOWN, reason: External check timeout, code: 0, check duration: 10001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT]    (10) : backend 'galera-admin-nodes' has no server available!
[WARNING]  (10) : kill 33
[WARNING]  (10) : Server galera-replica-nodes/dev01-1-pxc-0 is DOWN, reason: External check timeout, code: 0, check duration: 10001ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING]  (10) : kill 34
[WARNING]  (10) : Server galera-replica-nodes/dev01-1-pxc-1 is DOWN, reason: External check timeout, code: 0, check duration: 10001ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING]  (10) : kill 35
[WARNING]  (10) : Server galera-replica-nodes/dev01-1-pxc-2 is DOWN, reason: External check timeout, code: 0, check duration: 10001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT]    (10) : backend 'galera-replica-nodes' has no server available!
[WARNING]  (10) : kill 36
[WARNING]  (10) : Server galera-mysqlx-nodes/dev01-1-pxc-0 is DOWN, reason: External check timeout, code: 0, check duration: 10001ms. 0 active and 2 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING]  (10) : kill 37
[WARNING]  (10) : Backup Server galera-mysqlx-nodes/dev01-1-pxc-2 is DOWN, reason: External check timeout, code: 0, check duration: 10001ms. 0 active and 1 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING]  (10) : kill 38
[WARNING]  (10) : Backup Server galera-mysqlx-nodes/dev01-1-pxc-1 is DOWN, reason: External check timeout, code: 0, check duration: 10001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT]    (10) : backend 'galera-mysqlx-nodes' has no server available!
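
Every backend fails the same way: "External check timeout" with a ~10s check duration means check_pxc.sh never returned within the configured timeout. How the external check is wired, and which timeouts apply, can be seen in the config files HAProxy was started with above:

kubectl -n percona exec -it dev01-1-haproxy-0 -c haproxy -- \
  grep -nE 'external-check|timeout check|check inter' \
  /etc/haproxy-custom/haproxy-global.cfg /etc/haproxy/pxc/haproxy.cfg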

I created a custom image and added >> appends to the echo statements in check_pxc.sh so that values are written to /tmp/external_check.log, but the file isn't even created. That leads me to believe the external check is not being called at all, or that there is some permission error in doing so. Running the script manually from a pod shell works and creates and populates the file without issue:

bash-5.1$ cat /usr/local/bin/check_pxc.sh |grep ">>"
echo $PXC_SERVER_IP >> /tmp/external_check.log
echo "The following values are used for PXC node $PXC_SERVER_IP in backend $HAPROXY_PROXY_NAME: " >> /tmp/external_check.log
echo "wsrep_local_state is ${PXC_NODE_STATUS[0]}; pxc_maint_mod is ${PXC_NODE_STATUS[1]}; wsrep_cluster_status is ${PXC_NODE_STATUS[2]}; $AVAILABLE_NODES nodes are available" >> /tmp/external_check.log
    echo "PXC node $PXC_SERVER_IP for backend $HAPROXY_PROXY_NAME is ok" >> /tmp/external_check.log
    echo "PXC node $PXC_SERVER_IP for backend $HAPROXY_PROXY_NAME is not ok" >> /tmp/external_check.log


bash-5.1$ /usr/local/bin/check_pxc.sh '' '' dev1-pxc-1.dev1-pxc.percona.svc.cluster.local


bash-5.1$ cat /tmp/external_check.log 
dev1-pxc-1.dev1-pxc.percona.svc.cluster.local
The following values are used for PXC node dev1-pxc-1.dev1-pxc.percona.svc.cluster.local in backend : 
wsrep_local_state is 4; pxc_maint_mod is DISABLED; wsrep_cluster_status is Primary; 3 nodes are available
PXC node dev1-pxc-1.dev1-pxc.percona.svc.cluster.local for backend  is ok
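
Since the script behaves when run by hand, my working theory is the environment HAProxy gives it. One rough way to approximate that from the same pod shell is to strip the environment down and enforce the 10s budget seen in the log (same arguments as the run above):

# minimal environment plus a 10s cap, roughly mimicking how HAProxy invokes the check
env -i PATH=/usr/bin:/bin timeout 10 bash -x /usr/local/bin/check_pxc.sh '' '' \
  dev1-pxc-1.dev1-pxc.percona.svc.cluster.local

If that hangs or errors, whatever variable or path it was missing is probably what the real external check is missing too.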

Hi Kim,
Which OS are you using?

This seems like a networking issue related to the CNI. It may also be that your cluster is low on resources, since the tigera-operator also eats up resources.
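
A quick way to check that (it needs metrics-server installed):

kubectl top nodes
kubectl top pods -A --sort-by=memory | head -20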

I tried to reproduce this issue with CR 1.13.0 on a 4-node kind cluster with Calico 3.26.1 and K8s 1.27.3, and it works fine for me.

Nodes Info and Cluster Status:

Pod Shell:

Hi Kim,

Did you ever find the root cause for this? We have the same issue on a 1.29 cluster (EKS). Same entries in the log; every so often one of the checks passes once for one backend server …