PXC cluster not starting in kind

Hi,

I am running into the exact same issue right now. I already have a working setup of the operator on my laptop which I installed with helm. Following the helm steps from the docs now leads me to this issue. Something has changed since Jan 6 2023 when I last set it up: there has been a new helm chart version since then, v1.12.1, while my running setup is on v1.12.0.
I tried that older version just in case, but the result is still the same. I also took the exact cr.yaml file from git that I used for the successful deployment and applied it against both v1.12.0 and v1.12.1, but no luck.
Something weird is happening, because according to my version control I am doing the exact same thing with the exact same versions of everything as on Jan 6 2023, yet I am getting a different result. Could there have been a change in the helm chart without bumping the version (something backported, maybe)?
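For reference, this is how I compare and pin the chart versions while testing (standard helm commands, nothing specific to this setup):

# list the chart versions available in the repo
helm search repo percona/pxc-operator --versions
helm search repo percona/pxc-db --versions

# pin the older chart explicitly when installing
helm install my-xtdb-op percona/pxc-operator --version 1.12.0

# compare what the two chart versions actually ship
helm show chart percona/pxc-db --version 1.12.0
helm show chart percona/pxc-db --version 1.12.1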

System:

# uname -a
Linux pop-os 6.0.12-76060006-generic #202212290932~1674139725~22.04~ca93ccf SMP PREEMPT_DYNAMIC Thu J x86_64 x86_64 x86_64 GNU/Linux

Steps to reproduce:
With helm

kind create cluster --config kind-config.yaml
# kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: redstone
nodes:
  - role: control-plane
    image: kindest/node:v1.23.13
  - role: worker
    image: kindest/node:v1.23.13
  - role: worker
    image: kindest/node:v1.23.13
  - role: worker
    image: kindest/node:v1.23.13
# I have also tested with v1.22.15 -> same issue
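Just to rule out kind itself, a quick sanity check of the cluster after creation (plain kind/kubectl; kind- is the default context prefix):

kind get clusters
kubectl cluster-info --context kind-redstone
kubectl get nodes -o wide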

I have already added the percona helm repo :slight_smile: and did a helm repo update.

helm install my-xtdb-op percona/pxc-operator

which gets the operator up and running and installs CRDs:

NAME                                       READY   STATUS    RESTARTS   AGE
my-xtdb-op-pxc-operator-85c45b4549-qvn2d   1/1     Running   0          6s
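The CRDs do appear to be installed; this is how I check (pxc.percona.com is the API group the chart registers):

kubectl get crd | grep percona
kubectl api-resources --api-group=pxc.percona.com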

and then:

helm install my-db percona/pxc-db \
--set haproxy.enabled=false \
--set proxysql.enabled=true \
--set logcollector.enabled=false

# I am using proxysql, but I also got the exact same issue with no custom settings at all (i.e. the default haproxy).
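To make sure these flags actually landed in the release, I also double-check the rendered values (helm get values is standard; --all includes the chart defaults):

helm get values my-db
helm get values my-db --all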

This gets us here:

❯ kubectl get pods,pxc
NAME                                           READY   STATUS    RESTARTS   AGE
pod/my-db-pxc-db-proxysql-0                    3/3     Running   0          115s
pod/my-db-pxc-db-proxysql-1                    3/3     Running   0          107s
pod/my-db-pxc-db-proxysql-2                    3/3     Running   0          97s
pod/my-db-pxc-db-pxc-0                         0/1     Running   0          115s
pod/my-xtdb-op-pxc-operator-85c45b4549-qvn2d   1/1     Running   0          4m33s

NAME                                                ENDPOINT                            STATUS         PXC   PROXYSQL   HAPROXY   AGE
perconaxtradbcluster.pxc.percona.com/my-db-pxc-db   my-db-pxc-db-proxysql-unready.ivo   initializing         3                    116s

and the my-db-pxc-db-pxc-0 pod is spamming:

│ + '[' '' = Synced ']'                                                                                                                                             │
│ + echo 'MySQL init process in progress...'                                                                                                                        │
│ + sleep 1                                                                                                                                                         │
│ MySQL init process in progress...                                                                                                                                 │
│ + for i in {120..0}                                                                                                                                               │
│ ++ echo 'SELECT variable_value FROM performance_schema.global_status WHERE variable_name='\''wsrep_local_state_comment'\'''                                       │
│ ++ mysql --protocol=socket -uroot -hlocalhost --socket=/var/lib/mysql/mysql.sock --password= -s                                                                   │
│ + wsrep_local_state=                                                                                                                                              │
│ MySQL init process in progress...                                                                                                                                 │
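For what it's worth, the query the entrypoint loops on can be run by hand from inside the container (container name pxc, credentials and socket path exactly as in the log above) to see what state, if any, mysqld reports:

kubectl exec -it my-db-pxc-db-pxc-0 -c pxc -- \
  mysql --protocol=socket -uroot -hlocalhost \
        --socket=/var/lib/mysql/mysql.sock --password= -s \
        -e "SELECT variable_value FROM performance_schema.global_status WHERE variable_name='wsrep_local_state_comment'"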

After some time (or if I kill it manually), pod my-db-pxc-db-pxc-0 gets restarted and the logs change. After the first restart the pxc-0 pod is sitting at:

│ 2023-02-28T06:54:37.828344Z 0 [Note] [MY-000000] [Galera] wsrep_load(): loading provider library 'none'                                                           │
│ 2023-02-28T06:54:37.830008Z 0 [System] [MY-011323] [Server] X Plugin ready for connections. Bind-address: '::' port: 33060, socket: /var/lib/mysql/mysqlx.sock    │
│ 2023-02-28T06:54:37.830076Z 0 [System] [MY-010931] [Server] /usr/sbin/mysqld: ready for connections. Version: '8.0.29-21.1'  socket: '/var/lib/mysql/mysql.sock'  │

but kubernetes does not mark the pod as ready because the readiness probe fails, which looks like this:

│ Events:                                                                                                                                                           │
│   Type     Reason     Age                    From               Message                                                                                           │
│   ----     ------     ----                   ----               -------                                                                                           │
│   Normal   Scheduled  4m34s                  default-scheduler  Successfully assigned ivo/my-db-pxc-db-pxc-0 to redstone-worker3                                  │
│   Normal   Pulling    4m34s                  kubelet            Pulling image "percona/percona-xtradb-cluster-operator:1.12.0"                                    │
│   Normal   Pulled     4m33s                  kubelet            Successfully pulled image "percona/percona-xtradb-cluster-operator:1.12.0" in 800.88027ms         │
│   Normal   Created    4m33s                  kubelet            Created container pxc-init                                                                        │
│   Normal   Started    4m33s                  kubelet            Started container pxc-init                                                                        │
│   Normal   Pulled     4m31s                  kubelet            Successfully pulled image "percona/percona-xtradb-cluster:8.0.29-21.1" in 760.944329ms            │
│   Normal   Pulling    2m11s (x2 over 4m31s)  kubelet            Pulling image "percona/percona-xtradb-cluster:8.0.29-21.1"                                        │
│   Normal   Pulled     2m11s                  kubelet            Successfully pulled image "percona/percona-xtradb-cluster:8.0.29-21.1" in 758.547316ms            │
│   Normal   Created    2m10s (x2 over 4m31s)  kubelet            Created container pxc                                                                             │
│   Normal   Started    2m10s (x2 over 4m30s)  kubelet            Started container pxc                                                                             │
│   Warning  Unhealthy  4s (x10 over 4m4s)     kubelet            Readiness probe failed: ERROR 2003 (HY000): Can't connect to MySQL server on '10.244.1.11:33062'  │
│ (111)                                                                                                                                                             │
│ + [[ '' == \P\r\i\m\a\r\y ]]                                                                                                                                      │
│ + exit 1                                                                                                                                                          │
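To see what the probe actually runs and to try the admin port by hand, this is what I use (the statefulset name follows from the pod name above; the mysql client is available inside the pxc container):

# dump the readiness probe definition from the statefulset
kubectl get sts my-db-pxc-db-pxc -o jsonpath='{.spec.template.spec.containers[?(@.name=="pxc")].readinessProbe}'

# try the admin port directly from inside the pod
kubectl exec -it my-db-pxc-db-pxc-0 -c pxc -- mysql -h 127.0.0.1 -P 33062 -uroot -p -e 'SELECT 1'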

No luck :frowning:

EDIT 2:
The first and only pxc pod gets stuck right before it should open the admin interface on port 33062. The logs of a normally running instance look something like this:

....
│ 2023-02-28T06:54:37.828344Z 0 [Note] [MY-000000] [Galera] wsrep_load(): loading provider library 'none'                                                           │
│ 2023-02-28T06:54:37.830008Z 0 [System] [MY-011323] [Server] X Plugin ready for connections. Bind-address: '::' port: 33060, socket: /var/lib/mysql/mysqlx.sock    │
│ 2023-02-28T06:54:37.830076Z 0 [System] [MY-010931] [Server] /usr/sbin/mysqld: ready for connections. Version: '8.0.29-21.1'  socket: '/var/lib/mysql/mysql.sock'  │

# This is the key part that we are missing. Of course, this log line is from a totally different, working cluster that I am scared to touch at the moment.
| 2023-02-28T12:39:11.383051Z 0 [System] [MY-013292] [Server] Admin interface ready for connections, address: '10.244.2.8'  port: 33062
....

Because the PXC pod never opens its admin interface on port 33062, the readiness check cannot pass, since it queries exactly that interface. And because PXC pods are spawned sequentially, the rest of the cluster never comes up if the first pod never becomes ready.
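On the working cluster the admin interface is just MySQL 8's admin_address/admin_port; to compare the broken node with the working one I plan to look at both the server variables and the generated config (the variable names are standard MySQL 8, the config path is my assumption about where the operator writes it):

# admin interface settings as mysqld sees them
kubectl exec -it my-db-pxc-db-pxc-0 -c pxc -- \
  mysql -uroot -p -e "SHOW GLOBAL VARIABLES LIKE 'admin_%'"

# what ended up in the generated node config (path is a guess)
kubectl exec -it my-db-pxc-db-pxc-0 -c pxc -- grep -i admin /etc/mysql/node.cnf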

I have the feeling the following is happening, but I don't know why at the moment:

  1. The first PXC pod loops on MySQL init process in progress... and eventually gets killed by k8s (see the check sketched below)
    1.1. Just in case they are related: https://forums.percona.com/t/bug-in-entry-entrypoint-sh-in-docker-image-based-setup/19585/2
  2. On the first restart, because the init process did not finish cleanly, the pod does not open the admin interface
  3. No admin interface on port 33062 → no passing readiness check → no cluster
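To check point 1, the reason for that first restart and the logs from before it should be visible with plain kubectl (the jsonpath filter just picks the pxc container):

# why was the previous container instance terminated?
kubectl get pod my-db-pxc-db-pxc-0 \
  -o jsonpath='{.status.containerStatuses[?(@.name=="pxc")].lastState.terminated}'

# logs from before the restart
kubectl logs my-db-pxc-db-pxc-0 -c pxc --previous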

After this, I went straight to the source and installed directly from the repo at tag v1.12.0, just as mentioned in the docs.

git clone -b v1.12.0 https://github.com/percona/percona-xtradb-cluster-operator
kubectl apply -f deploy/bundle.yaml
kubectl apply -f deploy/cr.yaml

Unfortunately, this also leads to the same behaviour: at first the pod spams MySQL init process in progress..., and after a restart it reports ready for connections, but the readiness check from kubernetes still does not go through → the pod never gets ready, etc.
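In case it helps, the operator logs in this variant can be pulled with the following (the deployment name is the one from bundle.yaml, if I am reading it correctly):

kubectl logs deployment/percona-xtradb-cluster-operator -f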

I am a bit at a loss here :frowning:

EDIT 1:
I just tested with minikube instead of kind and I get the same issue. I will now test on a fully-fledged Kubernetes cluster on GCP (Google) and will keep you posted.