PXC cluster not starting in kind

Hey,
I’ve been trying to set up a Percona PXC cluster for CI purposes for a few days now.
I use kind to set up the k8s cluster.
Env details:
os: Ubuntu 22.04
k8s: v1.25.2
kind: 0.16.0

I’ve installed Percona PXC following the blog post here https://www.percona.com/blog/testing-percona-distribution-for-mysql-operator-locally-with-kind/ without any modifications (the only difference is that I use Ubuntu).
It seems that there is a networking issue although I’m not 100% sure.

As you can see, I installed the Cilium CNI, but it wasn’t working with kindnet before either; I was basically getting the same errors.

It seems that mysqld started correctly (cluster1-pxc-0 logs):

2023-02-12T23:33:05.996245Z 0 [Warning] [MY-010097] [Server] Insecure configuration for --secure-log-path: Current value does not restrict locatio
2023-02-12T23:33:05.996290Z 0 [Warning] [MY-000000] [WSREP] Node is not a cluster node. Disabling pxc_strict_mode
2023-02-12T23:33:05.997592Z 0 [System] [MY-010116] [Server] /usr/sbin/mysqld (mysqld 8.0.29-21.1) starting as process 1
2023-02-12T23:33:05.997669Z 0 [ERROR] [MY-010338] [Server] Can't find error-message file '/usr/share/percona-xtradb-cluster/errmsg.sys'. Check err
2023-02-12T23:33:06.001039Z 0 [Warning] [MY-010068] [Server] CA certificate ca.pem is self signed.
2023-02-12T23:33:06.001096Z 0 [System] [MY-013602] [Server] Channel mysql_main configured to support TLS. Encrypted connections are now supported
2023-02-12T23:33:06.008611Z 1 [System] [MY-013576] [InnoDB] InnoDB initialization has started.
2023-02-12T23:33:06.237670Z 1 [System] [MY-013577] [InnoDB] InnoDB initialization has ended.
2023-02-12T23:33:06.342052Z 1 [Note] [MY-000000] [WSREP] wsrep_init_schema_and_SR (nil)
2023-02-12T23:33:06.347982Z 1 [System] [MY-000000] [WSREP] PXC upgrade completed successfully
2023-02-12T23:33:06.493585Z 0 [Warning] [MY-010068] [Server] CA certificate ca.pem is self signed.
2023-02-12T23:33:06.493614Z 0 [System] [MY-013602] [Server] Channel mysql_main configured to support TLS. Encrypted connections are now supported
2023-02-12T23:33:06.499217Z 0 [Warning] [MY-011810] [Server] Insecure configuration for --pid-file: Location '/var/lib/mysql' in the path is acces
2023-02-12T23:33:06.513909Z 0 [Note] [MY-000000] [WSREP] Initialized wsrep sidno 2
2023-02-12T23:33:06.513940Z 0 [Note] [MY-000000] [Galera] Loading provider none initial position: 00000000-0000-0000-0000-000000000000:-1
2023-02-12T23:33:06.513953Z 0 [Note] [MY-000000] [Galera] wsrep_load(): loading provider library 'none'
2023-02-12T23:33:06.514877Z 0 [System] [MY-011323] [Server] X Plugin ready for connections. Bind-address: '::' port: 33060, socket: /var/lib/mysql
2023-02-12T23:33:06.514953Z 0 [System] [MY-010931] [Server] /usr/sbin/mysqld: ready for connections. Version: '8.0.29-21.1'  socket: '/var/lib/mys
Stream closed EOF for default/cluster1-pxc-0 (pxc)

The errors that I’m getting:

  1. cluster1-haproxy-0:pxc-monit

2023/02/12 23:36:12 lookup cluster1-pxc on 10.96.0.10:53: no such host
2023/02/12 23:36:13 lookup cluster1-pxc on 10.96.0.10:53: no such host

  2. Events from the pod:

  Normal   Pulled     56m                  kubelet  Successfully pulled image "percona/percona-xtradb-cluster-operator:1.12.0-haproxy" in 1.025621516s
  Normal   Pulling    31m (x11 over 71m)   kubelet  Pulling image "percona/percona-xtradb-cluster-operator:1.12.0-haproxy"
  Warning  Unhealthy  16m (x460 over 71m)  kubelet  Readiness probe failed: ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
  Normal   Pulled     6m35s (x8 over 42m)  kubelet  (combined from similar events): Successfully pulled image "percona/percona-xtradb-cluster-operator:1.12.0-haproxy" in 1.118248913s
  Warning  BackOff    86s (x110 over 47m)  kubelet  Back-off restarting failed container

  3. Events from the cluster1-pxc-0 pod:

Events:
  Type     Reason     Age                    From     Message
  ----     ------     ----                   ----     -------
  Normal   Pulling    18m (x11 over 73m)     kubelet  Pulling image "percona/percona-xtradb-cluster:8.0.29-21.1"
  Warning  Unhealthy  2m35s (x203 over 72m)  kubelet  Readiness probe failed: ERROR 2003 (HY000): Can't connect to MySQL server on '10.244.2.46:33062' (111)
+ [[ '' == \P\r\i\m\a\r\y ]]
+ exit 1

I’ve spotted a similar issue here - Fresh instance with Percona XtraDB Cluster Operator v1.8.0 not starting completly under OKD - #21 by pavloos - but it is from two years ago; I tried everything suggested there, and it seems that the workaround has already been applied.

I’ve restarted CoreDNS and installed the Cilium CNI instead of kindnet, but ended up facing the same issues.
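
In case it helps reproduce, a quick way to check whether the cluster1-pxc service name resolves inside the cluster at all (just a sketch, assuming the busybox image can be pulled) is:

kubectl run -it --rm dns-test --image=busybox --restart=Never -- nslookup cluster1-pxc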

Could you please take a look or at least give me some advice on how to debug it further?

Best regards,
Tomek

I also tried minikube 1.28.0, as it is a supported platform for Operator 1.12.0, but the result is the same. Could you please take a look?

Hey @tomku ,

I did the following:

  1. Created a clean Ubuntu instance
  2. Installed the latest minikube - 1.29.0
  3. Deployed the operator:
kubectl apply -f https://raw.githubusercontent.com/percona/percona-xtradb-cluster-operator/v1.12.0/deploy/bundle.yaml
  4. Deployed the minimal cluster (as I have a single-node minikube):
kubectl apply -f https://raw.githubusercontent.com/percona/percona-xtradb-cluster-operator/v1.12.0/deploy/cr-minimal.yaml

PXC is up and running.
I used our instructions for minikube: Install on Minikube - Percona Operator for MySQL based on Percona XtraDB Cluster

What steps are you executing? Which manifests do you apply?

Hey @Sergey_Pronin ,

Thank you very much for your response.
I used the very same commands.
To create the cluster:

minikube start --memory=4096 --cpus=3

Then bundle and cr-minimal.
I’ve also upgraded minikube to 1.29. The result is:

Describe of the minimal-cluster-pxc-0 pod:

Do you have any ideas on how to debug further?

Hello Sergey,

I appreciate your efforts in investigating the matter. I have encountered the very same issue and attempted to replicate your solution by following the instructions outlined in the guide you shared.

To elaborate, I have utilized the latest version of minikube (v1.29.0), bootstrapped the cluster with the command “minikube start --memory=4096 --cpus=3”, and deployed the operator and Percona cluster using the manifests supplied in the documentation. Despite these steps, the error persists.

Runtime
OS:

NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"

minikube

➜  ~ minikube version                     
minikube version: v1.29.0
commit: ddac20b4b34a9c8c857fc602203b6ba2679794d3

Docker

➜  ~ docker version
Client: Docker Engine - Community
 Version:           20.10.3
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        48d30b5
 Built:             Fri Jan 29 14:33:13 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

@Sergey_Pronin In the meantime, I’ve compared the logs from macOS, where everything seems to be fine, with the ones from Ubuntu.
It seems that on Ubuntu everything goes smoothly until it gets stuck in a wait loop, and the container is then restarted after the timeout. The restart ends up with mysqld starting correctly, but without the admin endpoint that is the target of the readiness probe.
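
A quick way to see what the readiness probe actually runs, and to confirm that nothing is listening on the admin port inside the pod (just a sketch — it assumes bash is available in the pxc container and uses the pod name from cr-minimal):

kubectl get pod minimal-cluster-pxc-0 -o jsonpath='{.spec.containers[?(@.name=="pxc")].readinessProbe}'
kubectl exec minimal-cluster-pxc-0 -c pxc -- bash -c 'exec 3<>/dev/tcp/127.0.0.1/33062 && echo "admin port open"'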

I’m attaching the logs:
macos_minikub129_crminimal_percona112.log (57.2 KB)
ubuntu22_minikube129_crminimal_percona112.log (19.8 KB)
ubuntu22_minikube129_crminimal_percona112_second_run.log (27.5 KB)

Hey!

Which error do you see?

I checked the logs and don’t see anything critical.

We have a tool called pt-k8s-debug-collector: pt-k8s-debug-collector — Percona Toolkit Documentation

Can you run it and share the results? It will capture all the information about all resources in the cluster related to PXC. It will be easier to analyze.
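
For example, something along these lines (just a sketch — please check the Toolkit documentation for the exact flags available in your version):

pt-k8s-debug-collector --namespace default --resource pxc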

Thanks for checking!

Hey @Sergey_Pronin , have you had a chance to look at the dump?

Hi,

I am running into the exact same issue just now. I already have a setup of the operator that works on my laptop which I installed with helm. Following the steps from the docs on helm leads me to this issue. Something has changed since Jan 6 2023 when I last had to set it up. I see that there has been a new helm chart version since that date which is v1.12.1. On my running setup, it is v1.12.0.
I tried using that older version just in case but the result is still the same. I also got the exact same cr.yaml file that I used for the successful deployment from git and applied it against both v1.12.0 and v1.12.1 but no luck.
Something weird is happening: based on my version control, I am doing the exact same thing with the exact same versions of everything as I did on Jan 6 2023, but I am getting a different result. Could there have been a change in the helm chart without bumping the version (backporting something, perhaps)?
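
To rule that out, one idea (just a sketch) would be to list the published chart versions and diff a freshly pulled chart against the one used in January:

helm search repo percona/pxc-db --versions
helm pull percona/pxc-db --version 1.12.0 --untar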

System:

# uname -a
Linux pop-os 6.0.12-76060006-generic #202212290932~1674139725~22.04~ca93ccf SMP PREEMPT_DYNAMIC Thu J x86_64 x86_64 x86_64 GNU/Linux

Steps to reproduce:
With helm

kind create cluster --config kind-config.yaml
# kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: redstone
nodes:
  - role: control-plane
    image: kindest/node:v1.23.13
  - role: worker
    image: kindest/node:v1.23.13
  - role: worker
    image: kindest/node:v1.23.13
  - role: worker
    image: kindest/node:v1.23.13
# I have also tested with v1.22.15 -> same issue

I have already added the percona helm repo :slight_smile: and done a helm repo update.

helm install my-xtdb-op percona/pxc-operator

which gets the operator up and running and installs CRDs:

NAME                                       READY   STATUS    RESTARTS   AGE
my-xtdb-op-pxc-operator-85c45b4549-qvn2d   1/1     Running   0          6s
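
The CRDs themselves can be double-checked with something like:

kubectl get crd | grep -i percona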

and then:

helm install my-db percona/pxc-db \                           
--set haproxy.enabled=false \
--set proxysql.enabled=true \
--set logcollector.enabled=false

# I am using proxysql but I also got the exact same issue without setting anything custom and using haproxy.

This gets us here:

❯ kubectl get pods,pxc
NAME                                           READY   STATUS    RESTARTS   AGE
pod/my-db-pxc-db-proxysql-0                    3/3     Running   0          115s
pod/my-db-pxc-db-proxysql-1                    3/3     Running   0          107s
pod/my-db-pxc-db-proxysql-2                    3/3     Running   0          97s
pod/my-db-pxc-db-pxc-0                         0/1     Running   0          115s
pod/my-xtdb-op-pxc-operator-85c45b4549-qvn2d   1/1     Running   0          4m33s

NAME                                                ENDPOINT                            STATUS         PXC   PROXYSQL   HAPROXY   AGE
perconaxtradbcluster.pxc.percona.com/my-db-pxc-db   my-db-pxc-db-proxysql-unready.ivo   initializing         3                    116s

and the my-db-pxc-db-pxc-0 pod is spamming:

+ '[' '' = Synced ']'
+ echo 'MySQL init process in progress...'
+ sleep 1
MySQL init process in progress...
+ for i in {120..0}
++ echo 'SELECT variable_value FROM performance_schema.global_status WHERE variable_name='\''wsrep_local_state_comment'\'''
++ mysql --protocol=socket -uroot -hlocalhost --socket=/var/lib/mysql/mysql.sock --password= -s
+ wsrep_local_state=
MySQL init process in progress...
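
That loop is the entrypoint polling wsrep_local_state_comment until it reports Synced; a manual equivalent from inside the pod, using the same socket and query as in the trace above (note the empty password is copied from the trace and may only work while the init process is still running), would be roughly:

kubectl exec my-db-pxc-db-pxc-0 -c pxc -- mysql --protocol=socket -uroot -hlocalhost --socket=/var/lib/mysql/mysql.sock --password= -s -e "SELECT variable_value FROM performance_schema.global_status WHERE variable_name='wsrep_local_state_comment'"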

After some time (or if I kill it), pod my-db-pxc-db-pxc-0 gets restarted and the logs change. After the first restart the pxc-0 pod is sitting at:

2023-02-28T06:54:37.828344Z 0 [Note] [MY-000000] [Galera] wsrep_load(): loading provider library 'none'
2023-02-28T06:54:37.830008Z 0 [System] [MY-011323] [Server] X Plugin ready for connections. Bind-address: '::' port: 33060, socket: /var/lib/mysql/mysqlx.sock
2023-02-28T06:54:37.830076Z 0 [System] [MY-010931] [Server] /usr/sbin/mysqld: ready for connections. Version: '8.0.29-21.1'  socket: '/var/lib/mysql/mysql.sock'

but kubernetes is not marking the pod as ready, due to a failing readiness check that looks like this:

Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  4m34s                  default-scheduler  Successfully assigned ivo/my-db-pxc-db-pxc-0 to redstone-worker3
  Normal   Pulling    4m34s                  kubelet            Pulling image "percona/percona-xtradb-cluster-operator:1.12.0"
  Normal   Pulled     4m33s                  kubelet            Successfully pulled image "percona/percona-xtradb-cluster-operator:1.12.0" in 800.88027ms
  Normal   Created    4m33s                  kubelet            Created container pxc-init
  Normal   Started    4m33s                  kubelet            Started container pxc-init
  Normal   Pulled     4m31s                  kubelet            Successfully pulled image "percona/percona-xtradb-cluster:8.0.29-21.1" in 760.944329ms
  Normal   Pulling    2m11s (x2 over 4m31s)  kubelet            Pulling image "percona/percona-xtradb-cluster:8.0.29-21.1"
  Normal   Pulled     2m11s                  kubelet            Successfully pulled image "percona/percona-xtradb-cluster:8.0.29-21.1" in 758.547316ms
  Normal   Created    2m10s (x2 over 4m31s)  kubelet            Created container pxc
  Normal   Started    2m10s (x2 over 4m30s)  kubelet            Started container pxc
  Warning  Unhealthy  4s (x10 over 4m4s)     kubelet            Readiness probe failed: ERROR 2003 (HY000): Can't connect to MySQL server on '10.244.1.11:33062' (111)
+ [[ '' == \P\r\i\m\a\r\y ]]
+ exit 1

No luck :frowning:

EDIT 2:
The first and only pxc pod gets stuck right before it needs to open the Admin interface, which is on port 33062. The logs on a normally running instance look something like this:

....
2023-02-28T06:54:37.828344Z 0 [Note] [MY-000000] [Galera] wsrep_load(): loading provider library 'none'
2023-02-28T06:54:37.830008Z 0 [System] [MY-011323] [Server] X Plugin ready for connections. Bind-address: '::' port: 33060, socket: /var/lib/mysql/mysqlx.sock
2023-02-28T06:54:37.830076Z 0 [System] [MY-010931] [Server] /usr/sbin/mysqld: ready for connections. Version: '8.0.29-21.1'  socket: '/var/lib/mysql/mysql.sock'

# This is the key part that we are missing. Of course, this log line is from a totally different, working instance that I am scared to touch at the moment.
2023-02-28T12:39:11.383051Z 0 [System] [MY-013292] [Server] Admin interface ready for connections, address: '10.244.2.8'  port: 33062
....

Because the PXC pod does not open its Admin interface on port 33062, the readiness check cannot pass, since it queries the Admin interface. PXC pods are spawned sequentially, so if the first one never becomes ready, the remaining pods are never created and the cluster never forms.
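
A rough manual equivalent of what the probe output above suggests it is doing (just a sketch — <root-password> is a placeholder for whatever credentials your cluster secret holds, and the actual probe logic may differ):

kubectl exec my-db-pxc-db-pxc-0 -c pxc -- mysql -h127.0.0.1 -P33062 -uroot -p'<root-password>' -e "SHOW STATUS LIKE 'wsrep_cluster_status'"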

I have the feeling the following is happening but I don’t know why atm:

  1. First PXC pod loops on MySQL init process in progress... and eventually gets killed by k8s
    1.1. Just in case they are related: https://forums.percona.com/t/bug-in-entry-entrypoint-sh-in-docker-image-based-setup/19585/2
  2. On the first restart of the pod, because the init process did not finish cleanly, it does not open the Admin interface
  3. No Admin Interface on port 33062 → no ready check → no cluster

After this, I went straight to the source and deployed directly from the repo at tag v1.12.0, just as mentioned in the docs.

git clone -b v1.12.0 https://github.com/percona/percona-xtradb-cluster-operator
kubectl apply -f deploy/bundle.yaml
kubectl apply -f deploy/cr.yaml

Unfortunately, this also leads to the same behaviour: as the first time, the pod is spamming MySQL init process in progress, and after a restart mysqld gets ready for connections, but the readiness check from kubernetes does not go through → the pod never gets ready, etc.

I am a bit at a loss here :frowning:

EDIT 1:
I just tested with minikube instead of kind and I get the same issue. I will now test on a fully-fledged kubernetes cluster on GCP (google) and will keep you posted.