Fresh instance with Percona XtraDB Cluster Operator v1.8.0 not starting completely under OKD

I have tried multiple times to install a fresh and basic Percona XtraDB Cluster instance with the Operator provided on OperatorHub, unfortunately without success.

OKD Version: 4.7.0-0.okd-2021-04-24-103438 with OpenShift Container Storage v4.6.4 (latest)
Percona XtraDB Cluster Operator Version: 1.8.0 from operatorhub.io (latest)

Steps to reproduce:

oc create namespace pxc
(
cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: my-cluster-secrets
  namespace: pxc
type: Opaque
data:
  root: dHU3ZWl6aWVuNEVldGg3ZGFlbjZtaWV5aQo=
  xtrabackup: Ym9vTDN2YWhkMG5pRjVBZVBoMGVlamFWaQo=
  monitor: QWVwaGVlMGVleDJRdW9oc2hlaWM5U2VlNAo=
  clustercheck: b293aWV4YTZldTBlb2owYWNvb2NhM0VjaAo=
  proxyadmin: SWUzU2hhaDV0aGlleGVpRDFzaGllaDRBaQo=
  pmmserver: RmFpTmdpazFvaG43YWVuaWVmZXVwYWhzaAo=
  operator: WWVuYWlsYWlQb1dpdUhlaWdlZTFvZXZpZQo=
EOF
 ) | oc create -f -
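As an aside, the sample base64 values above appear to decode with a trailing newline (as happens when they are generated with a plain `echo`), and that newline becomes part of the password. If you generate your own values, a sketch like the following avoids that (the password shown is just a placeholder):

```shell
# Placeholder password; substitute your own.
PASS='s3cureP@ss'

# 'echo' without -n appends a newline that ends up inside the decoded
# password; printf '%s' (or echo -n) avoids that.
ENCODED=$(printf '%s' "$PASS" | base64)
printf '%s\n' "$ENCODED"

# Round-trip check: decoding must return exactly the original value.
[ "$(printf '%s' "$ENCODED" | base64 -d)" = "$PASS" ] && echo "round-trip OK"
```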

OperatorHub > Install Percona XtraDB Cluster Operator (v1.8.0)

Installed Operators > Percona XtraDB Cluster Operator → PerconaXtraDBCluster > Create PerconaXtraDBCluster > Name: cluster1 > Create

After creating the instance, the pods cluster1-haproxy-0 and cluster1-pxc-0 do not start completely:

oc -n pxc get pods
NAME                                               READY   STATUS    RESTARTS   AGE
cluster1-haproxy-0                                 1/2     Running   0          112s
cluster1-pxc-0                                     2/3     Running   0          112s
percona-xtradb-cluster-operator-598bf796f7-5k6jt   1/1     Running   0          19h

oc -n pxc logs cluster1-haproxy-0 pxc-monit
+ '[' /usr/bin/peer-list = haproxy ']'
+ exec /usr/bin/peer-list -on-change=/usr/bin/add_pxc_nodes.sh -service=cluster1-pxc
2021/05/11 09:26:54 Peer finder enter
2021/05/11 09:26:54 Determined Domain to be pxc.svc.cluster.local
2021/05/11 09:26:54 No on-start supplied, on-change /usr/bin/add_pxc_nodes.sh will be applied on start.
2021/05/11 09:26:54 lookup cluster1-pxc on 10.30.0.10:53: no such host
2021/05/11 09:26:55 lookup cluster1-pxc on 10.30.0.10:53: no such host
2021/05/11 09:26:56 lookup cluster1-pxc on 10.30.0.10:53: no such host
2021/05/11 09:26:57 lookup cluster1-pxc on 10.30.0.10:53: no such host
2021/05/11 09:26:58 lookup cluster1-pxc on 10.30.0.10:53: no such host
2021/05/11 09:26:59 lookup cluster1-pxc on 10.30.0.10:53: no such host
2021/05/11 09:27:00 lookup cluster1-pxc on 10.30.0.10:53: no such host
2021/05/11 09:27:01 lookup cluster1-pxc on 10.30.0.10:53: no such host
2021/05/11 09:27:02 lookup cluster1-pxc on 10.30.0.10:53: no such host
2021/05/11 09:27:03 lookup cluster1-pxc on 10.30.0.10:53: no such host
2021/05/11 09:27:05 lookup cluster1-pxc on 10.30.0.10:53: no such host
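The repeated `no such host` lines mean the peer-list sidecar cannot resolve the headless cluster1-pxc service. One quick way to check resolution from inside the namespace is a throwaway pod (the pod name and busybox image tag here are just examples):

```shell
# Try to resolve the headless service from a temporary pod in the
# same namespace the cluster runs in:
oc -n pxc run dnstest --rm -it --restart=Never \
  --image=busybox:1.33 -- nslookup cluster1-pxc

# A concrete per-pod DNS record should resolve as well:
oc -n pxc run dnstest --rm -it --restart=Never \
  --image=busybox:1.33 -- \
  nslookup cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local
```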

oc -n pxc get events
LAST SEEN   TYPE      REASON                   OBJECT                                 MESSAGE
23m         Warning   Unhealthy                pod/cluster1-haproxy-0                 Readiness probe failed: ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
27m         Warning   Unhealthy                pod/cluster1-haproxy-0                 Liveness probe failed: ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
18m         Warning   Unhealthy                pod/cluster1-haproxy-0                 Readiness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of ba67b440dbfab358bf8c4ca5015898b0c3b113d0b2bd652affa59ff5040860d4 is running failed: container process not found
27m         Warning   Unhealthy                pod/cluster1-pxc-0                     Readiness probe failed: ERROR 2003 (HY000): Can't connect to MySQL server on 'cluster1-pxc-0' (111)
+ [[ '' == \P\r\i\m\a\r\y ]]
+ exit 1
23m         Warning   Unhealthy                pod/cluster1-pxc-0                     Readiness probe failed: ERROR 1045 (28000): Access denied for user 'monitor'@'cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local' (using password: YES)
+ [[ '' == \P\r\i\m\a\r\y ]]
+ exit 1
22m         Warning   Unhealthy                pod/cluster1-pxc-0                     Liveness probe failed: ERROR 1045 (28000): Access denied for user 'monitor'@'cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local' (using password: YES)
+ [[ -n '' ]]
+ exit 1
18m         Warning   FailedToUpdateEndpoint            endpoints/cluster1-pxc-unready                 Failed to update endpoint pxc/cluster1-pxc-unready: Operation cannot be fulfilled on endpoints "cluster1-pxc-unready": the object has been modified; please apply your changes to the latest version and try again

Do you have any ideas to resolve this issue?

Hi,

Can you try without creating the my-cluster-secrets on a new namespace? Operator should automatically create secrets for you.
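If you let the operator generate the credentials, you can verify afterwards what it created. A sketch (the secret name follows spec.secretsName in the custom resource; my-cluster-secrets is the default in the sample cr.yaml, so treat the name below as an assumption):

```shell
# List the secrets in the cluster namespace:
oc -n pxc get secrets

# Decode one of the generated passwords, e.g. root:
oc -n pxc get secret my-cluster-secrets \
  -o jsonpath='{.data.root}' | base64 -d; echo
```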

2 Likes

I deleted the my-cluster-secrets secret (the creation of the secrets is shown in the readme on OperatorHub.io) and reinstalled the operator.

After creating a new instance, I see the same result.

oc -n pxc get pods
NAME                                               READY   STATUS    RESTARTS   AGE
cluster1-haproxy-0                                 1/2     Running   0          106s
cluster1-pxc-0                                     2/3     Running   0          106s
percona-xtradb-cluster-operator-669c94886f-g44cs   1/1     Running   0          2m43s

Instance status in PerconaXtraDBClusters is still “State: initializing”.

The pxc-monit container again shows many lines like this:
2021/05/11 10:49:12 lookup cluster1-pxc on 10.30.0.10:53: no such host

So, no changes at all without the my-cluster-secrets secret. :frowning:

1 Like

Yes, it seems something is wrong. Could you please try again on a clean namespace?

oc create namespace pxc-new
1 Like

I removed the old pxc namespace and also the Operator.

After I created a new namespace:
oc create namespace pxc-new
namespace/pxc-new created
and used this new namespace for the fresh Operator installation, it seems to work:

oc -n pxc-new get pods
NAME                                               READY   STATUS    RESTARTS   AGE
cluster1-haproxy-0                                 2/2     Running   0          6m27s
cluster1-haproxy-1                                 2/2     Running   0          4m45s
cluster1-haproxy-2                                 2/2     Running   0          4m11s
cluster1-pxc-0                                     3/3     Running   0          6m27s
cluster1-pxc-1                                     3/3     Running   0          4m45s
cluster1-pxc-2                                     3/3     Running   0          3m21s
percona-xtradb-cluster-operator-6ff787986b-gdcdl   1/1     Running   0          7m17s

Now the instance state is “State: ready”.

Thanks!
The issue seems to be resolved.

1 Like

I am running into the same issue. We are evaluating the operator and are doing a vanilla deployment on a Charmed Kubernetes cluster with rook-ceph. Initially we had modified secrets.yaml as per the documentation and then did a kubectl -n pvx apply -f secrets.yaml, but haproxy did not start:

Warning  Unhealthy  5m3s  kubelet  Readiness probe failed: ERROR 2003 (HY000): Can't connect to MySQL server on 'cluster1-pxc-0' (111)

We deleted everything including the namespace, created a new one called pxcluster, and then ran everything again without applying the secrets.yaml file.

However, when we run kubectl -n pxcluster get pods we get:

NAME                                               READY   STATUS    RESTARTS   AGE
cluster1-haproxy-0                                 2/2     Running   0          10m
cluster1-haproxy-1                                 1/2     Running   3          9m22s
cluster1-pxc-0                                     3/3     Running   0          10m
cluster1-pxc-1                                     2/3     Running   1          9m28s
percona-xtradb-cluster-operator-77bfd8cdc5-psrpb   1/1     Running   0          11m

When we describe the haproxy and cluster node we see the following:

Type     Reason                  Age                    From                     Message
----     ------                  ----                   ----                     -------
Warning  FailedScheduling        5m53s (x2 over 5m53s)  default-scheduler        0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims.
Normal   Scheduled               5m50s                  default-scheduler        Successfully assigned pxcluster/cluster1-pxc-0 to k8s-node-3
Normal   SuccessfulAttachVolume  5m50s                  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-8617f9c4-d5d5-43f4-af54-54e685b17bac"
Normal   Pulling                 5m47s                  kubelet                  Pulling image "percona/percona-xtradb-cluster-operator:1.8.0"
Normal   Started                 5m46s                  kubelet                  Started container pxc-init
Normal   Created                 5m46s                  kubelet                  Created container pxc-init
Normal   Pulled                  5m46s                  kubelet                  Successfully pulled image "percona/percona-xtradb-cluster-operator:1.8.0" in 1.298818047s
Normal   Pulling                 5m45s                  kubelet                  Pulling image "percona/percona-xtradb-cluster-operator:1.8.0-logcollector"
Normal   Pulling                 5m44s                  kubelet                  Pulling image "percona/percona-xtradb-cluster-operator:1.8.0-logcollector"
Normal   Pulled                  5m44s                  kubelet                  Successfully pulled image "percona/percona-xtradb-cluster-operator:1.8.0-logcollector" in 1.294141746s
Normal   Created                 5m44s                  kubelet                  Created container logs
Normal   Started                 5m44s                  kubelet                  Started container logs
Normal   Pulled                  5m42s                  kubelet                  Successfully pulled image "percona/percona-xtradb-cluster-operator:1.8.0-logcollector" in 1.351197875s
Normal   Created                 5m42s                  kubelet                  Created container logrotate
Normal   Started                 5m42s                  kubelet                  Started container logrotate
Normal   Pulling                 5m42s                  kubelet                  Pulling image "percona/percona-xtradb-cluster:8.0.22-13.1"
Normal   Pulled                  5m41s                  kubelet                  Successfully pulled image "percona/percona-xtradb-cluster:8.0.22-13.1" in 1.318286907s
Normal   Created                 5m41s                  kubelet                  Created container pxc
Normal   Started                 5m41s                  kubelet                  Started container pxc
Warning  Unhealthy               5m3s                   kubelet                  Readiness probe failed: ERROR 2003 (HY000): Can't connect to MySQL server on 'cluster1-pxc-0' (111)
+ [[ '' == \P\r\i\m\a\r\y ]]
+ exit 1
1 Like

Hi @Seedy_Bensouda ,

As I can see, you have two issues there. One issue is that your HAProxy pod cluster1-haproxy-1 can't connect to cluster1-pxc-0, and the other is that the pxc container in pod cluster1-pxc-1 cannot start (join the cluster). Please make sure that you don't have any communication/network issues between your k8s nodes (all needed ports are open, IPs are reachable, and so on).
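To rule out network problems between nodes, the standard PXC/Galera ports can be probed from a throwaway pod. A sketch (namespace, pod names, and image are placeholders; the port list is the usual PXC set: 3306 MySQL, 4444 SST, 4567 group replication, 4568 IST):

```shell
# Probe each Galera-related port on one PXC pod from inside the cluster.
for port in 3306 4444 4567 4568; do
  kubectl -n pxcluster run "probe-$port" --rm -i --restart=Never \
    --image=busybox:1.33 -- \
    nc -z -w 2 cluster1-pxc-0.cluster1-pxc.pxcluster.svc.cluster.local "$port" \
    && echo "port $port reachable" \
    || echo "port $port blocked"
done
```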

1 Like

Thanks for the reply, @SlavaSarzhan. I do not have any communication issues as far as I can see. Flannel with Calico is up and working, and my Ceph cluster is detecting heartbeats from all nodes. Other pods are also working fine.

Do I need to configure EmptyDir on haproxy? Could that be it?

1 Like

I don’t believe EmptyDir has anything to do with it here. HAProxy pods are stateless, so it should not be an issue.

Is there anything else specific about your k8s cluster or Operator configuration?

1 Like

Hello, I have the same issue with an instance installed following this page: Install Percona XtraDB Cluster on Kubernetes

In the yaml files I edited only the storageClass name.

The pods can't reach a fully running status:

NAME                                               READY   STATUS    RESTARTS   AGE
cluster1-haproxy-0                                 2/2     Running   1          16m
cluster1-haproxy-1                                 2/2     Running   0          11m
cluster1-haproxy-2                                 2/2     Running   0          11m
cluster1-pxc-0                                     3/3     Running   0          16m
cluster1-pxc-1                                     2/3     Running   0          11m
percona-xtradb-cluster-operator-77bfd8cdc5-5c9xm   1/1     Running   0          17m

12m Warning Unhealthy pod/cluster1-pxc-1 Readiness probe failed: ERROR 2003 (HY000): Can't connect to MySQL server on 'cluster1-pxc-1' (111)

I already tried to install in a different namespace, with and without creating the secrets…

kubernetes v1.20.6 on rancher v2.5.7

1 Like

@MarcoFan anything in the logs of the Operator and Pods?
Is the 3rd PXC pod starting at all?

1 Like

I have the same error if I install calico:

minikube start --driver=virtualbox --disable-driver-mounts --cpus=12 --memory=16096  --network-plugin=cni --cni=calico --nodes=3
Readiness probe failed: ERROR 2003 (HY000): Can't connect to MySQL server on 'my-db-pxc-db-pxc-0' (111)
+ [[ '' == \P\r\i\m\a\r\y ]]
+ exit 1

logs-from-logs-in-cluster1-pxc-0.txt (929 Bytes)
logs-from-haproxy-in-cluster1-haproxy-0.txt (647 Bytes)

 Readiness probe failed: ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
Back-off restarting failed container

1 Like

If I don't install calico, percona-xtradb-cluster works.

1 Like

Hi @andrey

I can reproduce it using the command you provided. The root of the issue is that the cluster1-pxc-0/cluster1-haproxy-0 pods can't resolve services like cluster1-pxc-unready. That is why the operator can't configure the cluster properly. It is a calico issue. As I can see, calico v3.14.1 is used by minikube, and it was released more than a year ago. I installed the latest calico v3.19.1 using the official documentation (Quickstart for Calico on minikube, using the manifest) and the issue is gone:

>kubectl get pods -l k8s-app=calico-node -n kube-system
NAME                READY   STATUS    RESTARTS   AGE
calico-node-fkwnn   1/1     Running   0          20m
calico-node-mk8dx   1/1     Running   0          19m
calico-node-z29f5   1/1     Running   0          18m

> kubectl get pods
NAME                                            READY   STATUS    RESTARTS   AGE
cluster1-haproxy-0                              2/2     Running   0          5m32s
cluster1-haproxy-1                              2/2     Running   0          3m30s
cluster1-haproxy-2                              2/2     Running   0          3m4s
cluster1-pxc-0                                  3/3     Running   0          5m32s
cluster1-pxc-1                                  3/3     Running   0          3m29s
cluster1-pxc-2                                  3/3     Running   0          117s
percona-xtradb-cluster-operator-d99c748-jhv4x   1/1     Running   0          6m16s

Also, I have tested it on scaleway k8s cluster with CNI calico and it also works. Try to use the latest version of calico and inform me about the results.
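For reference, installing the newer Calico was just a matter of applying the current manifest from the Calico quickstart (the URL below is the one from the v3.19-era docs; double-check it against the current documentation):

```shell
# Install/upgrade Calico from the quickstart manifest:
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml

# Confirm the calico-node pods come back healthy:
kubectl -n kube-system get pods -l k8s-app=calico-node
```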

1 Like

@SlavaSarzhan So I managed to get this working. The issue originally was that I was trying to define secrets instead of allowing the operator to create them itself. Things worked and I moved on. I came back today to do some maintenance and noticed that the issue had come back.

NAME                                                     READY   STATUS                      RESTARTS   AGE
68e50-daily-backup-1627084800-ps25h                      0/1     Completed                   0          12h
68e50-sat-night-backup-1627084800-q2wx6                  0/1     Completed                   0          12h
cluster1-haproxy-0                                       1/2     Running                     6          19m
cluster1-haproxy-1                                       1/2     CrashLoopBackOff            11903      53d
cluster1-haproxy-2                                       1/2     CrashLoopBackOff            11903      53d
cluster1-pxc-0                                           3/3     Running                     19         53d
cluster1-pxc-1                                           2/3     CrashLoopBackOff            7          20m
percona-xtradb-cluster-operator-77bfd8cdc5-9r6vr         1/1     Running                     1          53d
xb-cron-cluster1-s3-us-west-20210605000008-3d2dv-hxwjd   0/1     CreateContainerConfigError  0          45d

I am using calico v3.19.1 as shown below.
kubectl calico version
Client Version: v3.19.1
Git commit: 6fc0db96
Unable to retrieve Cluster Version or Type: resource does not exist: ClusterInformation(default) with error: the server could not find the requested resource (get ClusterInformations.crd.projectcalico.org default)

I did some more digging in the logs and found the following. It looks like there was an attempt by Galera to open a connection, and that failed.

[0] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130432.413401552, {"log"=>"2021-07-24T12:40:32.412880Z 0 [Warning] [MY-000000] [Galera] last inactive check more than PT1.5S (3*evs.inactive_check_period) ago (PT3.50417S), skipping check"}]
[0] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130461.921485610, {"log"=>"2021-07-24T12:41:01.920841Z 0 [Note] [MY-000000] [Galera] PC protocol downgrade 1 -> 0"}]
[1] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130461.921909299, {"log"=>"2021-07-24T12:41:01.921460Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node"}]
[2] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130461.921911826, {"log"=>"view ((empty))"}]
[3] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130461.922410405, {"log"=>"2021-07-24T12:41:01.922374Z 0 [ERROR] [MY-000000] [Galera] failed to open gcomm backend connection: 110: failed to reach primary view (pc.wait_prim_timeout): 110 (Connection timed out)"}]
[4] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130461.922412800, {"log"=>" at gcomm/src/pc.cpp:connect():161"}]
[5] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130461.922487606, {"log"=>"2021-07-24T12:41:01.922428Z 0 [ERROR] [MY-000000] [Galera] gcs/src/gcs_core.cpp:gcs_core_open():220: Failed to open backend connection: -110 (Connection timed out)"}]
[0] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130462.922868257, {"log"=>"2021-07-24T12:41:02.922714Z 0 [Note] [MY-000000] [Galera] gcomm: terminating thread"}]
[1] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130462.922874019, {"log"=>"2021-07-24T12:41:02.922822Z 0 [Note] [MY-000000] [Galera] gcomm: joining thread"}]
[2] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130462.923158562, {"log"=>"2021-07-24T12:41:02.923073Z 0 [ERROR] [MY-000000] [Galera] gcs/src/gcs.cpp:gcs_open():1754: Failed to open channel 'cluster1-pxc' at 'gcomm://10.1.86.126': -110 (Connection timed out)"}]
[3] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130462.923258685, {"log"=>"2021-07-24T12:41:02.923175Z 0 [ERROR] [MY-000000] [Galera] gcs connect failed: Connection timed out"}]
[4] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130462.923260499, {"log"=>"2021-07-24T12:41:02.923219Z 0 [ERROR] [MY-000000] [WSREP] Provider/Node (gcomm://10.1.86.126) failed to establish connection with cluster (reason: 7)"}]
[5] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130462.923345885, {"log"=>"2021-07-24T12:41:02.923255Z 0 [ERROR] [MY-010119] [Server] Aborting"}]
[6] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130462.923724901, {"log"=>"2021-07-24T12:41:02.923666Z 0 [System] [MY-010910] [Server] /usr/sbin/mysqld: Shutdown complete (mysqld 8.0.22-13.1) Percona XtraDB Cluster (GPL), Release rel13, Revision a48e6d5, WSREP version 26.4.3."}]
[7] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130462.924285826, {"log"=>"2021-07-24T12:41:02.924248Z 0 [Note] [MY-000000] [Galera] dtor state: CLOSED"}]
[8] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130462.924356763, {"log"=>"2021-07-24T12:41:02.924329Z 0 [Note] [MY-000000] [Galera] MemPool(TrxHandleSlave): hit ratio: 0, misses: 0, in use: 0, in pool: 0"}]

1 Like

Thanks for the quick response. The problem was with CoreDNS on minikube. I created a simple Pod to use as a test environment:

apiVersion: v1
kind: Pod
metadata:
  name: dnsutils
  namespace: default
spec:
  containers:
  - name: dnsutils
    image: gcr.io/kubernetes-e2e-test-images/dnsutils:1.3
    command:
      - sleep
      - "3600"
    imagePullPolicy: IfNotPresent
  restartPolicy: Always
  nodeName: minikube-m02

I installed it on minikube-m02 to check DNS and found out that DNS does not work on minikube-m02; both lookups fail:

kubernetes.default: not found
cluster1-pxc-0: not found

I just reloaded CoreDNS on minikube

└$► kubectl get pods --namespace=kube-system -l k8s-app=kube-dns
NAME                      READY   STATUS    RESTARTS   AGE
coredns-74ff55c5b-hfklv   1/1     Running   0          38m
└$► kubectl delete pod coredns-74ff55c5b-hfklv -n kube-system

Percona XtraDB Cluster on Minikube (minikube start --driver=virtualbox --disable-driver-mounts --cpus=12 --memory=16096 --network-plugin=cni --cni=calico --nodes=3) is working

Just restart CoreDNS on minikube

It makes no difference whether I run minikube with three nodes at once or add them one at a time. CoreDNS works only on minikube-m00.
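Deleting the CoreDNS pod works because its Deployment recreates it; a rollout restart achieves the same thing without looking up the pod name. A sketch, assuming the standard coredns Deployment in kube-system:

```shell
# Restart CoreDNS without referencing a specific pod name:
kubectl -n kube-system rollout restart deployment coredns
kubectl -n kube-system rollout status deployment coredns

# Re-test resolution from the previously failing node:
kubectl exec -it dnsutils -- nslookup cluster1-pxc
```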

2 Likes

@Andrey Just to give more insight: I am using Ubuntu with Charmed Kubernetes. I checked, and my DNS server is working as expected. The DNS test is as follows:

kubectl exec -i -t dnsutils -- nslookup cluster1-pxc.pxcluster
Server: 10.152.183.14
Address: 10.152.183.14#53

Name: cluster1-pxc.pxcluster.svc.cluster.local
Address: 10.1.86.126

1 Like

I don’t have much experience, but it seems to me that you still have a problem with DNS.

└$► kubectl exec -i -t dnsutils -- nslookup cluster1-pxc
Server:		10.96.0.10
Address:	10.96.0.10#53

Name:	cluster1-pxc.default.svc.cluster.local
Address: 10.244.205.194
Name:	cluster1-pxc.default.svc.cluster.local
Address: 10.244.151.3
Name:	cluster1-pxc.default.svc.cluster.local
Address: 10.244.120.66
1 Like

Not sure I see the issue. Normally you have to specify the namespace to resolve it, and in this case dnsutils is running in the default namespace while the Percona cluster is running in the pxcluster namespace.

1 Like

Is there any further suggestion on this that I can try?

1 Like