I have tried multiple times to install a fresh, basic Percona XtraDB Cluster instance with the Operator provided by OperatorHub, unfortunately without success.
OKD Version: 4.7.0-0.okd-2021-04-24-103438 with OpenShift Container Storage v4.6.4 (latest)
Percona XtraDB Cluster Operator Version: 1.8.0 from operatorhub.io (latest)
After creating the instance, the pods cluster1-haproxy-0 and cluster1-pxc-0 never become fully ready:
oc -n pxc get pods
NAME READY STATUS RESTARTS AGE
cluster1-haproxy-0 1/2 Running 0 112s
cluster1-pxc-0 2/3 Running 0 112s
percona-xtradb-cluster-operator-598bf796f7-5k6jt 1/1 Running 0 19h
oc -n pxc logs cluster1-haproxy-0 -c pxc-monit
+ '[' /usr/bin/peer-list = haproxy ']'
+ exec /usr/bin/peer-list -on-change=/usr/bin/add_pxc_nodes.sh -service=cluster1-pxc
2021/05/11 09:26:54 Peer finder enter
2021/05/11 09:26:54 Determined Domain to be pxc.svc.cluster.local
2021/05/11 09:26:54 No on-start supplied, on-change /usr/bin/add_pxc_nodes.sh will be applied on start.
2021/05/11 09:26:54 lookup cluster1-pxc on 10.30.0.10:53: no such host
2021/05/11 09:26:55 lookup cluster1-pxc on 10.30.0.10:53: no such host
2021/05/11 09:26:56 lookup cluster1-pxc on 10.30.0.10:53: no such host
2021/05/11 09:26:57 lookup cluster1-pxc on 10.30.0.10:53: no such host
2021/05/11 09:26:58 lookup cluster1-pxc on 10.30.0.10:53: no such host
2021/05/11 09:26:59 lookup cluster1-pxc on 10.30.0.10:53: no such host
2021/05/11 09:27:00 lookup cluster1-pxc on 10.30.0.10:53: no such host
2021/05/11 09:27:01 lookup cluster1-pxc on 10.30.0.10:53: no such host
2021/05/11 09:27:02 lookup cluster1-pxc on 10.30.0.10:53: no such host
2021/05/11 09:27:03 lookup cluster1-pxc on 10.30.0.10:53: no such host
2021/05/11 09:27:05 lookup cluster1-pxc on 10.30.0.10:53: no such host
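For reference, one way to confirm whether this is a DNS problem rather than an operator problem (a rough sketch, assuming the pxc namespace, the default service names, and that getent is available in the pxc image):
oc -n pxc get svc
oc -n pxc exec cluster1-pxc-0 -c pxc -- getent hosts cluster1-pxc
oc -n pxc exec cluster1-pxc-0 -c pxc -- getent hosts cluster1-pxc-unready
If the services exist but the lookups fail from inside the pod, the problem is in cluster DNS/CNI rather than in the operator itself.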
oc -n pxc get events
LAST SEEN TYPE REASON OBJECT MESSAGE
23m Warning Unhealthy pod/cluster1-haproxy-0 Readiness probe failed: ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
27m Warning Unhealthy pod/cluster1-haproxy-0 Liveness probe failed: ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
18m Warning Unhealthy pod/cluster1-haproxy-0 Readiness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of ba67b440dbfab358bf8c4ca5015898b0c3b113d0b2bd652affa59ff5040860d4 is running failed: container process not found
27m Warning Unhealthy pod/cluster1-pxc-0 Readiness probe failed: ERROR 2003 (HY000): Can't connect to MySQL server on 'cluster1-pxc-0' (111)
+ [[ '' == \P\r\i\m\a\r\y ]]
+ exit 1
23m Warning Unhealthy pod/cluster1-pxc-0 Readiness probe failed: ERROR 1045 (28000): Access denied for user 'monitor'@'cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local' (using password: YES)
+ [[ '' == \P\r\i\m\a\r\y ]]
+ exit 1
22m Warning Unhealthy pod/cluster1-pxc-0 Liveness probe failed: ERROR 1045 (28000): Access denied for user 'monitor'@'cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local' (using password: YES)
+ [[ -n '' ]]
+ exit 1
18m Warning FailedToUpdateEndpoint endpoints/cluster1-pxc-unready Failed to update endpoint pxc/cluster1-pxc-unready: Operation cannot be fulfilled on endpoints "cluster1-pxc-unready": the object has been modified; please apply your changes to the latest version and try again
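One thing I considered (a guess on my part, since I had already tried the install several times in this namespace) is that the "Access denied" errors for the monitor user could come from leftovers of a previous attempt, so I checked for old secrets and PVCs:
oc -n pxc get secrets
oc -n pxc get pvc
A PVC that still holds an old data directory keeps the previous passwords even if the secrets are regenerated.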
I removed the old pxc namespace and also the Operator.
Then I created a new namespace:
oc create namespace pxc-new
namespace/pxc-new created
and used this new namespace for the fresh Operator installation. Now it seems to work.
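For reference, the cleanup before the retry was roughly the following (a sketch; the custom resource is deleted via its pxc short name, and the Operator itself was uninstalled through OperatorHub):
oc -n pxc delete pxc cluster1
oc -n pxc delete pvc --all
oc delete namespace pxc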
I am running into the same issue. We are evaluating the operator and are doing a vanilla deployment on a Charmed Kubernetes cluster with rook-ceph. Initially we modified secrets.yaml as per the documentation and ran kubectl -n pvx apply -f secrets.yaml, but haproxy did not start:
Warning Unhealthy 5m3s kubelet Readiness probe failed: ERROR 2003 (HY000): Can't connect to MySQL server on 'cluster1-pxc-0' (111)
We deleted everything including the namespace, created a new one called pxcluster, and then ran everything again without applying the secrets.yaml file.
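Since we skipped secrets.yaml this time, the operator generated the user passwords itself; they can be read back from the users secret afterwards (a sketch, assuming the default secretsName my-cluster-secrets from cr.yaml; adjust if you changed it):
kubectl -n pxcluster get secrets
kubectl -n pxcluster get secret my-cluster-secrets -o jsonpath='{.data.root}' | base64 -d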
However, when we run kubectl -n pxcluster get pods we get:
NAME READY STATUS RESTARTS AGE
cluster1-haproxy-0 2/2 Running 0 10m
cluster1-haproxy-1 1/2 Running 3 9m22s
cluster1-pxc-0 3/3 Running 0 10m
cluster1-pxc-1 2/3 Running 1 9m28s
percona-xtradb-cluster-operator-77bfd8cdc5-psrpb 1/1 Running 0 11m
When we describe the haproxy and pxc pods, we see the following:
Type Reason Age From Message
Warning FailedScheduling 5m53s (x2 over 5m53s) default-scheduler 0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims.
Normal Scheduled 5m50s default-scheduler Successfully assigned pxcluster/cluster1-pxc-0 to k8s-node-3
Normal SuccessfulAttachVolume 5m50s attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-8617f9c4-d5d5-43f4-af54-54e685b17bac"
Normal Pulling 5m47s kubelet Pulling image "percona/percona-xtradb-cluster-operator:1.8.0"
Normal Started 5m46s kubelet Started container pxc-init
Normal Created 5m46s kubelet Created container pxc-init
Normal Pulled 5m46s kubelet Successfully pulled image "percona/percona-xtradb-cluster-operator:1.8.0" in 1.298818047s
Normal Pulling 5m45s kubelet Pulling image "percona/percona-xtradb-cluster-operator:1.8.0-logcollector"
Normal Pulling 5m44s kubelet Pulling image "percona/percona-xtradb-cluster-operator:1.8.0-logcollector"
Normal Pulled 5m44s kubelet Successfully pulled image "percona/percona-xtradb-cluster-operator:1.8.0-logcollector" in 1.294141746s
Normal Created 5m44s kubelet Created container logs
Normal Started 5m44s kubelet Started container logs
Normal Pulled 5m42s kubelet Successfully pulled image "percona/percona-xtradb-cluster-operator:1.8.0-logcollector" in 1.351197875s
Normal Created 5m42s kubelet Created container logrotate
Normal Started 5m42s kubelet Started container logrotate
Normal Pulling 5m42s kubelet Pulling image "percona/percona-xtradb-cluster:8.0.22-13.1"
Normal Pulled 5m41s kubelet Successfully pulled image "percona/percona-xtradb-cluster:8.0.22-13.1" in 1.318286907s
Normal Created 5m41s kubelet Created container pxc
Normal Started 5m41s kubelet Started container pxc
Warning Unhealthy 5m3s kubelet Readiness probe failed: ERROR 2003 (HY000): Can't connect to MySQL server on 'cluster1-pxc-0' (111)
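The early FailedScheduling warning about unbound PersistentVolumeClaims cleared once the volume was attached, but if it lingers it is worth verifying that the PVCs actually bind and that rook-ceph provides the expected (default) StorageClass, e.g.:
kubectl -n pxcluster get pvc
kubectl get storageclass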
As I can see, you have two issues there. One is that your HAProxy 'cluster1-haproxy-1' pod can't connect to cluster1-pxc-0, and the other is that the pxc container in pod cluster1-pxc-1 cannot start (join the cluster). Please make sure that you don't have any communication/network issues between the k8s nodes (all needed ports are open, IPs are reachable, and so on).
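For reference, PXC/Galera needs at least ports 3306 (MySQL), 4444 (SST), 4567 (group replication) and 4568 (IST) reachable between the pxc pods. A quick hypothetical check from inside one pod, using bash's /dev/tcp and the peer's headless-service DNS name (adjust pod, service and namespace names to yours):
kubectl -n pxcluster exec cluster1-pxc-0 -c pxc -- timeout 5 bash -c '</dev/tcp/cluster1-pxc-1.cluster1-pxc.pxcluster.svc.cluster.local/4567 && echo 4567 reachable'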
Thanks for the reply @Slava_Sarzhan. I do not have any communication issues as far as I can see. Flannel and Calico are up and both working, my Ceph cluster is detecting heartbeats from all nodes, and other pods are working fine.
Do I need to configure EmptyDir on haproxy? Could that be it?
Readiness probe failed: ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
Back-off restarting failed container
I can reproduce it using the command you provided. The root of the issue is that the cluster1-pxc-0/cluster1-haproxy-0 pods can't resolve services like cluster1-pxc-unready, which is why the operator can't configure the cluster properly. It is a Calico issue: minikube ships Calico v3.14.1, which was released more than a year ago. I installed the latest Calico v3.19.1 following the official documentation (Quickstart for Calico on minikube, using the manifest) and the issue is gone:
>kubectl get pods -l k8s-app=calico-node -n kube-system
NAME READY STATUS RESTARTS AGE
calico-node-fkwnn 1/1 Running 0 20m
calico-node-mk8dx 1/1 Running 0 19m
calico-node-z29f5 1/1 Running 0 18m
> kubectl get pods
NAME READY STATUS RESTARTS AGE
cluster1-haproxy-0 2/2 Running 0 5m32s
cluster1-haproxy-1 2/2 Running 0 3m30s
cluster1-haproxy-2 2/2 Running 0 3m4s
cluster1-pxc-0 3/3 Running 0 5m32s
cluster1-pxc-1 3/3 Running 0 3m29s
cluster1-pxc-2 3/3 Running 0 117s
percona-xtradb-cluster-operator-d99c748-jhv4x 1/1 Running 0 6m16s
Also, I have tested it on a Scaleway k8s cluster with the Calico CNI and it works there as well. Try to use the latest version of Calico and let me know the results.
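If you are not sure which Calico version is actually running on the nodes, you can read it from the node daemonset image (assuming the standard manifest install, i.e. a calico-node daemonset in kube-system):
kubectl -n kube-system get daemonset calico-node -o jsonpath='{.spec.template.spec.containers[0].image}'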
@Slava_Sarzhan so I managed to get this working. The issue originally was that I was trying to define the secrets myself instead of letting the operator create them. Things worked and I moved on, but when I came back today to do some maintenance I noticed that the issue had come back.
I am using calico v3.19.1 as shown below.
kubectl calico version
Client Version: v3.19.1
Git commit: 6fc0db96
Unable to retrieve Cluster Version or Type: resource does not exist: ClusterInformation(default) with error: the server could not find the requested resource (get ClusterInformations.crd.projectcalico.org default)
I did some more digging in the logs and found the following. It looks like Galera attempted to open a connection and failed.
[0] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130432.413401552, {"log"=>"2021-07-24T12:40:32.412880Z 0 [Warning] [MY-000000] [Galera] last inactive check more than PT1.5S (3*evs.inactive_check_period) ago (PT3.50417S), skipping check"}]
[0] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130461.921485610, {"log"=>"2021-07-24T12:41:01.920841Z 0 [Note] [MY-000000] [Galera] PC protocol downgrade 1 -> 0"}]
[1] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130461.921909299, {"log"=>"2021-07-24T12:41:01.921460Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node"}]
[2] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130461.921911826, {"log"=>"view ((empty))"}]
[3] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130461.922410405, {"log"=>"2021-07-24T12:41:01.922374Z 0 [ERROR] [MY-000000] [Galera] failed to open gcomm backend connection: 110: failed to reach primary view (pc.wait_prim_timeout): 110 (Connection timed out)"}]
[4] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130461.922412800, {"log"=>" at gcomm/src/pc.cpp:connect():161"}]
[5] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130461.922487606, {"log"=>"2021-07-24T12:41:01.922428Z 0 [ERROR] [MY-000000] [Galera] gcs/src/gcs_core.cpp:gcs_core_open():220: Failed to open backend connection: -110 (Connection timed out)"}]
[0] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130462.922868257, {"log"=>"2021-07-24T12:41:02.922714Z 0 [Note] [MY-000000] [Galera] gcomm: terminating thread"}]
[1] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130462.922874019, {"log"=>"2021-07-24T12:41:02.922822Z 0 [Note] [MY-000000] [Galera] gcomm: joining thread"}]
[2] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130462.923158562, {"log"=>"2021-07-24T12:41:02.923073Z 0 [ERROR] [MY-000000] [Galera] gcs/src/gcs.cpp:gcs_open():1754: Failed to open channel 'cluster1-pxc' at 'gcomm://10.1.86.126': -110 (Connection timed out)"}]
[3] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130462.923258685, {"log"=>"2021-07-24T12:41:02.923175Z 0 [ERROR] [MY-000000] [Galera] gcs connect failed: Connection timed out"}]
[4] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130462.923260499, {"log"=>"2021-07-24T12:41:02.923219Z 0 [ERROR] [MY-000000] [WSREP] Provider/Node (gcomm://10.1.86.126) failed to establish connection with cluster (reason: 7)"}]
[5] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130462.923345885, {"log"=>"2021-07-24T12:41:02.923255Z 0 [ERROR] [MY-010119] [Server] Aborting"}]
[6] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130462.923724901, {"log"=>"2021-07-24T12:41:02.923666Z 0 [System] [MY-010910] [Server] /usr/sbin/mysqld: Shutdown complete (mysqld 8.0.22-13.1) Percona XtraDB Cluster (GPL), Release rel13, Revision a48e6d5, WSREP version 26.4.3."}]
[7] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130462.924285826, {"log"=>"2021-07-24T12:41:02.924248Z 0 [Note] [MY-000000] [Galera] dtor state: CLOSED"}]
[8] pxcluster.cluster1-pxc-1.mysqld-error.log: [1627130462.924356763, {"log"=>"2021-07-24T12:41:02.924329Z 0 [Note] [MY-000000] [Galera] MemPool(TrxHandleSlave): hit ratio: 0, misses: 0, in use: 0, in pool: 0"}]
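The gcomm address in that error is a bare pod IP; to see which pod actually owns 10.1.86.126, the pod IPs can be listed with (sketch, assuming the pxcluster namespace):
kubectl -n pxcluster get pods -o wide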
Not sure I see the issue. Normally you have to specify the namespace of the pod to resolve it, and in this case dnsutils is running in the default namespace while the Percona cluster is running in the pxcluster namespace.
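For example (a sketch, assuming a dnsutils pod in the default namespace and the cluster in pxcluster):
kubectl exec -it dnsutils -- nslookup cluster1-pxc-unready
kubectl exec -it dnsutils -- nslookup cluster1-pxc-unready.pxcluster.svc.cluster.local
The first, unqualified lookup is expected to fail from the default namespace; the second, fully qualified one should resolve if cluster DNS is healthy.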