Random pxc node fails with gcomm issues

Ilford · January 10, 2023, 10:17am

Hello,

I’m deploying the pxc-operator and pxc-db Helm charts on my k8s cluster, both at version 1.12.0.

It works well, but after some time one of the cluster pods crashes, and the logs show these errors:

[ERROR] [MY-000000] [Galera] failed to open gcomm backend connection: 110: failed to reach primary view (pc.wait_prim_timeout): 110 (Connection timed out)\n\t at gcomm/src/pc.cpp:connect():161\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-01-10T10:11:42.739343Z 0 [ERROR] [MY-000000] [Galera] gcs/src/gcs_core.cpp:gcs_core_open():219: Failed to open backend connection: -110 (Connection timed out)\n","file":"/var/lib/mysql/mysqld-error.log"}

It looks as if the pod crashes for some reason and then cannot synchronize with the other pods of the cluster.

For context, I enabled persistence in the Helm values and had to set pxc_strict_mode to PERMISSIVE to support some legacy apps. I hope this is not related.

Does anyone have a clue how to fix this?

Thanks

Slava_Sarzhan · January 10, 2023, 10:40am

Hey @Ilford,

Do you know why the pods crash? Do you have any logs from these crashes?

Ilford · January 10, 2023, 11:00am

The first errors that appear before a crash are these (I’m not sure this is always the case):

{"log":"2023-01-10T10:22:32.865988Z 0 [ERROR] [MY-000000] [Galera] failed to open gcomm backend connection: 110: failed to reach primary view (pc.wait_prim_timeout): 110 (Connection timed out)\n\t at gcomm/src/pc.cpp:connect():161\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-01-10T10:22:32.866038Z 0 [ERROR] [MY-000000] [Galera] gcs/src/gcs_core.cpp:gcs_core_open():219: Failed to open backend connection: -110 (Connection timed out)\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-01-10T10:22:33.866582Z 0 [ERROR] [MY-000000] [Galera] gcs/src/gcs.cpp:gcs_open():1811: Failed to open channel 'siam-pxc-db-pxc' at 'gcomm://siam-pxc-db-pxc-0.siam-pxc-db-pxc,siam-pxc-db-pxc-1.siam-pxc-db-pxc': -110 (Connection timed out)\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-01-10T10:22:33.866625Z 0 [ERROR] [MY-000000] [Galera] gcs connect failed: Connection timed out\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-01-10T10:22:33.866653Z 0 [ERROR] [MY-000000] [WSREP] Provider/Node (gcomm://siam-pxc-db-pxc-0.siam-pxc-db-pxc,siam-pxc-db-pxc-1.siam-pxc-db-pxc) failed to establish connection with cluster (reason: 7)\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-01-10T10:22:33.866688Z 0 [ERROR] [MY-010119] [Server] Aborting\n","file":"/var/lib/mysql/mysqld-error.log"}
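For anyone reading logs in this shape: each entry is wrapped in JSON, which makes the actual error text hard to scan. The following is a minimal sketch that keeps only the [ERROR] entries and strips the wrapper. The function name galera_errors is hypothetical, and it assumes exactly the {"log":"...","file":"..."} layout shown above.

```shell
# Hypothetical helper: read JSON-wrapped log lines on stdin and print
# only the [ERROR] messages, without the {"log":"...","file":"..."} wrapper.
# Usage (pod name is a placeholder):
#   kubectl logs <your_pod_name> --all-containers | galera_errors
galera_errors() {
  grep -F '[ERROR]' |
    sed -E 's/^\{"log":"//; s/\\n","file":"[^"]*"\}$//'
}
```

Escaped sequences inside a message (such as the \n\t before "at gcomm/src/pc.cpp") are left untouched; only the trailing newline escape and the file field are removed.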

Ilford · January 10, 2023, 11:43am

OK, I managed to catch a crash live. This appears in the logs:

[2023/01/10 11:38:30] [engine] caught signal (SIGTERM)
{"log":"2023-01-10T11:38:30.561075Z 0 [System] [MY-013172] [Server] Received SHUTDOWN from user <via user signal>. Shutting down mysqld (Version: 8.0.29-21.1).\n","file":"/var/lib/mysql/mysqld-error.log"}

After that, the pod cannot sync with the cluster.

I did not do anything special to send this signal to the engine.

For context, I have multiple pxc clusters and operators in different namespaces. I did not deploy the operator “cluster wide”, but I have the feeling that something is wrong with the operators: the crashes appeared when I deployed a pxc cluster in a different namespace. I’m not sure whether it is related.

Ilford · January 10, 2023, 11:51am

I can trigger the crash by deleting the pod: the StatefulSet recreates it, but the same error appears (the gcomm timeout).

Ilford · January 10, 2023, 12:02pm

I’m experimenting a lot with my pxc clusters, and I’m now convinced there is some relationship between two cluster+operator pairs deployed in two separate namespaces. I don’t know how this is possible, but each time I act on one cluster (deleting it, restoring a backup, etc.), something happens on another cluster (a pod crashes, or a crashed pod fixes itself…).

Slava_Sarzhan · January 10, 2023, 7:03pm

Hi,

Ilford:

[2023/01/10 11:38:30] [engine] caught signal (SIGTERM)
{"log":"2023-01-10T11:38:30.561075Z 0 [System] [MY-013172] [Server] Received SHUTDOWN from user <via user signal>. Shutting down mysqld (Version: 8.0.29-21.1).\n","file":"/var/lib/mysql/mysqld-error.log"}

I think either a liveness probe killed the container, or you have a resource problem and k8s killed it. Do you see any ‘OOMKilled’ errors?

Please check the output of the following commands:

kubectl get events --sort-by=.metadata.creationTimestamp -w
kubectl describe pods <your_pod_name>
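
Both suspicions can also be checked more directly by reading the containers' last termination state and the probe-failure events for the pod. A minimal sketch: the function names are hypothetical and the pod and namespace are arguments you supply; the jsonpath fields come from the standard Pod status, and "Unhealthy" is the event reason the kubelet records for failed probes.

```shell
# Hypothetical helper: print name, restart count, and last termination
# reason (for example OOMKilled or Error) for each container in a pod.
last_termination() {
  pod=$1; ns=$2
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}

# Hypothetical helper: list probe failures recorded for the pod.
probe_failures() {
  pod=$1; ns=$2
  kubectl get events -n "$ns" --field-selector "involvedObject.name=$pod,reason=Unhealthy"
}
```

Call them as last_termination <your_pod_name> <namespace> and probe_failures <your_pod_name> <namespace>. An empty reason column means the container has no recorded previous termination.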

Please provide your CR as well.

I am not sure how that is possible if you do not use cluster-wide (CW) mode. One idea: if your new cluster takes resources from the k8s cluster and you have no per-namespace limits, it can affect the existing clusters.
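
To see whether a namespace has any such limits in place, its ResourceQuota and LimitRange objects can be listed. A small sketch; the function name is hypothetical and the namespaces are whatever you pass in:

```shell
# Hypothetical helper: show quotas and default limits for each namespace
# given; kubectl reports "No resources found" when none are defined.
# Usage: namespace_limits <namespace_a> <namespace_b>
namespace_limits() {
  for ns in "$@"; do
    echo "== $ns =="
    kubectl get resourcequota,limitrange -n "$ns"
  done
}
```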

Ilford · January 12, 2023, 12:38pm

Hi,

I tried to inspect the logs but I didn’t see OOM errors.

Since I suspected faulty communication between services, I tried a change and enabled certManager with these values:

pxc-db:
  pxc:  
    certManager: true
  tls:
    issuerConf:
      name: my-issuer
      kind: ClusterIssuer

And now everything works very smoothly. If I act as a chaos monkey and delete some pods manually, every pod resynchronizes with the cluster very quickly. I waited a couple of days before replying here to be sure, and so far I have had no crashes (before, I systematically had a crash a couple of minutes after the cluster was bootstrapped).

I cannot explain how this config solved my issue, but I’m glad it did.

Thanks for your time!
