I’m deploying the pxc-operator and pxc-db Helm charts on my k8s cluster, both in version 1.12.0.
It works great, but after some time one of the cluster pods crashes, and in the logs I see these errors:
[ERROR] [MY-000000] [Galera] failed to open gcomm backend connection: 110: failed to reach primary view (pc.wait_prim_timeout): 110 (Connection timed out)\n\t at gcomm/src/pc.cpp:connect():161\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-01-10T10:11:42.739343Z 0 [ERROR] [MY-000000] [Galera] gcs/src/gcs_core.cpp:gcs_core_open():219: Failed to open backend connection: -110 (Connection timed out)\n","file":"/var/lib/mysql/mysqld-error.log"}
It’s as if the pod crashes for some reason and then cannot synchronize with the other pods of the cluster.
Just so you know, I enabled persistence in the Helm values, and I had to set pxc_strict_mode to PERMISSIVE to support some legacy apps; I hope this is not related.
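For reference, this is roughly how I set those values (a minimal sketch: the pxc.persistence and pxc.configuration keys are how I understand the pxc-db chart exposes them, and the release/namespace names are placeholders; double-check against the chart’s values.yaml for 1.12.0):

# write a values override: persistence on, plus a my.cnf snippet with pxc_strict_mode
cat > my-values.yaml <<'EOF'
pxc:
  persistence:
    enabled: true
  configuration: |
    [mysqld]
    pxc_strict_mode=PERMISSIVE
EOF
# assumes the Percona Helm repo is added as "percona"
helm upgrade --install my-cluster percona/pxc-db --version 1.12.0 -n my-namespace -f my-values.yaml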
OK, I managed to catch a crash live; this is what appears in the logs:
[2023/01/10 11:38:30] [engine] caught signal (SIGTERM)
{"log":"2023-01-10T11:38:30.561075Z 0 [System] [MY-013172] [Server] Received SHUTDOWN from user <via user signal>. Shutting down mysqld (Version: 8.0.29-21.1).\n","file":"/var/lib/mysql/mysqld-error.log"}
And then it is impossible for it to sync with the cluster.
I did not do anything special to send this signal to the engine.
For context, I have multiple PXC clusters and operators in different namespaces. I did not deploy the operator “cluster wide”, but I have the feeling that something is wrong with the operators: the crash appeared when I deployed a PXC cluster in a different namespace. I’m not sure if it is related.
I’m experimenting a lot with my PXC clusters, and I’m now convinced there is some relation between two cluster+operator pairs deployed in two separate namespaces. I don’t know how this is possible, though, but each time I act on one cluster (deleting it, restoring a backup, etc.), something happens on another cluster (a pod crashes, or a crashed pod fixes itself…).
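One thing I plan to verify is which namespace each operator deployment is actually watching (my understanding is that the operator uses a WATCH_NAMESPACE environment variable for this; the deployment name below is a guess from my setup, adjust it to yours):

# show the WATCH_NAMESPACE env var of the operator deployment in each namespace
kubectl -n <namespace> get deploy percona-xtradb-cluster-operator -o yaml | grep -A2 WATCH_NAMESPACE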
I think either a liveness probe kills the container, or you have a problem with resources and k8s killed it. Do you see any ‘OOMKilled’ errors?
Please check the output of the following commands:
kubectl get events --sort-by=.metadata.creationTimestamp -w
kubectl describe pods <your_pod_name>
Please provide your CR as well.
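You can also check the last termination reason of the crashed container directly, for example (replace the pod name with yours):

# prints OOMKilled if the kernel OOM killer terminated the container, otherwise e.g. Error
kubectl get pod <your_pod_name> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'

If it prints OOMKilled, it was a memory limit problem rather than the liveness probe.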
I am not sure how that is possible if you do not use CW (cluster-wide mode). One idea: if, for example, your new cluster takes some resources from the k8s cluster and you do not have any limits per namespace, it can have an influence on the existing clusters.
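If you want to rule that out, you could put a ResourceQuota on each namespace, for example (the quota name and numbers are just placeholders, size them for your workload):

# caps total CPU/memory requests and limits for everything in the namespace
kubectl create quota pxc-quota --hard=requests.cpu=8,requests.memory=16Gi,limits.cpu=12,limits.memory=24Gi -n <your_namespace>

Note that once such a quota exists, pods without resource requests/limits are rejected in that namespace, so make sure the PXC pods have resources set in the CR.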
And now everything works very smoothly. If I act as a chaos monkey and delete some pods manually, every pod synchronizes with the cluster very quickly. I waited a couple of days to reply here to be sure, but for now I have no crashes (before, I systematically had a crash a couple of minutes after the cluster was bootstrapped).
I cannot explain how this config solved my issue, but I’m glad it did.
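For anyone who wants to reproduce the chaos test, it is nothing fancy; I just delete a pod and watch it rejoin (the pod name follows the operator’s <cluster-name>-pxc-N convention, adjust it to your release name and namespace):

kubectl delete pod cluster1-pxc-1 -n <namespace>
# watch the pod come back and re-join the cluster until it is Ready again
kubectl get pods -n <namespace> -w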