Kubernetes XtraDB Cluster Operator frequently restarts

We’ve been experiencing issues with our XtraDB Cluster Operator, which has been restarting multiple times daily. These restarts appear to be triggered by errors during the leader election process. This issue occurs in both our clusters: one on operator 1.13.0 with PXC 8.0.34-26.1, the other on operator 1.14.0 with PXC 8.0.35-27.1.

Typically, the sequence of events unfolds as follows:

  • HAProxy stops several processes and reports: “write error: Broken pipe.”
  • The cluster operator encounters an error retrieving the resource lock for the database, fails to renew the lease for the database due to a timeout while waiting for a condition, and consequently loses the leader election.
  • Subsequently, the cluster operator restarts.
  • PXC undertakes some actions related to Galera processes.
  • The cluster operator then reconnects to the database.

The entire sequence completes in under a minute and would be easy to miss if the operator pod didn’t restart. Unfortunately, it does impact connections and operations within the database.

I am at a loss as to what could be causing this issue. I hope you can provide some guidance. I have included the full logs from one of the incidents, as well as the Cluster CR, for further analysis.
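(For anyone wanting to collect the same data: the restart count and the logs from before a restart can be read straight off the operator pod. The namespace and label selector below are placeholders from a default install and may differ in your setup.)

    # Operator pods and their restart counts (namespace/label are placeholders)
    kubectl get pods -n pxc-operator -l app.kubernetes.io/name=percona-xtradb-cluster-operator
    # Logs from the container instance that ran before the most recent restart
    kubectl logs <operator-pod> -n pxc-operator --previous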

Logs:
Logs-logs-2024-03-06 23_54_33.txt (247.6 KB)
PXC CR:
cr.txt (15.2 KB)

Hello @Philipp_Malkmus,

The most obvious culprit would be another copy of the operator running. Are you sure you have only one operator pod? It could also be an operator pod in another namespace running in cluster-wide mode; two operators would fight each other over your PXC custom resources.
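A quick way to verify, sketched with a plain grep since labels differ between install methods:

    # Look for operator pods in every namespace
    kubectl get pods --all-namespaces | grep -i xtradb-cluster-operator
    # And for operator deployments, in case a second copy runs elsewhere
    kubectl get deployments --all-namespaces | grep -i xtradb-cluster-operator

More than one result (beyond a brief rolling-update overlap) would explain the constant lease churn.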

I’m also surprised, though, that it harms connections to the database. That should not happen.

Another option would be that the lease duration is too small for your cluster. That can be an issue if you run a tiny cluster and the control plane is simply slow.
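The operator’s leader-election state should live in a Lease object in the operator’s namespace, so you can inspect the configured duration and the current holder there (namespace and Lease name below are placeholders; yours will differ):

    # List leases in the operator's namespace
    kubectl get leases -n pxc-operator
    # Shows holderIdentity, leaseDurationSeconds, and the last renewTime
    kubectl describe lease <lease-name> -n pxc-operator

If renewTime repeatedly lags behind leaseDurationSeconds, that points at a slow or overloaded API server rather than at the operator itself.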

Hello @Sergey_Pronin,

I made sure that only one operator is installed.
Are there any metrics I could provide, such as PMM or Prometheus data, that would help with debugging?
Unfortunately, I don’t know where to look up the lease duration.
It seems to be independent of cluster size, as it happens on both our small dev cluster and our bigger prod cluster. We are using Kubernetes 1.26, by the way.

Thank you for the support so far.