Kubernetes XtraDB Cluster Operator frequently restarts

Philipp_Malkmus · March 6, 2024, 11:17pm

We’ve been experiencing issues with our XtraDB Cluster Operator, which has been restarting multiple times daily. These restarts appear to be triggered by errors during the leader election process. This issue occurs in both our 1.13.0/8.0.34-26.1 and 1.14.0/8.0.35-27.1 clusters.

Typically, the sequence of events unfolds as follows:

1. HAProxy stops several processes and reports: “write error: Broken pipe.”
2. The cluster operator encounters an error retrieving the resource lock for the database, fails to renew the lease for the database due to a timeout while waiting for a condition, and consequently loses the leader election.
3. Subsequently, the cluster operator restarts.
4. PXC undertakes some actions related to Galera processes.
5. The cluster operator then reconnects to the database.

This entire process concludes in less than a minute and would hardly be noticeable if it weren’t for the operator pod restarting. However, it does impact connections and operations within the database.

I am at a loss as to what could be causing this issue. I hope you can provide some guidance. I have included the full logs from one of the incidents, as well as the Cluster CR, for further analysis.

Logs:
Logs-logs-2024-03-06 23_54_33.txt (247.6 KB)
PXC CR:
cr.txt (15.2 KB)

Sergey_Pronin · March 7, 2024, 8:10pm

Hello @Philipp_Malkmus,

The most obvious culprit would be another copy of the operator running. Are you sure you have only one operator pod? It could also be another operator pod in a different namespace that runs in cluster-wide mode. In that case the two operators would be fighting each other over your PXC custom resources.

I’m also surprised that it harms the connections to the database. That should not happen.

Another option would be that the Lease Duration is too small for your cluster. This can be an issue if you run a tiny cluster and the control plane is just slow.
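To make the lease timing point a bit more concrete, here is a rough sketch of how a leader loses its lease when the API server is slow. It is a simplified model, not the operator's actual code. The three values (15s lease duration, 10s renew deadline, 2s retry period) are the usual controller-runtime defaults; whether your operator version runs with exactly these values is an assumption, and the names `LeaseTimings` and `keeps_leadership` are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeaseTimings:
    # Assumed values: common controller-runtime defaults, in seconds.
    lease_duration: float = 15.0
    renew_deadline: float = 10.0
    retry_period: float = 2.0

    def __post_init__(self):
        # client-go enforces a similar ordering between these settings.
        if not (self.lease_duration > self.renew_deadline > self.retry_period > 0):
            raise ValueError("expected lease_duration > renew_deadline > retry_period > 0")


def keeps_leadership(outage_seconds: float, timings: LeaseTimings = LeaseTimings()) -> bool:
    """Simplified model of the renew loop.

    The leader retries the renewal every retry_period. Every attempt made
    while the API server is unreachable (the first outage_seconds) fails.
    If no attempt succeeds before renew_deadline has passed, the leader
    gives up, which is when the operator process exits and the pod restarts.
    """
    elapsed = 0.0
    while elapsed < timings.renew_deadline:
        if elapsed >= outage_seconds:
            return True  # this attempt reaches the API server
        elapsed += timings.retry_period
    return False


if __name__ == "__main__":
    for outage in (3, 9, 12):
        print(outage, keeps_leadership(outage))
```

In this model a 3 second hiccup is harmless, while anything that blocks renewals for roughly the whole renew deadline costs the leadership. That matches the "failed to renew lease ... timed out waiting for the condition" pattern described in the first post, but it does not say why the API calls were slow.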

Philipp_Malkmus · March 8, 2024, 12:30pm

Hello @Sergey_Pronin

I made sure that only one operator is installed.
Are there any metrics I could provide, for example from PMM or Prometheus, that would help with debugging?
Unfortunately I don’t know where to look up the lease duration.
It seems to be independent of cluster size, as it happens on both our small dev cluster and our bigger prod cluster. We are using Kubernetes 1.26, by the way.

Thank you for the support so far.
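On where to look up the lease duration: leader election state is normally stored in a Lease object (coordination.k8s.io/v1) in the operator's namespace, and its spec carries `holderIdentity`, `leaseDurationSeconds`, `renewTime` and `leaseTransitions`. A small sketch for summarizing the output of `kubectl get leases -n <operator-namespace> -o json` could look like the following. The function name `summarize_leases` and all values in `SAMPLE` are hypothetical; they are not taken from the attached logs or CR.

```python
import json

# Hypothetical sample, shaped like `kubectl get leases -o json` output.
SAMPLE = """
{
  "kind": "List",
  "items": [
    {
      "metadata": {"namespace": "pxc", "name": "example-operator-lock"},
      "spec": {
        "holderIdentity": "example-operator-pod_1234",
        "leaseDurationSeconds": 15,
        "leaseTransitions": 42,
        "renewTime": "2024-03-06T22:54:31.000000Z"
      }
    }
  ]
}
"""


def summarize_leases(lease_list_json: str) -> list[dict]:
    """Return one row per Lease, most holder changes first."""
    doc = json.loads(lease_list_json)
    rows = []
    for item in doc.get("items", []):
        meta = item.get("metadata", {})
        spec = item.get("spec", {})
        rows.append({
            "namespace": meta.get("namespace"),
            "name": meta.get("name"),
            "holder": spec.get("holderIdentity"),
            "lease_duration_seconds": spec.get("leaseDurationSeconds"),
            # Counts how often the lease changed hands; missing means 0 here.
            "transitions": spec.get("leaseTransitions") or 0,
            "renew_time": spec.get("renewTime"),
        })
    rows.sort(key=lambda r: r["transitions"], reverse=True)
    return rows


if __name__ == "__main__":
    for row in summarize_leases(SAMPLE):
        print(row)
```

A transition count that keeps climbing, or a holder that is not the operator pod you expect, would be worth posting here along with the lease duration.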
