We’ve been experiencing issues with our XtraDB Cluster Operator, which has been restarting multiple times daily. These restarts appear to be triggered by errors during the leader election process. The issue occurs in both of our clusters, which run operator/PXC versions 1.13.0/8.0.34-26.1 and 1.14.0/8.0.35-27.1.
Typically, the sequence of events unfolds as follows:
- HAProxy stops several processes and reports “write error: Broken pipe”.
- The cluster operator encounters an error retrieving the resource lock for the database, fails to renew the lease for the database due to a timeout while waiting for a condition, and consequently loses the leader election.
- Subsequently, the cluster operator restarts.
- PXC undertakes some actions related to Galera processes.
- The cluster operator then reconnects to the database.
The entire sequence completes in less than a minute. Apart from the operator pod restarting it would be easy to miss, but it does disrupt connections and operations within the database.
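For anyone who wants to find the same sequence in a combined log dump, a minimal sketch along these lines may help. The search patterns are assumptions paraphrased from the description above, not strings confirmed against the attached logs, and the names `PATTERNS`, `classify` and `timeline` are made up for illustration.

```python
import re

# Hypothetical patterns, paraphrased from the sequence described above.
# Adjust them to the exact wording found in the real logs.
PATTERNS = {
    "haproxy_broken_pipe": re.compile(r"write error: Broken pipe", re.IGNORECASE),
    "lock_error": re.compile(r"error retrieving resource lock", re.IGNORECASE),
    "lease_renew_failed": re.compile(r"failed to renew lease", re.IGNORECASE),
    "leader_election_lost": re.compile(r"leader election lost", re.IGNORECASE),
}


def classify(line):
    """Return the labels of all patterns that match a single log line."""
    return [label for label, pattern in PATTERNS.items() if pattern.search(line)]


def timeline(lines):
    """Return (line_number, label) pairs in the order the events appear."""
    events = []
    for number, line in enumerate(lines, start=1):
        for label in classify(line):
            events.append((number, label))
    return events
```

Passing an open file handle (or any list of log lines) to `timeline` gives the order in which the HAProxy error and the leader election messages appear, which makes it easier to check whether the broken pipe really comes before the lease renewal failure in every incident.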
I am at a loss as to what could be causing this issue. I hope you can provide some guidance. I have included the full logs from one of the incidents, as well as the Cluster CR, for further analysis.
Logs:
Logs-logs-2024-03-06 23_54_33.txt (247.6 KB)
PXC CR:
cr.txt (15.2 KB)