Percona Operator for MySQL - Group Replication - health_check issues when instance is in recovery

Description:

Sometimes an instance has to be restarted (e.g. when a node is drained, or some other eviction kills the mysql pod). After starting up again it needs to recover, and that recovery can take some time, depending on how many changes/binlog entries have accumulated in the meantime.
Unfortunately the health_check only accepts the member state “ONLINE” in Group Replication, but during recovery the state is “RECOVERING”. The result is a crashloop: the pod gets restarted, which accumulates even more binlogs to catch up on, and so on.
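For reference, you can watch the member state the health_check reacts to from inside the pod (the pod name is just an example, and I’m assuming the root password is available as MYSQL_ROOT_PASSWORD in the container):

```bash
# Show the Group Replication member states (ONLINE, RECOVERING, ...)
kubectl exec -it cluster1-mysql-0 -c mysql -- sh -c \
  'mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e \
   "SELECT MEMBER_HOST, MEMBER_STATE FROM performance_schema.replication_group_members;"'
```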
We cope with that by manually creating the “sleep-forever” file, which blinds the health_check and gives the pod the time it needs to recover.
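In case someone needs the workaround, it looks roughly like this (the file path is an assumption on my side; verify what your operator version actually checks):

```bash
# Blind the health checks while the member catches up
# (path assumed - check your operator version)
kubectl exec cluster1-mysql-0 -c mysql -- touch /var/lib/mysql/sleep-forever

# Once MEMBER_STATE is back to ONLINE, re-enable the checks
kubectl exec cluster1-mysql-0 -c mysql -- rm /var/lib/mysql/sleep-forever
```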
Wouldn’t it be much nicer to modify the health_check so that “RECOVERING” also counts as ok’ish? See the sketch below.
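Something like this minimal sketch of a more tolerant readiness check (my own sketch, not the operator’s actual healthcheck code; credentials omitted for brevity):

```bash
#!/bin/bash
# Look up this member's state by matching MEMBER_ID to the local server UUID
STATE=$(mysql -N -e "SELECT MEMBER_STATE FROM performance_schema.replication_group_members WHERE MEMBER_ID = @@server_uuid;")

case "$STATE" in
  ONLINE|RECOVERING) exit 0 ;;  # healthy enough - don't kill the pod
  *)                 exit 1 ;;  # OFFLINE/ERROR/UNREACHABLE -> fail the probe
esac
```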

Steps to Reproduce:

  1. Setup a cluster
  2. Put in loads of data
  3. Kill one pod and change data on the other nodes in parallel
  4. Watch the newly spawned pod start and get killed over and over again (we had a DB here sitting around for days that got killed 27,000 times until somebody noticed^^) - see the snippet below for a quick way to spot this
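A quick way to see how bad the crashloop already is (pod name is a placeholder):

```bash
# Print the restart counter of the recovering pod
kubectl get pod cluster1-mysql-2 \
  -o jsonpath='{.status.containerStatuses[0].restartCount}'
```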

Version:

0.6.0 and probably 0.7.0

Expected Result:

The health_check treats “RECOVERING” as acceptable and doesn’t kill the pod :smiley:

Additional Information:

We also discovered the same issue when restoring a cluster from backup once it involved more than a decent amount of data (~40 GByte) from an S3 backup. The primary starts fine after the xtrabackup data is copied in, but the first secondary got killed about 25-36 GByte into the “copy from donor” phase - due to health_check issues… BTW: This renders the restore unusable for this scenario.
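If the secondaries are provisioned via the CLONE plugin (my assumption for what “copy from donor” means here), the copy progress can at least be watched while it races the probe (pod name is a placeholder again):

```bash
# Watch clone progress/errors on the joining secondary
kubectl exec cluster1-mysql-1 -c mysql -- sh -c \
  'mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e \
   "SELECT STATE, ERROR_MESSAGE FROM performance_schema.clone_status;"'
```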

Any ideas? Fixes?

Best,
Ingo

I got it more or less fixed by changing the STS’s startupProbe.timeoutSeconds from the default of 300 to 3000, which gave it enough time to copy the data in. A sketch of that change is below.
Nevertheless: quite a bit annoying :smiley:
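Roughly this (STS name and container index are guesses for your setup; also note the operator may reconcile a manual STS edit back to its defaults):

```bash
# Raise the startup probe timeout on the mysql container (assumed index 0)
kubectl patch sts cluster1-mysql --type=json -p '[
  {"op": "replace",
   "path": "/spec/template/spec/containers/0/startupProbe/timeoutSeconds",
   "value": 3000}
]'
```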

Hello @Ingo.Bez ,

thanks for the feedback! Seems valid.

We will look into it.

I’m extremely curious what drove your decision to pick the Operator for MySQL with Group Replication vs the one we have based on Percona XtraDB Cluster.

If you have time, it would be great to discuss your existing operator issues and the problem space overall. Feel free to schedule something here: Zoom Scheduler

I am a colleague of Ingo and would like to answer the question.

We mainly wanted to stay with a single primary (master / slave) configuration. We always had extreme problems with multi-primary configurations, especially with our self-written software.

Thanks @ofeige .

With Percona XtraDB Cluster and our Operator for it, you can still run it as a single primary with multiple replicas. That is what we usually recommend as well.

I would assume that you have bigger issues with the nature of synchronous replication. Is that correct?

Yes, that is correct. We had problems with deviating auto-increment behavior (see the example below), slower performance during large write operations, and a lack of knowledge among developers and administrators. Therefore, we decided in the past not to use Galera Cluster or anything similar.
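For context, the auto-increment deviation: with Galera’s wsrep_auto_increment_control enabled, each node gets its own increment/offset, so generated IDs interleave across nodes instead of being sequential. You can see the effective settings per node:

```bash
# On a 3-node Galera cluster this typically shows increment=3 and a
# node-specific offset, so IDs come out as 1,4,7 on one node, 2,5,8 on another
mysql -e "SHOW VARIABLES LIKE 'auto_increment%';
          SHOW VARIABLES LIKE 'wsrep_auto_increment_control';"
```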

@Sergey_Pronin thanks a lot for fixing this behavior. I tested 0.8.0 and it works just fine :slight_smile: