Description:
Sometimes an instance has to be restarted (e.g. when a node drain or some other eviction kills the mysql-pod). After starting up again it needs to recover, and that recovery can take quite a while depending on how many changes/binlog entries have accumulated in the meantime.
Unfortunately the health_check only accepts the state “ONLINE” in Group Replication, but during that recovery the member reports “RECOVERY”. The probe therefore fails, the pod ends up in a crashloop and gets restarted, which in turn piles up even more binlogs to recover, and so on.
We currently cope with that by manually creating the “sleep-forever” file, which blinds the health_check and gives the pod time to recover.
Wouldn’t it be much nicer to modify the health_check to treat “RECOVERY” as also “ok’ish”?
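A minimal sketch of what such a probe could look like, assuming the local mysqld is reachable on 127.0.0.1 with a dedicated probe user (host, user and password are placeholders, and the operator’s actual health_check surely looks different):

```python
# Hedged sketch of a readiness/liveness check that also accepts the
# distributed-recovery state. Connection details are placeholders.
import sys
import mysql.connector  # pip install mysql-connector-python

# performance_schema reports the member as RECOVERING while it catches up.
OK_STATES = {"ONLINE", "RECOVERING"}

conn = mysql.connector.connect(host="127.0.0.1", user="probe", password="probe-password")
cur = conn.cursor()
cur.execute(
    "SELECT MEMBER_STATE FROM performance_schema.replication_group_members "
    "WHERE MEMBER_ID = @@server_uuid"
)
row = cur.fetchone()
state = row[0] if row else "OFFLINE"

# Exit 0 (healthy) for ONLINE and RECOVERING, non-zero for everything else,
# so kubelet stops killing the pod while it is still applying missed changes.
sys.exit(0 if state in OK_STATES else 1)
```

One could of course gate the RECOVERING case behind a timeout so a member that is stuck in recovery forever still gets flagged eventually.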
Steps to Reproduce:
- Set up a cluster
- Put in loads of data
- Kill one pod and change data on the other nodes in parallel
- Watch the newly spawned pod start and get killed over and over again (we had a DB sitting there for days that got killed 27,000 times before anyone noticed ^^); see the sketch after this list
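For reference, a minimal sketch of the kill-and-watch step using the Kubernetes Python client; the pod and namespace names (“mysql-cluster-2”, “default”) are made-up placeholders for whatever the operator created in your setup:

```python
# Minimal sketch of the reproduction step, assuming kubeconfig access to the
# cluster; pod and namespace names are hypothetical.
from kubernetes import client, config

config.load_kube_config()          # or config.load_incluster_config()
v1 = client.CoreV1Api()

# Kill one member pod; the StatefulSet respawns it and it enters GR recovery.
v1.delete_namespaced_pod(name="mysql-cluster-2", namespace="default")

# Later: watch the restart counter climb while the member is still recovering.
pod = v1.read_namespaced_pod(name="mysql-cluster-2", namespace="default")
for cs in (pod.status.container_statuses or []):
    print(cs.name, "restarts:", cs.restart_count)
```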
Version:
0.6.0 and probably 0.7.0
Expected Result:
Health_check treats “RECOVERY” as an acceptable state and does not fail (and thereby kill) a pod that is still recovering
Additional Information:
We also ran into the same issue when restoring a cluster from an S3 backup with more than a modest amount of data (~40 GByte). The primary starts fine once the xtrabackup has been copied in, but the first secondary gets killed roughly 25-36 GByte into the “copy from donor” phase, again because of the health_check… BTW: this renders the restore unusable for this scenario.
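If the “copy from donor” phase in these versions goes through the MySQL CLONE plugin (an assumption on my side, I have not checked how the operator actually transfers the data), the probe could additionally stay green while a clone is running by looking at the standard clone status table, roughly like this:

```python
# Hedged sketch: also treat an in-progress donor copy as healthy. Assumes the
# donor copy uses the CLONE plugin (so performance_schema.clone_status exists);
# connection details are placeholders.
import sys
import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(host="127.0.0.1", user="probe", password="probe-password")
cur = conn.cursor()

# A row with STATE = 'In Progress' means the secondary is still pulling data
# from the donor; killing it now would restart the multi-GB copy from scratch.
cur.execute("SELECT STATE FROM performance_schema.clone_status")
if any(row[0] == "In Progress" for row in cur.fetchall()):
    sys.exit(0)

# Otherwise fall back to the usual Group Replication member-state check.
cur.execute(
    "SELECT MEMBER_STATE FROM performance_schema.replication_group_members "
    "WHERE MEMBER_ID = @@server_uuid"
)
row = cur.fetchone()
sys.exit(0 if row and row[0] in ("ONLINE", "RECOVERING") else 1)
```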
Any ideas? Fixes?
Best,
Ingo