GKE / MongoDB cluster under "stress" not accepting changes from cr.yaml

Hello

Just recently, a MongoDB cluster did not have enough memory and crashed right after startup, as soon as the first request hit the cluster. The pods then restarted “endlessly”.

I adjusted the memory settings in cr.yaml and applied them. But the memory changes were not reflected, and the pods kept restarting due to OOM.
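
For context, this is the kind of change I mean (the cluster name and values here are placeholders, not my exact setup):

```yaml
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: my-cluster-name
spec:
  replsets:
    - name: rs0
      size: 3
      resources:
        limits:
          memory: 4Gi   # raised from a value that was too low
        requests:
          memory: 4Gi
```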

If you do this with a plain Kubernetes Deployment/StatefulSet/etc., the changes are accepted immediately, and that is how pods under stress (e.g. OOM) can be rescued.
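
For comparison, on a plain StatefulSet something like this takes effect right away, even while the pods are crash-looping (my-app is just a placeholder name):

```sh
# Picked up immediately and triggers a rolling restart:
kubectl set resources statefulset/my-app \
  --limits=memory=4Gi --requests=memory=4Gi
```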

But it seems this is NOT the case with the Percona MongoDB cluster.

I first noticed this issue a couple of years ago, and it is still there, in both the sharded and non-sharded variants.

The only way to get this resolved is to set replicas = 0 (at the Kubernetes level!) and then set replicas = 3 again. But a) you get a service interruption, and b) there is currently still the problem of the unclean shutdown, which causes a recovery when the pod / MongoDB engine starts up, and this can take “hours”.
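
Roughly like this, assuming the default operator naming scheme of `<cluster-name>-<replset-name>` for the StatefulSet (names are placeholders, and the operator may try to reconcile the scale back):

```sh
# Scale the replica set StatefulSet down at the Kubernetes level...
kubectl scale statefulset/my-cluster-name-rs0 --replicas=0
# ...wait for the pods to terminate, then scale back up:
kubectl scale statefulset/my-cluster-name-rs0 --replicas=3
```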

Regards
John

Hi @jamoser, how can I reproduce it? Please provide STR (steps to reproduce).

Hello

You can simulate it very simply: take a cluster with a meaningful amount of data and set the memory limit too low. The pods will then crash due to OOM, and in Kubernetes you will see the restart count going up.
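
Something like this shows it, assuming the standard operator labels and pod names (my-cluster-name is a placeholder):

```sh
# Watch the restart counter climb:
kubectl get pods -l app.kubernetes.io/instance=my-cluster-name -w
# The container's last termination reason shows OOMKilled:
kubectl describe pod my-cluster-name-rs0-0 | grep -A2 "Last State"
```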

After roughly 20-30 restarts (and with queries still hitting the cluster), try to apply new memory settings (increase the memory back to the previously stable value).
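
A quick way to see whether the change actually reached the pods, with the same placeholder names as above:

```sh
# Re-apply the CR with the higher memory limit...
kubectl apply -f cr.yaml
# ...then check whether the operator pushed it down to the StatefulSet:
kubectl get sts my-cluster-name-rs0 \
  -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}'
```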

In many cases the new memory settings are not applied. It “feels” like the Percona custom resource / Operator prefers “healing” the cluster over applying the new memory settings to the pods (which is what would actually heal the cluster).

Regards
John