How does the Percona MongoDB Operator handle pods with corrupted disks?

Hello,

I am testing the MongoDB operator with a replica set of 3 members. I created a new, empty cluster with three replica set members, then deliberately exec'd into one of the replica set pods and deleted some files inside /data/db to simulate disk corruption on that pod. The pod goes into “CrashLoopBackOff” status and never recovers.
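
For reference, this is roughly how I simulated the corruption (a sketch, not the exact commands; the container name mongod is the operator's default, and the WiredTiger files are just an example of what I removed):

# exec into one member's mongod container and delete a couple of files
# under the data path (which files exactly does not really matter)
kubectl exec -it my-cluster-name-rs0-2 -c mongod -- \
  bash -c 'rm /data/db/WiredTiger.wt /data/db/WiredTiger.turtle'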

NAME                                                   READY   STATUS             RESTARTS       AGE
pod/my-cluster-name-rs0-0                              1/1     Running            0              55m
pod/my-cluster-name-rs0-1                              1/1     Running            0              53m
pod/my-cluster-name-rs0-2                              0/1     CrashLoopBackOff   12 (62s ago)   52m
pod/percona-client                                     1/1     Running            0              130m
pod/percona-server-mongodb-operator-5dd88ff7f7-pxrs8   1/1     Running            0              142m

NAME                          TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)     AGE
service/kubernetes            ClusterIP   10.96.0.1    <none>        443/TCP     3h10m
service/my-cluster-name-rs0   ClusterIP   None         <none>        27017/TCP   142m

NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/percona-server-mongodb-operator   1/1     1            1           142m

NAME                                                         DESIRED   CURRENT   READY   AGE
replicaset.apps/percona-server-mongodb-operator-5dd88ff7f7   1         1         1       142m

NAME                                   READY   AGE
statefulset.apps/my-cluster-name-rs0   2/3     142m

I assumed that when the operator detects a crashing pod, it would reinitialize it and replicate the data from the working pods. Instead, the pod stays in CrashLoopBackOff state and does not recover.

Is there any way to configure the operator so that, when a replica set pod crashes repeatedly, it is reinitialized automatically? In this case, if I delete the pod manually, it is created again in Ready state.

Hey @Darko.

I think it is possible.
For now the solution would be to delete the PVC of this Pod.
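
Roughly like this (a sketch; the PVC name assumes the default mongod-data volume claim template, so check kubectl get pvc first):

# delete the PVC of the broken member, then the pod itself;
# the StatefulSet recreates both and the member runs an initial sync
# from the healthy replica set members
kubectl get pvc
kubectl delete pvc mongod-data-my-cluster-name-rs0-2
kubectl delete pod my-cluster-name-rs0-2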

To automate it, we need to think about the logic.
What do you think the logic should be here for the Operator? How does it know that it is time to delete the PVC, and whether the data was actually corrupted?
Just thinking out loud; I will discuss it with our team as well.

Hey @Sergey_Pronin,

When something goes wrong and the pod crashes and cannot start, it changes status, ends up in “CrashLoopBackOff” (or maybe Error), and restarts endlessly. Maybe after x restarts, if the status is still “CrashLoopBackOff”, the PVC and the pod should be deleted automatically so the replica can be regenerated?
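
As a rough sketch of that logic, this is what it could look like done by hand today (the restart threshold and the PVC name based on the default mongod-data volume claim template are just examples):

# hypothetical watchdog: if the mongod container is stuck in CrashLoopBackOff
# after many restarts, drop its PVC and pod so the member re-syncs from scratch
POD=my-cluster-name-rs0-2
REASON=$(kubectl get pod "$POD" \
  -o jsonpath='{.status.containerStatuses[?(@.name=="mongod")].state.waiting.reason}')
RESTARTS=$(kubectl get pod "$POD" \
  -o jsonpath='{.status.containerStatuses[?(@.name=="mongod")].restartCount}')

if [ "$REASON" = "CrashLoopBackOff" ] && [ "$RESTARTS" -ge 10 ]; then
  kubectl delete pvc "mongod-data-$POD"
  kubectl delete pod "$POD"
fi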

Just had the same issue, where 2 of 3 members failed to start. Because of this, 3 of 3 did not start and the cluster was not available.

IMO this is a no-go. I mean, why do we have redundancy at all?

@jamoser what happened exactly?
We are discussing the case where one of the nodes with local storage goes down and it requires deleting the PVC (and recreating it later).

@Sergey_Pronin I have no idea. When MongoDB inside the pod started, it was visible in the logs that at some point MongoDB just crashed (instantly). It did not log why it crashed. Also, my knowledge of MongoDB internals is limited, so I can’t say what was not OK. On the other hand, it was on a “sleepy” dev server, whereas “productive” systems with heavy load seem to run flawlessly. Touching wood right now …
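
For what it is worth, this is roughly how I was pulling the logs of the crashed container (pod name taken from the example above, not my cluster; the container name mongod is an assumption about the default):

# show the output of the previous, crashed instance of the mongod container
kubectl logs my-cluster-name-rs0-2 -c mongod --previous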
