Percona Cluster crashed and does not want to start up

Hello

Currently the replica set does not want to restart:

NAME READY STATUS RESTARTS AGE
luz-mongodb-cluster-cfg-0 2/2 Running 0 9d
luz-mongodb-cluster-cfg-1 2/2 Running 0 9d
luz-mongodb-cluster-cfg-2 2/2 Running 0 9d
luz-mongodb-cluster-mongos-79d78475f6-852mp 0/1 Running 0 9d
luz-mongodb-cluster-mongos-79d78475f6-tmvbb 0/1 Running 0 9d
luz-mongodb-cluster-mongos-79d78475f6-tsxwm 0/1 Running 0 9d
luz-mongodb-cluster-rs0-0 0/1 CrashLoopBackOff 6 9d
luz-mongodb-cluster-rs0-1 0/1 CrashLoopBackOff 10 9d
luz-mongodb-cluster-rs0-2 0/1 Running 8 9d
percona-server-mongodb-operator-586b769b44-vvhcp 1/1 Running 0 10d

Thanks & Regards
John

Replying to my own question:

  • if the rs0 pods have been OOM-killed they will not recover on their own
  • you have to delete the rs0 pods
  • after rs0-0 starts, the recovery will fail with the default settings (depending on the amount of data), because the pod gets killed by the liveness probe
  • setting the liveness probe to 5 min (or even higher) lets the recovery run (see the sketch below)

→ cluster is up again
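For reference, a rough sketch of how the liveness probe budget can be raised for the rs0 replica set in the PerconaServerMongoDB custom resource. The field names follow the operator's example cr.yaml, so verify them against the CR version you are running; the values are just the ones that worked for us, not recommendations:

spec:
  replsets:
    - name: rs0
      size: 3
      livenessProbe:
        initialDelaySeconds: 300   # give mongod roughly 5 min before the first probe
        periodSeconds: 30
        timeoutSeconds: 10
        failureThreshold: 6

With a larger probe budget the recovery on rs0-0 had enough time to finish instead of being killed halfway through.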

Maybe a question for Percona: we lost all the index*.wt files after deleting the rs0 pods. How is this possible?

Hello @jamoser,

How is your storage configured? Normally the index files are stored on the Persistent Volume Claims, so only removal of the PVC could cause this.
But if you have local storage configured, then it is another story.
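To illustrate: with the default setup the data directory (including the index*.wt files) lives on a PersistentVolumeClaim that is declared per replica set in the custom resource, roughly like this (the storage class and size here are placeholders, check your own cr.yaml):

spec:
  replsets:
    - name: rs0
      volumeSpec:
        persistentVolumeClaim:
          storageClassName: standard   # placeholder, use your cluster's storage class
          resources:
            requests:
              storage: 100Gi           # placeholder size

Such a claim survives pod deletion; emptyDir or hostPath volumes under volumeSpec, on the other hand, live and die with the pod or node.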

Interesting case with the recovery failure. What do you think the Operator should do?
I don’t think the liveness probe should kick in while recovery is going on.

Hi spronin

so only removal of PVC might cause this.

We are using the default settings, which look like this:

storage:
  engine: wiredTiger
  inMemory:
    engineConfig:
      inMemorySizeRatio: 0.9
  wiredTiger:
    collectionConfig:
      blockCompressor: snappy
    engineConfig:
      cacheSizeRatio: 0.5
      directoryForIndexes: false
      journalCompressor: snappy
    indexConfig:
      prefixCompression: true

According to the Percona docs, the in-memory engine should never experience an out-of-memory event (OOMKilled). Yet this is exactly what happened, and I am wondering how this can be.

  1. Is this the usual Kubernetes problem that pods do not respect the memory requests/limits?
  2. Or is inMemorySizeRatio: 0.9 + cacheSizeRatio: 0.5 → 1.4 the problem?

Anyway …
a) Is there a way to turn off the in-memory engine?
b) Is there a way to set the cache size in GB instead of a ratio? Then I would assume it would definitely not use more than the given value. (See the sketch below.)
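For b), what I have in mind would look roughly like this, assuming the replset-level configuration passthrough in newer operator versions can be used to hand an absolute cache size straight to mongod. cacheSizeGB is the standard mongod option; the 2 GB value is only an example, and I have not verified this on our cluster:

spec:
  replsets:
    - name: rs0
      configuration: |
        storage:
          wiredTiger:
            engineConfig:
              cacheSizeGB: 2   # absolute cap instead of cacheSizeRatio

That way the cache would be capped at a fixed value regardless of the node size.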

I don’t think liveness probe should kick in if there is recovery going on.

We currently have ca. 20,000 databases, each with 4 collections, which results in about 80,000 files (collections + indexes). When mongod starts in the container it can take some time, so the default liveness probe value of 30 sec is not sufficient in all cases. We have set it to 5 min and that seems to be OK for now, but on the other hand, if a pod gets moved to another node there are side effects (very long response times). So what is not clear to me is whether the slow startup is due to the in-memory engine or whether MongoDB simply takes that long to open all the files. Or would it help to keep the cache sizes to a minimum?
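For completeness, the memory side of the question: my understanding is that the operator derives the actual WiredTiger cache size from the container memory limit and cacheSizeRatio (please correct me if the exact formula differs between versions), so explicit requests/limits on rs0 would at least make the cache size predictable. A rough sketch, with 8Gi being just an example:

spec:
  replsets:
    - name: rs0
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
        limits:
          cpu: "2"
          memory: 8Gi   # with cacheSizeRatio: 0.5 the cache should be roughly half of this (my assumption)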

Regards
John
