Percona Cluster crashed and does not want to start up

Hello

Currently the replica set does not want to restart:

NAME READY STATUS RESTARTS AGE
luz-mongodb-cluster-cfg-0 2/2 Running 0 9d
luz-mongodb-cluster-cfg-1 2/2 Running 0 9d
luz-mongodb-cluster-cfg-2 2/2 Running 0 9d
luz-mongodb-cluster-mongos-79d78475f6-852mp 0/1 Running 0 9d
luz-mongodb-cluster-mongos-79d78475f6-tmvbb 0/1 Running 0 9d
luz-mongodb-cluster-mongos-79d78475f6-tsxwm 0/1 Running 0 9d
luz-mongodb-cluster-rs0-0 0/1 CrashLoopBackOff 6 9d
luz-mongodb-cluster-rs0-1 0/1 CrashLoopBackOff 10 9d
luz-mongodb-cluster-rs0-2 0/1 Running 8 9d
percona-server-mongodb-operator-586b769b44-vvhcp 1/1 Running 0 10d

Thanks & Regards
John

Replying to my own question:

  • if the rs0 pods have been OOM-killed they will not recover on their own
  • you have to delete the rs0 pods
  • after rs0-0 starts, the recovery will fail with the default settings (depending on the amount of data), because the pod gets killed by the liveness probe
  • setting the liveness probe to 5 min (or even higher) lets the recovery run (see the sketch below)

→ cluster is up again
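For reference, a rough sketch of how the liveness probe budget can be raised for the rs0 replica set in the PerconaServerMongoDB custom resource. The field names follow the operator's example cr.yaml, so verify them against the CR version you are running; the values are just the ones that worked for us, not recommendations:

spec:
  replsets:
    - name: rs0
      size: 3
      livenessProbe:
        initialDelaySeconds: 300   # give mongod roughly 5 min before the first probe
        periodSeconds: 30
        timeoutSeconds: 10
        failureThreshold: 6

With a larger probe budget the recovery on rs0-0 had enough time to finish instead of being killed halfway through.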

Maybe a question for Percona: we lost all the index*.wt files after deleting the rs0 pods. How is this possible?

Hello @jamoser,

How is your storage configured? Normally the index files are stored on the Persistent Volume Claims, so only removal of the PVC could cause this.
But if you have local storage configured, then it is another story.
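To illustrate: with the default setup the data directory (including the index*.wt files) lives on a PersistentVolumeClaim that is declared per replica set in the custom resource, roughly like this (the storage class and size here are placeholders, check your own cr.yaml):

spec:
  replsets:
    - name: rs0
      volumeSpec:
        persistentVolumeClaim:
          storageClassName: standard   # placeholder, use your cluster's storage class
          resources:
            requests:
              storage: 100Gi           # placeholder size

Such a claim survives pod deletion; emptyDir or hostPath volumes under volumeSpec, on the other hand, live and die with the pod or node.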

Interesting case with the recovery failure. What do you think the Operator should do?
I don’t think the liveness probe should kick in while recovery is going on.

Hi spronin

so only removal of PVC might cause this.

We are using the default settings, which look like this:

storage:
  engine: wiredTiger
  inMemory:
    engineConfig:
      inMemorySizeRatio: 0.9
  wiredTiger:
    collectionConfig:
      blockCompressor: snappy
    engineConfig:
      cacheSizeRatio: 0.5
      directoryForIndexes: false
      journalCompressor: snappy
    indexConfig:
      prefixCompression: true

According to the Percona docs, the in-memory engine should never experience an out-of-memory event (OOMKilled). Yet this is exactly what happened, and I am wondering how this can be.

  1. Is this the usual Kubernetes problem that pods do not respect the memory requests/limits?
  2. Or is inMemorySizeRatio: 0.9 + cacheSizeRatio: 0.5 → 1.4 the problem?

Anyway …
a) Is there a way to turn off the in-memory engine?
b) Is there a way to set the cache size in GB instead of a ratio? Then I would assume it would definitely not use more than the given value. (See the sketch below.)
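For b), what I have in mind would look roughly like this, assuming the replset-level configuration passthrough in newer operator versions can be used to hand an absolute cache size straight to mongod. cacheSizeGB is the standard mongod option; the 2 GB value is only an example, and I have not verified this on our cluster:

spec:
  replsets:
    - name: rs0
      configuration: |
        storage:
          wiredTiger:
            engineConfig:
              cacheSizeGB: 2   # absolute cap instead of cacheSizeRatio

That way the cache would be capped at a fixed value regardless of the node size.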

I don’t think liveness probe should kick in if there is recovery going on.

We currently have ca. 20,000 databases, each with 4 collections, which results in about 80,000 files (collections + indexes). When mongod starts in the container it can take some time, so the default liveness probe value of 30 sec is not sufficient in all cases. We have set it to 5 min and that seems to be OK for now, but on the other hand, if a pod gets moved to another node there are side effects (very long response times). So what is not clear to me is whether the slow startup is due to the in-memory engine or whether MongoDB simply takes that long to open all the files. Or would it help to keep the cache sizes to a minimum?
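For completeness, the memory side of the question: my understanding is that the operator derives the actual WiredTiger cache size from the container memory limit and cacheSizeRatio (please correct me if the exact formula differs between versions), so explicit requests/limits on rs0 would at least make the cache size predictable. A rough sketch, with 8Gi being just an example:

spec:
  replsets:
    - name: rs0
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
        limits:
          cpu: "2"
          memory: 8Gi   # with cacheSizeRatio: 0.5 the cache should be roughly half of this (my assumption)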

Regards
John
