After upgrading to Percona Server for MongoDB Operator 1.21.0 (Helm chart psmdb-db 1.21.0), MongoDB pods in a non-sharded replica set (percona/percona-server-mongodb:7.0.24-13) are restarting intermittently. The mongod container logs repeated pthread_create failed errors, and the operator frequently reports “FULL CLUSTER CRASH” followed by leader election lost errors. Rolling back to 1.20.0 immediately resolves the problem.
Steps to Reproduce:
Deploy psmdb-operator Helm chart 1.21.0.
Deploy psmdb-db Helm chart 1.21.0 with image percona/percona-server-mongodb:7.0.24-13.
Observe that MongoDB pods start normally.
After several minutes to hours, the pods begin failing liveness probe checks or hitting internal errors, causing them to restart.
Operator logs show repeated “FULL CLUSTER CRASH” messages and occasionally terminate due to leader election lost.
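For reference, a minimal deployment sketch along these lines reproduces it (the repo URL is the public Percona Helm charts repository; the release names, namespace, and image value keys are my assumptions and may need adjusting):

# add the public Percona charts repository
helm repo add percona https://percona.github.io/percona-helm-charts/
helm repo update

# operator chart 1.21.0 (release name and namespace are placeholders)
helm install psmdb-operator percona/psmdb-operator \
  --version 1.21.0 --namespace psmdb --create-namespace

# database chart 1.21.0 pinned to the affected image
helm install psmdb-db percona/psmdb-db \
  --version 1.21.0 --namespace psmdb \
  --set image.repository=percona/percona-server-mongodb \
  --set image.tag=7.0.24-13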
exec numactl --interleave=all mongod --bind_ip_all --auth --dbpath=/data/db --port=27017 --replSet=rs0 ...
{"t":{"$date":"2025-10-25T13:26:40.744Z"},"s":"W","c":"CONTROL","id":23321,"ctx":"main","msg":"Option: This name is deprecated. Please use the preferred name instead.","attr":{"deprecatedName":"sslPEMKeyFile","preferredName":"tlsCertificateKeyFile"}}
[1761402410:995451][1:0x7f9144a65640], sweep-server: [WT_VERB_DEFAULT][WARNING]: Session 15 did not run a sweep for 60 minutes.
[1761402410:995491][1:0x7f9144a65640], sweep-server: [WT_VERB_DEFAULT][WARNING]: Session 16 did not run a sweep for 60 minutes.
ERROR(4850900): pthread_create failed
ERROR(4850900): pthread_create failed
Operator:
ERROR FULL CLUSTER CRASH error: ping mongo: server selection error: server selection timeout, current topology:
{ Type: ReplicaSetNoPrimary, Servers: [
{ Addr: psmdb-db-rs0-0..., Type: RSSecondary },
{ Addr: psmdb-db-rs0-1..., Type: Unknown, Last error: dial tcp: lookup psmdb-db-rs0-1...: no such host },
{ Addr: psmdb-db-rs0-2..., Type: RSSecondary }
] }
E1027 14:07:15 leaderelection.go:441] Failed to update lock optimistically: context deadline exceeded
E1027 14:07:15 leaderelection.go:448] error retrieving resource lock ...: context deadline exceeded
I1027 14:07:15 leaderelection.go:297] failed to renew lease ...: context deadline exceeded
ERROR setup problem running manager {"error": "leader election lost"}
Expected Result:
MongoDB pods remain healthy and stable.
Operator maintains leadership and performs normal reconciliations without restarts or “FULL CLUSTER CRASH” events.
Actual Result:
MongoDB pods periodically restart as a result of liveness probe failures or internal errors.
Operator logs “FULL CLUSTER CRASH” and loses leadership.
Replica set members temporarily lose connectivity ("no such host" DNS errors).
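The restarts and probe failures are visible with commands like these (the namespace is a placeholder; the pod name matches the default release name from the logs above):

# restart counts climb over time
kubectl get pods -n psmdb
# liveness probe configuration and recent events for one member
kubectl describe pod psmdb-db-rs0-0 -n psmdb
# failed probes show up as Unhealthy events
kubectl get events -n psmdb --field-selector reason=Unhealthy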
Additional Information:
Using TLS mode: preferTLS.
The cluster is running on EKS; reverting both the operator and database to Helm chart version 1.20.0 immediately resolves all issues.
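For completeness, preferTLS is set on the database release roughly like this (the tls.mode key mirrors the CR field of the same name; whether the chart exposes it under exactly this key is an assumption):

# switch the existing release to preferTLS without touching other values
helm upgrade psmdb-db percona/psmdb-db \
  --version 1.21.0 --namespace psmdb \
  --reuse-values \
  --set tls.mode=preferTLS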
Thanks for reporting this issue — I’ll try to reproduce it.
From the steps you provided, it’s not clear to me whether the issue occurs only when upgrading from a previous chart version, on a fresh deployment, or in both cases.
If it happens during an upgrade, could you share the exact steps you followed for the upgrade process?
Also, could you let me know the Kubernetes version of your EKS cluster?
Thanks for your message!
At first I thought the issue only happened during upgrades, but I redeployed a new database from scratch and still saw the same behavior — so it happens in both cases.
During the upgrade, I updated the operator chart first, then the database chart.
I’m running on EKS 1.30, but I also tried on 1.33 and the issue still occurs.
I’m upgrading through ArgoCD with syncOptions: ServerSideApply=true.
I can confirm that the CRDs already include the new labels introduced in this release, and the new logcollector container is being deployed in the pods.
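For reference, the sync option is enabled like this (the application name is a placeholder; this is the argocd CLI equivalent of the syncOptions entry in the Application manifest):

# enable Server-Side Apply for the Application managing the database chart
argocd app set psmdb-db --sync-option ServerSideApply=true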
I’ve been searching for a solution to this problem for days and came across this topic. I encountered the same problem in my 3-shard setup. Simply downgrading the operator to 1.20 solved the issues.
I use Percona (it was a fresh install, not an upgrade) on RKE2 + Rancher.
Thanks for replying, @shepz. I tried to reproduce the issue on EKS 1.33 but had no luck. Based on your description, I’m using psmdb-operator version 1.21.0 and deploying the psmdb-db chart with the image you reported (percona/percona-server-mongodb:7.0.24-13).
The deployment has been running for over an hour without any issues.
@shepz @Islam_Saka Could you share more details about your CRs or Helm values to help pinpoint the problem? Also, knowing the exact image versions you were using at the time of the error would be helpful.
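For example, something along these lines would capture what we need (namespace and release names are assumptions, adjust to your setup):

# the PerconaServerMongoDB custom resource (psmdb is its short name)
kubectl get psmdb -n psmdb -o yaml
# the user-supplied Helm values for both releases
helm get values psmdb-db -n psmdb
helm get values psmdb-operator -n psmdb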
Can you try reproducing by upgrading the operator to 1.21.0 or 1.21.1? (The RKE2 charts provide two different versions, and both of them produce the same result.)
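For the upstream charts the equivalent would be something like this (release name and namespace are placeholders; the Rancher/RKE2 chart names may differ):

# upgrade only the operator release to the affected version
helm upgrade psmdb-operator percona/psmdb-operator \
  --version 1.21.1 --namespace psmdb --reuse-values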
We discovered a connection leak issue in version 1.21.0, and we will release a hotfix soon, as discussed in this topic: Percona Operator for MongoDB endlessly spawning connections until OOMKilled
It’s likely that the number of connections keeps increasing until it reaches your configured memory limits, causing the pods to restart.
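If you want to verify that you are hitting the same leak, the connection count on the replica set members should climb steadily. A rough check, assuming mongosh is available in the mongod container and substituting your own namespace and admin credentials:

# connection counters on one member
kubectl exec -n psmdb psmdb-db-rs0-0 -c mongod -- \
  mongosh --quiet -u <admin-user> -p <password> \
  --eval 'db.serverStatus().connections'

# pod memory usage (requires metrics-server)
kubectl top pod -n psmdb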
Hey @Julio_Pasinatto, thanks for the follow-up! I deployed the hotfix in my dev environment, and so far the pods haven’t restarted. I’ll keep an eye on it over the next couple of days and let you know how it goes.