Description:
When we set backup.enabled=true in the PerconaServerMongoDB custom resource, we noticed that about one day later all of the MongoDB pods are in CrashLoopBackOff.
After investigating, we discovered that the PID limit had been exhausted (we have the PID max set to 8192 in our containers).
This is observable in real time: with a freshly deployed PerconaServerMongoDB and backup.enabled=false, the output of ps -eLf | sort -k4 | wc -l stays steady at around ~130; with backup.enabled=true, the count from the same command grows by about 3 every 5 seconds.
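For reference, the container's PID accounting can be checked directly. This is a sketch using the pod name from the reproduction below; it assumes cgroup v2 paths (on cgroup v1 nodes the files are under /sys/fs/cgroup/pids/ instead):
% kubectl -n test exec test-mongodb-rs0-0 -c mongod -- cat /sys/fs/cgroup/pids.max /sys/fs/cgroup/pids.current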
The cause has been narrowed down to the Percona Operator for MongoDB: when I run kubectl -n mongodb rollout restart deployment mongodb-operator-psmdb-operator, the thread count drops back to a normal level for a few seconds until the operator comes back up, and then starts climbing again.
It seems that when backups are enabled, the operator connects to the mongod process continuously, which keeps spawning new threads that never exit.
Once the PIDs are exhausted, everything else breaks.
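A rough way to check whether the extra threads correspond to leaked client connections is to watch the server-side connection counters. This sketch assumes the clusterAdmin user from the generated users secret (substitute the real password):
% kubectl -n test exec test-mongodb-rs0-0 -c mongod -- \
    mongosh --quiet -u clusterAdmin -p '<clusterAdmin password>' --authenticationDatabase admin \
    --eval 'db.serverStatus().connections'
If connections.current (and connections.totalCreated) grows in lockstep with the thread count, something on the client side is opening connections without ever closing them.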
Steps to Reproduce:
Using this:
# Ref: https://raw.githubusercontent.com/percona/percona-server-mongodb-operator/v1.21.0/deploy/cr.yaml
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: "{{ mongodb_config.cluster.name }}"
  namespace: {{ app_namespace }}
  finalizers:
    - percona.com/delete-psmdb-pods-in-order
spec:
  enableVolumeExpansion: true
  enableExternalVolumeAutoscaling: false
  crVersion: {{ mongodb_config.cluster.version }}
  image: percona/percona-server-mongodb:{{ mongodb_config.images.server }}
  imagePullPolicy: IfNotPresent
  updateStrategy: SmartUpdate
  upgradeOptions:
    versionServiceEndpoint: https://check.percona.com
    apply: disabled
    schedule: "0 2 * * *"
    setFCV: false
  secrets:
    users: {{ mongodb_config.cluster.name }}-secrets
  pmm:
    enabled: false
    image: percona/pmm-client:{{ mongodb_config.images.pmm }}
    serverHost: monitoring-service
  replsets:
    - name: rs0
      configuration: |
        security:
          enableEncryption: false
      size: 3
      affinity:
        antiAffinityTopologyKey: "topology.kubernetes.io/zone"
      sidecars:
        - image: percona/mongodb_exporter:{{ mongodb_config.images.exporter }}
          env:
            - name: EXPORTER_USER
              valueFrom:
                secretKeyRef:
                  name: "{{ mongodb_config.cluster.name }}-secrets"
                  key: MONGODB_CLUSTER_MONITOR_USER
            - name: EXPORTER_PASS
              valueFrom:
                secretKeyRef:
                  name: "{{ mongodb_config.cluster.name }}-secrets"
                  key: MONGODB_CLUSTER_MONITOR_PASSWORD
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: MONGODB_URI
              value: "mongodb://$(EXPORTER_USER):$(EXPORTER_PASS)@$(POD_IP):27017"
          args: ["--discovering-mode", "--compatible-mode", "--collect-all", "--log.level=debug", "--mongodb.uri=$(MONGODB_URI)"]
          name: metrics
      podDisruptionBudget:
        maxUnavailable: 1
      expose:
        enabled: false
      resources:
        limits:
          cpu: "{{ mongodb_config.resources.limits.cpu }}"
          memory: "{{ mongodb_config.resources.limits.memory }}"
        requests:
          cpu: "{{ mongodb_config.resources.requests.cpu }}"
          memory: "{{ mongodb_config.resources.requests.memory }}"
      volumeSpec:
        persistentVolumeClaim:
          storageClassName: "{{ mongodb_config.storage.className }}"
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: {{ mongodb_config.storage.capacity }}
      hidden:
        size: 1
        enabled: false
      nonvoting:
        size: 1
        enabled: false
      arbiter:
        size: 1
        enabled: false
  backup:
    enabled: true
    image: percona/percona-backup-mongodb:{{ mongodb_config.images.backup }}
    startingDeadlineSeconds: 300
    storages:
      radosgw:
        type: s3
        s3:
          bucket: "{{ mongodb_bucket_name }}"
          credentialsSecret: mongodb-backup
          endpointUrl: "http://rook-ceph-rgw-ceph-objectstore.rook-ceph.svc.cluster.local:8080"
          prefix: ""
          region: ceph-objectstore
    pitr:
      enabled: false
      compressionType: gzip
      compressionLevel: 6
    tasks:
      - name: daily-backup
        enabled: true
        schedule: "27 7 * * *"
        type: physical
        retention:
          count: 30
          type: count
          deleteFromStorage: true
        storageName: radosgw
        compressionType: gzip
        compressionLevel: 6
  logcollector:
    enabled: false
  unsafeFlags:
    tls: true
  tls:
    mode: disabled
with these variables:
mongo_defaults:
  cluster:
    name: "{{ app_namespace }}-mongodb"
    version: 1.21.0
    port: 27017
  images:
    server: 8.0.12-4
    pmm: 3.4.1
    exporter: "0.36"
    backup: 2.11.0
  storage:
    capacity: 5Gi
    className: topolvm-provisioner
  resources:
    limits:
      cpu: 2
      memory: 2Gi
    requests:
      cpu: 100m
      memory: 2Gi
When backup.enabled=true:
% for i in `seq 10`; do kubectl -n test exec test-mongodb-rs0-0 -c mongod -- ps -eLf | sort -k4 | wc -l; sleep 3; done
566
566
569
569
572
575
575
577
577
580
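To confirm the growth is in mongod itself rather than a sidecar, the thread count can be broken down per process; this sketch assumes the procps ps shipped in the image:
% kubectl -n test exec test-mongodb-rs0-0 -c mongod -- ps -eo pid,nlwp,args --sort=-nlwp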
After restarting the operator, you can see the PID count drop off and then start climbing again as the operator reconnects.
% kubectl -n mongodb rollout restart deployment mongodb-operator-psmdb-operator
deployment.apps/mongodb-operator-psmdb-operator restarted
% for i in `seq 10`; do kubectl -n test exec test-mongodb-rs0-0 -c mongod -- ps -eLf | sort -k4 | wc -l; sleep 3; done
608
611
118
118
123
118
118
121
121
124
Version:
crVersion: 1.21.0
server: 8.0.12-4
pmm: 3.4.1
exporter: "0.36"
backup: 2.11.0
Logs:
I couldn’t see any errors in the logs of the backup-agent or mongod containers, or in the operator logs themselves; everything looks benign.
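For anyone reproducing, the relevant logs can be pulled with something like (container and deployment names as used above):
% kubectl -n test logs test-mongodb-rs0-0 -c backup-agent
% kubectl -n test logs test-mongodb-rs0-0 -c mongod
% kubectl -n mongodb logs deployment/mongodb-operator-psmdb-operator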
2025-12-10T08:30:58.000+0000 I log options: log-path=/dev/stderr, log-level:D, log-json:false
2025-12-10T08:30:58.000+0000 I pbm-agent:
Version: 2.11.0
Platform: linux/amd64
GitCommit: 6ec4853941922f8414c66d7e31baf9b1fd089267
GitBranch: release-2.11.0
BuildTime: 2025-09-22_11:38_UTC
GoVersion: go1.25.1
2025-12-10T08:30:58.000+0000 I starting PITR routine
2025-12-10T08:30:58.000+0000 I node: rs0/test-mongodb-rs0-0.sina-mongodb-rs0.test.svc.cluster.local:27017
2025-12-10T08:30:58.000+0000 E [agentCheckup] check storage connection: unable to get storage: get config: get: mongo: no documents in result
2025-12-10T08:30:58.000+0000 I conn level ReadConcern: majority; WriteConcern: majority
2025-12-10T08:30:58.000+0000 I listening for the commands
2025-12-10T08:31:01.000+0000 I got command resync <ts: 1765355461>, opid: 69392fc55077786e5fd317f1
2025-12-10T08:31:01.000+0000 I got epoch {1765355460 1}
2025-12-10T08:31:01.000+0000 I [resync] started
2025-12-10T08:31:01.000+0000 D [resync] uploading ".pbm.init" [size hint: 6 (6.00B); part size: 10485760 (10.00MB)]
2025-12-10T08:31:02.000+0000 D [resync] got backups list: 0
2025-12-10T08:31:02.000+0000 D [resync] got physical restores list: 0
2025-12-10T08:31:02.000+0000 D [resync] epoch set to {1765355462 5}
2025-12-10T08:31:02.000+0000 I [resync] succeed
Expected Result:
The operator should close its connections to mongod instead of holding them open (I am not 100% sure, but holding connections open without closing them properly looks like the cause), so the thread count should stay steady rather than increasing uncontrollably until PID exhaustion.
Actual Result:
With backup.enabled=true, the mongod thread count grows continuously until the PID limit is exhausted and the MongoDB pods end up in CrashLoopBackOff.