Enabling backups in PSMDB causes threads to grow uncontrollably

Description:

When we set backup.enabled=true in our PerconaServerMongoDB resource, we noticed that about a day later the MongoDB pods were all in CrashLoopBackOff.

After investigating, we discovered that the number of PIDs was exhausted (we have the PID max set to 8192 in our containers).
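For reference, the effective limit can be read from inside the container; a hedged check (the cgroup file path depends on whether the node runs cgroup v1 or v2):

% kubectl -n test exec test-mongodb-rs0-0 -c mongod -- \
    sh -c 'cat /sys/fs/cgroup/pids.max 2>/dev/null || cat /sys/fs/cgroup/pids/pids.max'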

This is observable in real time: deploy a new PerconaServerMongoDB with backup.enabled=false and the output of ps -eLf | sort -k4 | wc -l stays steady at around ~130; with backup.enabled=true the same command shows the count growing by about 3 every 5 seconds.
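A way to confirm which process owns the growing threads is to group the ps -eLf output by PID (this assumes procps-style ps with -L support; each output row is a thread count followed by a PID):

% kubectl -n test exec test-mongodb-rs0-0 -c mongod -- \
    sh -c 'ps -eLf | awk "NR>1 {print \$2}" | sort | uniq -c | sort -rn | head'

With the leak active, the mongod PID's count should be the one climbing.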

The cause has been narrowed down to the Percona Operator for MongoDB: when I run kubectl -n mongodb rollout restart deployment mongodb-operator-psmdb-operator, the thread count drops back to a normal level for a few seconds while the operator restarts, then starts climbing again.

It seems that when backups are enabled, the operator connects to the mongod process continuously, which keeps spawning new threads that never exit.
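If that is the case, the leak should also show up in mongod's own connection counters; a hedged check (the user and password are placeholders, and mongosh being present in the image is an assumption):

% kubectl -n test exec test-mongodb-rs0-0 -c mongod -- \
    mongosh --quiet -u clusterMonitor -p '<password>' --eval 'db.serverStatus().connections'

current climbing in step with the thread count would point at connections being held open, whereas totalCreated climbing while current stays flat would point at connection churn.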

Once the PIDs are exhausted, everything else breaks.

Steps to Reproduce:

Using this manifest:

# Ref: https://raw.githubusercontent.com/percona/percona-server-mongodb-operator/v1.21.0/deploy/cr.yaml
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: "{{ mongodb_config.cluster.name }}"
  namespace: {{ app_namespace }} 
  finalizers:
    - percona.com/delete-psmdb-pods-in-order
spec:
  enableVolumeExpansion: true
  enableExternalVolumeAutoscaling: false
  crVersion: {{ mongodb_config.cluster.version }}
  image: percona/percona-server-mongodb:{{ mongodb_config.images.server }}
  imagePullPolicy: IfNotPresent
  updateStrategy: SmartUpdate
  upgradeOptions:
    versionServiceEndpoint: https://check.percona.com
    apply: disabled
    schedule: "0 2 * * *"
    setFCV: false
  secrets:
    users: {{ mongodb_config.cluster.name }}-secrets
  pmm:
    enabled: false
    image: percona/pmm-client:{{ mongodb_config.images.pmm }}
    serverHost: monitoring-service
  replsets:
  - name: rs0
    configuration: |
      security:
        enableEncryption: false
    size: 3
    affinity:
      antiAffinityTopologyKey: "topology.kubernetes.io/zone"
    sidecars:
    - image: percona/mongodb_exporter:{{ mongodb_config.images.exporter }}
      env:
      - name: EXPORTER_USER
        valueFrom:
          secretKeyRef:
            name: "{{ mongodb_config.cluster.name }}-secrets"
            key: MONGODB_CLUSTER_MONITOR_USER
      - name: EXPORTER_PASS
        valueFrom:
          secretKeyRef:
            name: "{{ mongodb_config.cluster.name }}-secrets"
            key: MONGODB_CLUSTER_MONITOR_PASSWORD
      - name: POD_IP
        valueFrom:
          fieldRef:
            fieldPath: status.podIP
      - name: MONGODB_URI
        value: "mongodb://$(EXPORTER_USER):$(EXPORTER_PASS)@$(POD_IP):27017"
      args: ["--discovering-mode", "--compatible-mode", "--collect-all", "--log.level=debug", "--mongodb.uri=$(MONGODB_URI)"]
      name: metrics
    podDisruptionBudget:
      maxUnavailable: 1
    expose:
      enabled: false
    resources:
      limits:
        cpu: "{{ mongodb_config.resources.limits.cpu }}"
        memory: "{{ mongodb_config.resources.limits.memory }}"
      requests:
        cpu: "{{ mongodb_config.resources.requests.cpu }}"
        memory: "{{ mongodb_config.resources.requests.memory }}"
    volumeSpec:
      persistentVolumeClaim:
        storageClassName: "{{ mongodb_config.storage.className }}"
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: {{ mongodb_config.storage.capacity }}
    hidden:
      size: 1
      enabled: false
    nonvoting:
      size: 1
      enabled: false
    arbiter:
      size: 1
      enabled: false
  backup:
    enabled: true
    image: percona/percona-backup-mongodb:{{ mongodb_config.images.backup }}
    startingDeadlineSeconds: 300
    storages:
      radosgw:
        type: s3
        s3:
          bucket: "{{ mongodb_bucket_name }}"
          credentialsSecret: mongodb-backup
          endpointUrl: "http://rook-ceph-rgw-ceph-objectstore.rook-ceph.svc.cluster.local:8080"
          prefix: ""
          region: ceph-objectstore
    pitr:
      enabled: false
      compressionType: gzip
      compressionLevel: 6
    tasks:
      - name: daily-backup
        enabled: true
        schedule: "27 7 * * *"
        type: physical
        retention:
          count: 30
          type: count
          deleteFromStorage: true
        storageName: radosgw
        compressionType: gzip
        compressionLevel: 6
  logcollector:
    enabled: false
  unsafeFlags:
    tls: true
  tls:
    mode: disabled

with these variables:

mongo_defaults:
  cluster:
    name: "{{ app_namespace }}-mongodb"
    version: 1.21.0
    port: 27017
  images:
    server: 8.0.12-4
    pmm: 3.4.1
    exporter: "0.36"
    backup: 2.11.0
  storage:
    capacity: 5Gi
    className: topolvm-provisioner
  resources:
    limits:
      cpu: 2
      memory: 2Gi
    requests:
      cpu: 100m
      memory: 2Gi

With backup.enabled=true, the thread count grows steadily:

% for i in `seq 10`; do kubectl -n test exec test-mongodb-rs0-0 -c mongod -- ps -eLf | sort -k4 | wc -l; sleep 3; done
     566
     566
     569
     569
     572
     575
     575
     577
     577
     580

After restarting the operator, you can see the PID count drop off and then start climbing again once the operator reconnects.

% kubectl -n mongodb rollout restart deployment mongodb-operator-psmdb-operator
deployment.apps/mongodb-operator-psmdb-operator restarted
% for i in `seq 10`; do kubectl -n test exec test-mongodb-rs0-0 -c mongod -- ps -eLf | sort -k4 | wc -l; sleep 3; done
     608
     611
     118
     118
     123
     118
     118
     121
     121
     124

Version:

    crVersion: 1.21.0
    server: 8.0.12-4
    pmm: 3.4.1
    exporter: "0.36"
    backup: 2.11.0

Logs:

I couldn't see any errors in the logs of the backup-agent or mongod containers, or in the operator logs themselves; everything seems benign.
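For reference, the backup-agent output below can be pulled with (the container name is assumed from the default PSMDB pod layout):

% kubectl -n test logs test-mongodb-rs0-0 -c backup-agent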

2025-12-10T08:30:58.000+0000 I log options: log-path=/dev/stderr, log-level:D, log-json:false
2025-12-10T08:30:58.000+0000 I pbm-agent:
Version:   2.11.0
Platform:  linux/amd64
GitCommit: 6ec4853941922f8414c66d7e31baf9b1fd089267
GitBranch: release-2.11.0
BuildTime: 2025-09-22_11:38_UTC
GoVersion: go1.25.1
2025-12-10T08:30:58.000+0000 I starting PITR routine
2025-12-10T08:30:58.000+0000 I node: rs0/test-mongodb-rs0-0.sina-mongodb-rs0.test.svc.cluster.local:27017
2025-12-10T08:30:58.000+0000 E [agentCheckup] check storage connection: unable to get storage: get config: get: mongo: no documents in result
2025-12-10T08:30:58.000+0000 I conn level ReadConcern: majority; WriteConcern: majority
2025-12-10T08:30:58.000+0000 I listening for the commands
2025-12-10T08:31:01.000+0000 I got command resync <ts: 1765355461>, opid: 69392fc55077786e5fd317f1
2025-12-10T08:31:01.000+0000 I got epoch {1765355460 1}
2025-12-10T08:31:01.000+0000 I [resync] started
2025-12-10T08:31:01.000+0000 D [resync] uploading ".pbm.init" [size hint: 6 (6.00B); part size: 10485760 (10.00MB)]
2025-12-10T08:31:02.000+0000 D [resync] got backups list: 0
2025-12-10T08:31:02.000+0000 D [resync] got physical restores list: 0
2025-12-10T08:31:02.000+0000 D [resync] epoch set to {1765355462 5}
2025-12-10T08:31:02.000+0000 I [resync] succeed

Expected Result:

The thread count should stay steady; the operator should close the connections it opens to mongod once it is done with them.

Actual Result:

The thread count increases uncontrollably until the PID limit is exhausted. I am not 100% sure, but I think the operator is holding connections open instead of closing them properly.
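A hedged way to tell leaked sockets apart from leaked threads is to print both counts together (ss being available in the image is an assumption; netstat -tn would be an alternative):

% kubectl -n test exec test-mongodb-rs0-0 -c mongod -- \
    sh -c 'ss -tn state established "( sport = :27017 )" | wc -l; ps -eLf | wc -l'

If both numbers climb together, each held-open client connection is pinning a mongod worker thread, which matches the behaviour above.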

Oh, I see, this is fixed in 1.21.1 (Jira).

@AptiraSina Yes, we had a hotfix release 1.21.1 to fix this problem.
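For anyone hitting the same thing, a minimal upgrade sketch, assuming the deployment and CR names used in this report (the operator container name and the install method are assumptions; adjust for Helm or your own layout):

% kubectl -n mongodb set image deployment/mongodb-operator-psmdb-operator \
    psmdb-operator=percona/percona-server-mongodb-operator:1.21.1
% kubectl -n test patch psmdb test-mongodb --type=merge -p '{"spec":{"crVersion":"1.21.1"}}'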