Percona Operator for MongoDB endlessly spawning connections until OOMKilled

Description:

Percona Operator for MongoDB 1.21.0 continually creates new connections to the cluster until the operator pod crashes from being OOMKilled. Sorry if this post is lacking information, as it’s my first post here :slight_smile:

Steps to Reproduce:

Launch Percona Operator 1.21.0 on an existing cluster with a PerconaServerMongoDB deployment on AWS EKS 1.34, following the docs: Install on Amazon Elastic Kubernetes Service (AWS EKS) - Percona Operator for MongoDB
After the operator is live, query your MongoDB cluster for the list of connections grouped by client IP, ordered by count:

db.getSiblingDB("admin").aggregate([
  // List operations for all users, including idle connections
  { $currentOp: { allUsers: true, idleConnections: true } },
  // "client" is "<ip>:<port>"; keep just the IP
  { $project: {
      clientIP: {
        $arrayElemAt: [{ $split: ["$client", ":"] }, 0]
      }
    }
  },
  // Count connections per client IP
  { $group: {
      _id: "$clientIP",
      connections: { $sum: 1 }
    }
  },
  { $sort: { connections: -1 } },
  { $limit: 20 }
])
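
As a quick sanity check alongside the aggregation, the server-wide connection counters show the totals directly. A minimal sketch, assuming mongosh is available and using a placeholder connection string (adjust user/host to your cluster):

mongosh "mongodb://clusterAdmin:<password>@db1-rs0.psmdb.svc.cluster.local/admin?replicaSet=rs0" \
  --quiet --eval 'db.serverStatus().connections'
# Prints { current, available, totalCreated, ... }; a steadily climbing
# "current" is the symptom described in this report.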

Resolve the top result’s IP to its Pod: kubectl get pods -A -o wide | grep <IP>
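
To confirm the operator pod is actually being OOMKilled rather than evicted, you can check the container’s last termination reason. The label selector and namespace below assume the Helm chart defaults; adjust to your install:

kubectl -n psmdb get pods -l app.kubernetes.io/name=psmdb-operator
kubectl -n psmdb get pod <operator-pod> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Prints "OOMKilled" if the previous container instance exceeded its memory
# limit; empty output means the container has not been restarted yet.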

Version:

1.21.0

Logs:

2025-10-28T06:37:46.453Z        INFO    setup   Manager starting up     {"gitCommit": "c7a8f111326700320a918e134b8522f79e702cc1", "gitBranch": "release-1-21-0", "buildTime": "", "goVersion": "go1.25.3", "os": "linux", "arch": "amd64"}
2025-10-28T06:37:46.481Z        INFO    server version  {"platform": "kubernetes", "version": "v1.34.1-eks-d96d92f"}
2025-10-28T06:37:46.498Z        INFO    controller-runtime.metrics      Starting metrics server
2025-10-28T06:37:46.498Z        INFO    starting server {"name": "health probe", "addr": "[::]:8081"}
I1028 06:37:46.498849       1 leaderelection.go:257] attempting to acquire leader lease psmdb/08db0feb.percona.com...
2025-10-28T06:37:46.499Z        INFO    controller-runtime.metrics      Serving metrics server  {"bindAddress": ":8080", "secure": false}
I1028 06:38:04.781174       1 leaderelection.go:271] successfully acquired lease psmdb/08db0feb.percona.com
2025-10-28T06:38:04.781Z        INFO    Starting EventSource    {"controller": "psmdb-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDB", "source": "kind source: *v1.PerconaServerMongoDB"}
2025-10-28T06:38:04.782Z        INFO    Starting EventSource    {"controller": "psmdbrestore-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBRestore", "source": "kind source: *v1.Pod"}
2025-10-28T06:38:04.782Z        INFO    Starting EventSource    {"controller": "psmdbbackup-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBBackup", "source": "kind source: *v1.Pod"}
2025-10-28T06:38:04.782Z        INFO    Starting EventSource    {"controller": "psmdbrestore-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBRestore", "source": "kind source: *v1.PerconaServerMongoDBRestore"}
2025-10-28T06:38:04.782Z        INFO    Starting EventSource    {"controller": "psmdbbackup-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBBackup", "source": "kind source: *v1.PerconaServerMongoDBBackup"}
2025-10-28T06:38:04.890Z        INFO    Starting Controller     {"controller": "psmdb-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDB"}
2025-10-28T06:38:04.891Z        INFO    Starting workers        {"controller": "psmdb-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDB", "worker count": 1}
2025-10-28T06:38:04.891Z        INFO    Starting Controller     {"controller": "psmdbrestore-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBRestore"}
2025-10-28T06:38:04.891Z        INFO    Starting workers        {"controller": "psmdbrestore-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBRestore", "worker count": 1}
2025-10-28T06:38:04.891Z        INFO    Starting Controller     {"controller": "psmdbbackup-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBBackup"}
2025-10-28T06:38:04.891Z        INFO    Starting workers        {"controller": "psmdbbackup-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBBackup", "worker count": 1}
2025-10-28T06:38:05.037Z        INFO    Creating or updating backup job {"controller": "psmdb-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDB", "PerconaServerMongoDB": {"name":"db1","namespace":"psmdb"}, "namespace": "psmdb", "name": "db1", "reconcileID": "5474899e-98d2-4dd1-9ceb-541a9c6c59c5", "name": "db1-daily", "namespace": "psmdb", "schedule": "0 12 * * *"}
2025-10-28T06:38:05.037Z        INFO    Creating or updating backup job {"controller": "psmdb-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDB", "PerconaServerMongoDB": {"name":"db1","namespace":"psmdb"}, "namespace": "psmdb", "name": "db1", "reconcileID": "5474899e-98d2-4dd1-9ceb-541a9c6c59c5", "name": "db1-weekly", "namespace": "psmdb", "schedule": "0 0 * * 6"}
2025-10-28T06:38:05.037Z        INFO    Creating or updating backup job {"controller": "psmdb-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDB", "PerconaServerMongoDB": {"name":"db1","namespace":"psmdb"}, "namespace": "psmdb", "name": "db1", "reconcileID": "5474899e-98d2-4dd1-9ceb-541a9c6c59c5", "name": "db1-monthly", "namespace": "psmdb", "schedule": "0 0 1 * *"}
2025-10-28T06:38:05.763Z        INFO    add new job     {"controller": "psmdb-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDB", "PerconaServerMongoDB": {"name":"db1","namespace":"psmdb"}, "namespace": "psmdb", "name": "db1", "reconcileID": "5474899e-98d2-4dd1-9ceb-541a9c6c59c5", "job": "ensure-version/psmdb/db1", "name": "ensure-version/psmdb/db1", "schedule": "0 2 * * *"}
2025-10-28T06:38:05.763Z        INFO    add new job     {"controller": "psmdb-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDB", "PerconaServerMongoDB": {"name":"db1","namespace":"psmdb"}, "namespace": "psmdb", "name": "db1", "reconcileID": "5474899e-98d2-4dd1-9ceb-541a9c6c59c5", "job": "telemetry/psmdb/db1", "name": "telemetry/psmdb/db1", "schedule": "38 * * * *"}
2025-10-28T06:38:59.585Z        INFO    PBM     updating latest restorable time {"controller": "psmdb-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDB", "PerconaServerMongoDB": {"name":"db1","namespace":"psmdb"}, "namespace": "psmdb", "name": "db1", "reconcileID": "e6c100f4-8801-4700-803c-374d0e5c72c8", "backup": "cron-db1-20251027120000-27lsh", "latestRestorableTime": "2025-10-28 06:38:53 +0000 UTC"}
2025-10-28T06:48:54.010Z        INFO    PBM     updating latest restorable time {"controller": "psmdb-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDB", "PerconaServerMongoDB": {"name":"db1","namespace":"psmdb"}, "namespace": "psmdb", "name": "db1", "reconcileID": "85d397d9-b39b-4744-a975-d82c0fddc455", "backup": "cron-db1-20251027120000-27lsh", "latestRestorableTime": "2025-10-28 06:48:53 +0000 UTC"}

The logs don’t indicate anything out of the ordinary (to my knowledge).

Expected Result:

A modest, steady number of connections is opened for reconciliation and health checks.

Actual Result:

Over 8,000 connections are spawned before the operator pod is OOMKilled, and then the cycle starts again, as shown by the connection summaries from the PMM instance monitoring the cluster.

Additional Information:

New connections are being opened at a rate of roughly one or more per second.
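
One rough way to measure that rate (this is a sketch; the URI is a placeholder and mongosh is assumed on the PATH) is to sample the server’s totalCreated counter twice and divide by the interval:

URI="mongodb://clusterAdmin:<password>@db1-rs0.psmdb.svc.cluster.local/admin?replicaSet=rs0"
A=$(mongosh "$URI" --quiet --eval 'Number(db.serverStatus().connections.totalCreated)')
sleep 60
B=$(mongosh "$URI" --quiet --eval 'Number(db.serverStatus().connections.totalCreated)')
echo "$(( (B - A) / 60 )) new connections/s (averaged over 60 s)"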

cr.yaml
apiVersion: psmdb.percona.com/v1  
kind: PerconaServerMongoDB  
metadata:  
  name: **********************  
  namespace: **********************  
  finalizers:  
    - percona.com/delete-psmdb-pods-in-order  
spec:  
  pause: false  
  enableVolumeExpansion: false  
  crVersion: 1.21.0  
  image: percona/percona-server-mongodb:7.0.18-11  
  imagePullPolicy: Always  
  updateStrategy: SmartUpdate  
  upgradeOptions:  
    versionServiceEndpoint: https://check.percona.com  
    apply: disabled  
    schedule: "0 2 * * *"  
    setFCV: false  
  secrets:  
    users: **********************  
    encryptionKey: **********************  
  pmm:  
    enabled: true  
    image: perconalab/pmm-client:3  
    serverHost: monitoring-service  
  replsets:  
    - name: rs0  
      size: 3  
      configuration: |
        operationProfiling:
          slowOpThresholdMs: 200
          mode: slowOp
          rateLimit: 100
      affinity:
        advanced:  
          nodeAffinity:  
            preferredDuringSchedulingIgnoredDuringExecution:  
              - preference:  
                  matchExpressions:  
                    - key: workload  
                      operator: In  
                      values:  
                        - mongodb  
                weight: 50  
      podDisruptionBudget:  
        maxUnavailable: 1  
      expose:  
        enabled: false  
      resources:  
        limits:  
          cpu: "4"  
          memory: "16G"  
        requests:  
          cpu: "3"  
          memory: "12G"  
      volumeSpec:  
        persistentVolumeClaim:  
          storageClassName: auto-ebs-dbstorage-1  
          resources:  
            requests:  
              storage: 220Gi  
  backup:  
    enabled: true  
    image: percona/percona-backup-mongodb:2.9.1  
    storages:  
      **********************:  
        main: true  
        type: s3  
        s3:  
          bucket: **********************  
          credentialsSecret: **********************  
          endpointUrl: **********************  
          region: **********************
          serverSideEncryption: { }  
    pitr:  
      enabled: true  
      oplogOnly: false  
      compressionType: gzip  
      compressionLevel: 6  
    tasks:  
      # =============  
      #  DAILY BACKUP at 12PM UTC 0
      # =============
      - enabled: true  
        retention:  
          count: 3 # keep 1 backup each day, 3 days max  
          type: count  
          deleteFromStorage: true  
        type: logical # best for storage, slower  
        name: **********************  
        schedule: 0 12 * * * # daily at 12PM UTC 0  
        storageName: **********************
  
      # =============  
      #  WEEKLY BACKUP at SATURDAY 12AM UTC 0
      # =============
      - enabled: true  
        retention:  
          count: 2 # keep 1 backup each week, 2 weeks max  
          type: count  
          deleteFromStorage: true  
        type: logical # best for storage, slower  
        name: **********************  
        schedule: 0 0 * * 6 # weekly at 12AM UTC 0 Saturday  
        storageName: **********************  
  
      # =============  
      #  MONTHLY BACKUP on the 1st of the month at 12AM UTC
      # ============= 
      - enabled: true  
        retention:  
          count: 3 # keep 1 backup each month, 3 months max  
          type: count  
          deleteFromStorage: true  
        type: logical # best for storage, slower  
        name: **********************  
        schedule: 0 0 1 * * # monthly on the 1st at 12AM UTC
        storageName: mdb-publishing-backup

Same here!

OpenShift: 4.18.21 on AWS ROSA
Kubernetes: 1.31.10
Operator: 1.21.0

@Pim @GianniC could you please provide your CRs? It will speed up our research.

Here is our CR. Or do you need any other specific CRs?

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  annotations:
  creationTimestamp: "2025-10-21T15:03:56Z"
  generation: 7
  name: company-app-mongodb-cluster
  namespace: company-namespace
spec:
  allowUnsafeConfigurations: false
  backup:
    enabled: true
    image: percona/percona-backup-mongodb:2.9.1
    resources:
      limits:
        cpu: 300m
        memory: 0.5G
    serviceAccountName: forms-operator
    storages:
      company-formio-backup:
        s3:
          bucket: backup-forms-production
          credentialsSecret: s3backup
          region: eu-central-1
        type: s3
    tasks:
    - compressionType: gzip
      enabled: true
      keep: 14
      name: backup
      schedule: 0 0 * * *
      storageName: company-formio-backup
  crVersion: 1.20.1
  image: percona/percona-server-mongodb:6.0.21
  imagePullPolicy: Always
  imagePullSecrets:
  - name: mk8s-imagepullsecret
  pmm:
    enabled: false
    image: percona/pmm-client:3.1.0
    serverHost: monitoring-service
  replsets:
  - affinity:
      antiAffinityTopologyKey: kubernetes.io/hostname
    arbiter:
      affinity:
        antiAffinityTopologyKey: kubernetes.io/hostname
      enabled: false
      size: 1
    configuration: |
      net:
        port: 27017
      security:
        redactClientLogData: false
        enableEncryption: true
        encryptionCipherMode: AES256-CBC
      setParameter:
        ttlMonitorSleepSecs: 60
        wiredTigerConcurrentReadTransactions: 128
        wiredTigerConcurrentWriteTransactions: 128
      storage:
        engine: wiredTiger
        wiredTiger:
          engineConfig:
            directoryForIndexes: false
            journalCompressor: snappy
          collectionConfig:
            blockCompressor: snappy
          indexConfig:
            prefixCompression: true
      operationProfiling:
        mode: slowOp
        slowOpThresholdMs: 100
    containerSecurityContext:
      runAsNonRoot: true
    expose:
      enabled: false
      exposeType: LoadBalancer
    name: rs0
    podDisruptionBudget:
      maxUnavailable: 1
    resources:
      limits:
        cpu: 1000m
        memory: 4G
      requests:
        cpu: 1000m
        memory: 4G
    size: 3
    volumeSpec:
      persistentVolumeClaim:
        resources:
          requests:
            storage: 3Gi
        storageClassName: gp3-csi
  secrets:
    users: company-namespace-cluster-secret
  unsafeFlags:
    replsetSize: true
status:
  backupVersion: 2.9.1
  conditions:
  - lastTransitionTime: "2025-10-21T15:03:56Z"
    status: "False"
    type: sharding
  - lastTransitionTime: "2025-10-21T15:04:01Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2025-10-21T15:05:37Z"
    message: 'manage sys users: undefined or not exist user name MONGODB_DATABASE_ADMIN_USER'
    reason: ErrorReconcile
    status: "True"
    type: error
  - lastTransitionTime: "2025-10-21T15:21:42Z"
    status: "True"
    type: ready
  host: company-app-mongodb-cluster-rs0.company-namespace.svc.cluster.local
  mongoImage: percona/percona-server-mongodb:6.0.21
  mongoVersion: 6.0.21-18
  observedGeneration: 7
  ready: 3
  replsets:
    rs0:
      initialized: true
      members:
        company-app-mongodb-cluster-rs0-0:
          name: company-app-mongodb-cluster-rs0-0.company-app-mongodb-cluster-rs0.company-namespace.svc.cluster.local:27017
          state: 2
          stateStr: SECONDARY
        company-app-mongodb-cluster-rs0-1:
          name: company-app-mongodb-cluster-rs0-1.company-app-mongodb-cluster-rs0.company-namespace.svc.cluster.local:27017
          state: 2
          stateStr: SECONDARY
        company-app-mongodb-cluster-rs0-2:
          name: company-app-mongodb-cluster-rs0-2.company-app-mongodb-cluster-rs0.company-namespace.svc.cluster.local:27017
          state: 1
          stateStr: PRIMARY
      ready: 3
      size: 3
      status: ready
  size: 3
  state: ready
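
As an aside, the error condition in that status (“undefined or not exist user name MONGODB_DATABASE_ADMIN_USER”) usually points at a missing key in the users Secret rather than at this connection leak. A quick check, assuming jq is installed:

kubectl -n company-namespace get secret company-namespace-cluster-secret -o json \
  | jq -r '.data | keys[]'
# The system-users Secret should list MONGODB_DATABASE_ADMIN_USER and
# MONGODB_DATABASE_ADMIN_PASSWORD among its keys.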

Similar problem here, triggered by backups being enabled: the operator-pbm-ctl user constantly creates new connections until the replica pods OOM.

A few approaches that helped with the problem:

  • Having backup disabled (and restarting the operator pod)
  • Having backup enabled and using operator image 1.20.1 with the operator chart 1.21.0 (the solution we use now; see the Terraform snippet below)
resource "helm_release" "mongodb_operator" {
  name       = "mongodb-operator"
  repository = "https://percona.github.io/percona-helm-charts/"
  chart      = "psmdb-operator"
  version    = "1.21.0"
  namespace  = kubernetes_namespace_v1.mongodb.metadata[0].name

  values = [yamlencode({
    image = {
      tag = "1.20.1"
    }
  })]
}
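
For anyone not managing the operator with Terraform, here are roughly equivalent steps with kubectl and the Helm CLI; the release name, namespace, CR name, and Deployment name below are placeholders from our setup:

# Workaround 1: disable backups in the CR, then restart the operator pod.
kubectl -n psmdb patch psmdb db1 --type merge -p '{"spec":{"backup":{"enabled":false}}}'
kubectl -n psmdb rollout restart deployment/mongodb-operator-psmdb-operator

# Workaround 2: keep chart 1.21.0 but pin the operator image back to 1.20.1.
helm repo add percona https://percona.github.io/percona-helm-charts/
helm upgrade mongodb-operator percona/psmdb-operator -n psmdb \
  --version 1.21.0 --set image.tag=1.20.1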

@GianniC ok, so you just updated the Operator and did not update the DB cluster (did not bump the CR version)? OK, we will try the same.

We did update the crVersion. The CR I’ve posted above was taken after the rollback.

I have reproduced the problem and created a Jira task. We will release a hotfix soon.

I’ve added my cr.yaml to the original post, thank you for your support.

thank you for reporting :slight_smile:

Hi @Pim @GianniC @dimitrib, we have released a hotfix: Percona Operator for MongoDB 1.21.1 (2025-10-30). Thanks for your help!
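
For Helm users, picking up the hotfix should be a plain chart upgrade; the sketch below assumes the chart version tracks the operator release (verify with helm search repo psmdb-operator) and reuses the placeholder release name from the workaround above:

helm repo update
helm upgrade mongodb-operator percona/psmdb-operator -n psmdb --version 1.21.1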

Thank you very much, I can confirm this hotfix has resolved my issues. Thank you for your rapid attention and for providing this service!

Wow … that was fast. We need to wait for Red Hat to make it available in the OpenShift Marketplace (OperatorHub) before we can test it.