Percona Operator cannot release backup lock – all subsequent backups stuck in “waiting”

Description:

The Percona Server for MongoDB Operator becomes unable to perform new backups; every new backup fails with the message

“Another backup is holding the lock.”

There appears to be no way to clear this backup lock, even after deleting related Kubernetes resources and restarting the operator.

Steps to Reproduce:

    1. Create manual backup

       apiVersion: psmdb.percona.com/v1
       kind: PerconaServerMongoDBBackup
       metadata:
         name: manual-BACKUPVERSION
       spec:
         psmdbCluster: x-y-mongodb-cluster
         storageName: x-y-backup

    2. Restore manual backup

       apiVersion: psmdb.percona.com/v1
       kind: PerconaServerMongoDBRestore
       metadata:
         name: manual-restore
       spec:
         clusterName: x-y-mongodb-cluster
         backupName: manual-2025-10-16-10-45-44

    3. Attempt to run additional backups
       Manual backup or backup by cron job (see the sketch below).
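For step 3, one more manual backup can be applied the same way; a minimal sketch, assuming the namespace is x-y (the backup name manual-retry is just a placeholder):

# create one more manual backup and watch its status
kubectl apply -n x-y -f - <<EOF
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBBackup
metadata:
  name: manual-retry
spec:
  psmdbCluster: x-y-mongodb-cluster
  storageName: x-y-backup
EOF

kubectl get perconaservermongodbbackups.psmdb.percona.com -n x-y -w

A scheduled (cron) backup is configured through the backup.tasks section of the cluster CR instead.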

Version:

  • Percona Operator for MongoDB: 1.20.1
  • Platform: Red Hat OpenShift on AWS (ROSA)
  • OpenShift Version: 4.19.14

Logs:

INFO    Acquiring the backup lock       {"controller": "psmdbbackup-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBBackup", "PerconaServerMongoDBBackup": {"name":"manual-2025-10-20-15-11-13","namespace":"x-y"}, "namespace": "x-y", "name": "manual-2025-10-20-15-11-13", "reconcileID": "xxx"}

INFO    Another backup is holding the lock      {"controller": "psmdbbackup-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBBackup", "PerconaServerMongoDBBackup": {"name":"manual-2025-10-20-15-11-13","namespace":"x-y"}, "namespace": "x-y", "name": "manual-2025-10-20-15-11-13", "reconcileID": "xxx", "holder": "manual-2025-10-16-10-45-44-776ab071-f45d-4d11-a45c-8f04d0e2f20b"}

Expected Result:

  • New backups should start and complete successfully after the previous backup/restore has finished.

Actual Result:

  • All subsequent backups enter the waiting or error state, and the operator logs show:

    kubectl get perconaservermongodbbackups.psmdb.percona.com
    
    | **NAME**                         | **CLUSTER**         | **STORAGE** | **DESTINATION** | **TYPE** | **STATUS** | **COMPLETED** | **AGE** |
    | -------------------------------- | ------------------- | ----------- | --------------- | -------- | ---------- | ------------- | ------- |
    | cron-mongod-20251020130000-5hl7x | x-y-mongodb-cluster | x-y-backup  | —               | —        | error    | —             | 85 m    |
    | cron-mongod-20251020140000-zg9lb | x-y-mongodb-cluster | x-y-backup  | —               | —        | error    | —             | 25 m    |
    | manual-2025-10-20-15-11-13       | —                   | x-y-backup  | —               | —        | error    | —             | 73 m    |
    | manual-2025-10-20-16-24-33       | —                   | x-y-backup  | —               | —        | waiting  | —             | 30 s    |
    
    
    kubectl logs percona-server-mongodb-operator-x-y
    
    INFO    Acquiring the backup lock
    INFO    Another backup is holding the lock {"holder": "manual-2025-10-16-10-45-44-776ab071-f45d-4d11-a45c-8f04d0e2f20b"}
    

    Even after deleting all backup and restore resources, the lock remains.

Additional Information:

  • Deleted the perconaservermongodbbackups.psmdb.percona.com backup CR

  • Deleted the perconaservermongodbrestores.psmdb.percona.com restore CR

  • Deleted all related PVCs

  • Deleted the MongoDB StatefulSet

  • Restarted the percona-server-mongodb-operator deployment and pod

  • Verified no entries exist in (see the sketch below for one way to check):

    • db.pbmLock

    • db.pbmLockOp

Despite these steps, new backups continue to report “Another backup is holding the lock”.
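
One way to run that check, a sketch with placeholder pod name, container, and credentials (PBM stores its lock documents in the admin database of the cluster):

# exec into a mongod pod and dump PBM's lock collections; all names and credentials below are placeholders
kubectl exec -n x-y x-y-mongodb-cluster-rs0-0 -c mongod -- \
  mongosh -u clusterAdmin -p '<password>' --authenticationDatabase admin --quiet --eval \
  'printjson(db.getSiblingDB("admin").pbmLock.find().toArray()); printjson(db.getSiblingDB("admin").pbmLockOp.find().toArray())'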

Hi,

That lock is a Kubernetes Lease. You should see psmdb-clusterName-backup-lock if you run kubectl get lease. Could you try deleting it, and see if that fixes the issue?
I haven’t been able to reproduce this problem on my side, unfortunately.
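For this setup, the check and cleanup might look like the following (namespace and cluster name taken from the report above):

# list the Leases in the cluster namespace and look for the backup lock
kubectl get lease -n x-y

# remove the stale lock; the lease name assumes the cluster is x-y-mongodb-cluster
kubectl delete lease psmdb-x-y-mongodb-cluster-backup-lock -n x-y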

Hi @bofh, could you please check your operator’s log and “grep” the ‘delete lease’ error?
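For example, something along these lines (deployment name and namespace assumed):

# grep the operator log for lease-related errors
kubectl logs deploy/percona-server-mongodb-operator -n x-y | grep -i lease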

Thanks @Sami_Ahlroos

I found the Kubernetes Lease with kubectl get lease and was able to delete it with kubectl delete lease.

Queued backup jobs were able to finish successfully and new jobs are running fine.

@Slava_Sarzhan
I have been able to reproduce the behavior again with the manual backup / restore procedure.

Here are the logs with the “release lease” error:

2025-10-21T15:58:21.220Z ERROR failed to release the lock {"controller": "psmdbbackup-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBBackup", "PerconaServerMongoDBBackup": {"name":"manual-2025-10-21-17-58-10","namespace":"x-y"}, "namespace": "x-y", "name": "manual-2025-10-21-17-58-10", "reconcileID": "049d05ee-08d6-4b34-a9be-d743bf66cac6", "error": "get lease: Lease.coordination.k8s.io \"psmdb–backup-lock\" not found", "errorVerbose": "Lease.coordination.k8s.io \"psmdb–backup-lock\" not found\nget lease\ngithub.com/percona/percona-server-mongodb-operator/pkg/k8s.ReleaseLease\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/k8s/lease.go:52\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodbbackup.(*ReconcilePerconaServerMongoDBBackup).Reconcile.func1\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodbbackup/perconaservermongodbbackup_controller.go:159\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodbbackup.(*ReconcilePerconaServerMongoDBBackup).Reconcile\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodbbackup/perconaservermongodbbackup_controller.go:221\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[…]).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[…]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:334\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[…]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:294\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[…]).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:255\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1700"}

Afterwards deleted the lease manually:
kubectl delete lease psmdb-x-y-mongodb-cluster-backup-lock

2025-10-21T16:03:58.338Z INFO Releasing backup lock {"controller": "psmdbbackup-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBBackup", "PerconaServerMongoDBBackup": {"name":"cron-x-y-mongod-20251021160000-qtl7k","namespace":"x-y"}, "namespace": "x-y", "name": "cron-x-y-mongod-20251021160000-qtl7k", "reconcileID": "9672d4fa-621f-4c0c-bd25-797b6f2c4fad", "lease": "psmdb-x-y-mongodb-cluster-backup-lock"}

@bofh, please check your RBAC and make sure that you have:

- apiGroups:
  - coordination.k8s.io
  resources:
  - leases
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
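
One quick way to verify this, a sketch assuming a namespace-scoped deployment in namespace x-y with the default operator service account name:

# should print "yes" if the operator's service account is allowed to delete Leases in its namespace
kubectl auth can-i delete leases.coordination.k8s.io \
  --as=system:serviceaccount:x-y:percona-server-mongodb-operator -n x-y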

I can’t reproduce it using GKE. We will test it on Red Hat OpenShift on AWS (ROSA) in the next few days. We will update you.

P.S. Do you have a cluster-wide deployment or a namespace-scoped one?

On ROSA, there is no cluster-wide deployment available; therefore, we use a namespace-scoped deployment.
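
If it helps, the scope can also be confirmed from the operator Deployment itself; a sketch (deployment name and namespace assumed):

# show the WATCH_NAMESPACE env entry; in the default manifests it points to the operator's own
# namespace for a namespace-scoped deployment and is left empty for a cluster-wide one
kubectl get deploy percona-server-mongodb-operator -n x-y \
  -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="WATCH_NAMESPACE")]}'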

Regarding the RBAC permissions: should these rights be granted to the operator account or to the service account used by the MongoDB service?

Currently, the operator account has these permissions, while the MongoDB service account (default) does not.

| Verb   | percona-server-mongodb-operator | default |
| ------ | ------------------------------- | ------- |
| get    | yes                             | no      |
| list   | yes                             | no      |
| watch  | yes                             | no      |
| create | yes                             | no      |
| update | yes                             | no      |
| patch  | yes                             | no      |
| delete | yes                             | no      |
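
For reference, a comparison like the one above can be produced with kubectl auth can-i; a sketch using the namespace and service-account names from this thread:

# check every Lease verb for both service accounts (namespace assumed to be x-y)
for sa in percona-server-mongodb-operator default; do
  echo "== $sa =="
  for verb in get list watch create update patch delete; do
    printf '%s: ' "$verb"
    kubectl auth can-i "$verb" leases.coordination.k8s.io \
      --as="system:serviceaccount:x-y:$sa" -n x-y
  done
done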

Hi @bofh

I wasn’t able to reproduce this issue on ROSA with OpenShift 4.19.14.
Could you share a bit more about how you deployed the operator? For example, did you install it from the repo, using Helm charts, or through OperatorHub (Community or Certified bundle)?

Just a heads-up — PSMDB Operator v1.20.1 hasn’t been tested with OpenShift 4.19.x. It’s only been verified on 4.14–4.18, so while that might not be the exact cause, it could lead to some unexpected issues.