PBM Physical Restore in K8s Not Working

Hey all,

So I was having issues with logical restores in K8s per the post here. I have since gotten helm chart 1.20.1 up and running with the operator and pbm agent 2.10.0. This did not resolve the issue per this post; it appears to still be a problem with a multi-replicaset setup and MongoDB > 6.x.

As a result, I have started looking into physical restores, which will likely be necessary in production anyway since they should be significantly faster. However, I am seeing some very odd behavior when attempting the physical restores.

Essentially, I kick off the restore with a cr.yaml file:

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBRestore
metadata:
  name: physical-restore-from-main-1
  namespace: psmdb-dev-reports
spec:
  clusterName: psmdb-dev-reports-psm
  storageName: s3-us-east-physical
  backupSource:
    type: physical
    destination: s3://<my_bucket>/physical/2025-07-14T20:50:22Z
    s3:
      credentialsSecret: psmdb-backup-s3
      bucket: <my_bucket>

Once I kick it off with kubectl apply -f deploy/backup/restore-physical.yaml -n psmdb-dev-reports, it sits for a while with no status change in kubectl get psmdb-restore -n psmdb-dev-reports. During this time, neither the operator nor the pbm agent generates any logs. After about 5 minutes the operator kills the mongos instances, as is to be expected.
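For reference, this is roughly how I watch it while it sits there; the operator deployment name and namespace are placeholders since that depends on how the operator chart was installed:

# Watch the restore object for a status change
kubectl get psmdb-restore physical-restore-from-main-1 -n psmdb-dev-reports -w

# Watch the pods in the namespace at the same time
kubectl get pods -n psmdb-dev-reports -w

# Tail the operator logs in parallel (adjust the deployment name/namespace to your install)
kubectl logs -f deployment/psmdb-operator -n psmdb-dev-reports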

After this there is still no status change, no further logs from the operator, and still nothing from the pbm agent via pbm logs -f. I assume it is doing the file restore in the background. After about another 5 minutes, the operator starts throwing "Waiting for statefulsets to be ready before restore" logs, similar to this:

2025-07-15T18:17:50.726Z    INFO    Waiting for statefulsets to be ready before restore    {"controller": "psmdbrestore-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBRestore", "PerconaServerMongoDBRestore": {"name":"physical-restore-from-main-1","namespace":"psmdb-dev-reports"}, "namespace": "psmdb-dev-reports", "name": "physical-restore-from-main-1", "reconcileID": "7973623e-ac14-4087-b2d5-ada7cdd37e59", "ready": false}
2025-07-15T18:17:55.727Z    INFO    Waiting for statefulsets to be ready before restore    {"controller": "psmdbrestore-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBRestore", "PerconaServerMongoDBRestore": {"name":"physical-restore-from-main-1","namespace":"psmdb-dev-reports"}, "namespace": "psmdb-dev-reports", "name": "physical-restore-from-main-1", "reconcileID": "8322868f-7960-4800-b7a3-6120ef2863c7", "ready": false}
2025-07-15T18:18:53.902Z    INFO    SmartUpdate    apply changes to secondary pod    {"controller": "psmdb-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDB", "PerconaServerMongoDB": {"name":"psmdb-dev-reports-psm","namespace":"psmdb-dev-reports"}, "namespace": "psmdb-dev-reports", "name": "psmdb-dev-reports-psm", "reconcileID": "6c946aa9-f1dc-457d-8f2a-65d963d1310e", "statefulset": "psmdb-dev-reports-psm-amfam", "replset": "amfam", "pod": "psmdb-dev-reports-psm-amfam-1"}
2025-07-15T18:19:34.797Z    INFO    Pod started    {"controller": "psmdb-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDB", "PerconaServerMongoDB": {"name":"psmdb-dev-reports-psm","namespace":"psmdb-dev-reports"}, "namespace": "psmdb-dev-reports", "name": "psmdb-dev-reports-psm", "reconcileID": "6c946aa9-f1dc-457d-8f2a-65d963d1310e", "pod": "psmdb-dev-reports-psm-amfam-1"}

At this point it starts restarting all of the pods, which I also think is to be expected. However, when the pods come back online, they are missing the backup-agent container. This results in the restore eventually failing, stating that there are no pbm agents available for the restore. I was following pbm logs up to the point of the pod being restarted and nothing was ever posted to it. I did, however, see a few errors in admin.pbmlog for each node of the RS noting msg: 'mark error during restore: check mongod binary: run: exec: "mongod": executable file not found in $PATH. stderr: '. Not sure if these were related to the restore though, as I was only able to look after the pods were restarted and mongos came back up and I could connect.
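For what it's worth, this is the kind of check I'm using to confirm the container is actually gone after the restart; the pod and statefulset names here are from my cluster, so adjust accordingly:

# Containers in one of the replset pods after the operator patches the statefulset
kubectl get pod psmdb-dev-reports-psm-amfam-0 -n psmdb-dev-reports -o jsonpath='{.spec.containers[*].name}{"\n"}'

# Same check against the statefulset template; backup-agent is no longer listed
kubectl get sts psmdb-dev-reports-psm-amfam -n psmdb-dev-reports -o jsonpath='{.spec.template.spec.containers[*].name}{"\n"}'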

The other really odd thing is that I have tried deleting pods, patching the helm chart, etc. in order to get the backup-agent container back. The only two things that seem to work are deleting the statefulsets or a complete terragrunt destroy/apply.
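Concretely, the statefulset workaround is just deleting them and letting the operator reconcile them back into shape, roughly as below; the rs0/cfg names follow the same <cluster>-<replset> pattern as the amfam statefulset from the logs, so treat them as examples:

# Delete the affected statefulsets; the operator recreates them with the backup-agent container present
kubectl delete sts psmdb-dev-reports-psm-rs0 psmdb-dev-reports-psm-amfam psmdb-dev-reports-psm-cfg -n psmdb-dev-reports

# Watch the pods come back
kubectl get pods -n psmdb-dev-reports -w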

We are running EKS on AWS, currently v1.32; the Percona helm chart is 1.20.1, and the physical restore was attempted with backup-agent 2.10.0 and 2.9.1. It should also be noted that I have tried numerous variations of the restore yaml, including removing the s3 section since all of that is defined under the named storage. Also, credentialsSecret: psmdb-backup-s3 is there with the correct AWS secrets.
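For completeness, the most stripped-down variation I have tried looks roughly like the following; the name is just an example, and the bucket/credentials come from the s3-us-east-physical storage defined in the helm values below:

cat <<'EOF' | kubectl apply -n psmdb-dev-reports -f -
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBRestore
metadata:
  name: physical-restore-from-main-2
spec:
  clusterName: psmdb-dev-reports-psm
  storageName: s3-us-east-physical
  backupSource:
    type: physical
    destination: s3://<my_bucket>/physical/2025-07-14T20:50:22Z
EOF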

Helm values for the mongodb deploy:

# Cluster DNS Suffix
# clusterServiceDNSSuffix: svc.cluster.local
# clusterServiceDNSMode: "Internal"

finalizers:
  - percona.com/delete-psmdb-pods-in-order

nameOverride: ""
fullnameOverride: ""

crVersion: 1.20.1
pause: false
unmanaged: false
unsafeFlags:
  tls: false
  replsetSize: false
  mongosSize: false
  terminationGracePeriod: false
  backupIfUnhealthy: false

enableVolumeExpansion: true

annotations: {}

multiCluster:
  enabled: false

updateStrategy: SmartUpdate
upgradeOptions:
  versionServiceEndpoint: https://check.percona.com
  apply: disabled
  schedule: "0 2 * * *"
  setFCV: false

image:
  repository: percona/percona-server-mongodb
  tag: 7.0.12-7
  # tag: 7.0.18-11
imagePullPolicy: Always

secrets:
  encryptionKey: psmdb-encryption-key
  users: psmdb-users-secrets

pmm:
  enabled: true
  image:
    repository: percona/pmm-client
    tag: 2.44.1
  serverHost: pmm.override.in.terraform.threatx.io

replsets:
  rs0:
    name: rs0
    size: 3
    affinity:
      antiAffinityTopologyKey: "kubernetes.io/hostname"
    podDisruptionBudget:
      maxUnavailable: 1
    expose:
      enabled: false
      type: ClusterIP
    resources:
      limits:
        cpu: "4"
        memory: "4G"
      requests:
        cpu: "1"
        memory: "1G"
    volumeSpec:
      pvc:
        storageClassName: gp3-mongo-standard-unencrypted
        resources:
          requests:
            storage: 25Gi
  
  amfam:
    name: amfam
    size: 3
    affinity:
      antiAffinityTopologyKey: "kubernetes.io/hostname"
    podDisruptionBudget:
      maxUnavailable: 1
    expose:
      enabled: false
      type: ClusterIP
    resources:
      limits:
        cpu: "4"
        memory: "4G"
      requests:
        cpu: "1"
        memory: "1G"
    volumeSpec:
      pvc:
        storageClassName: gp3-mongo-standard-unencrypted
        resources:
          requests:
            storage: 25Gi

sharding:
  enabled: true
  balancer:
    enabled: true

  configrs:
    size: 3
    affinity:
      antiAffinityTopologyKey: "kubernetes.io/hostname"
    podDisruptionBudget:
      maxUnavailable: 1
    expose:
      enabled: false
      type: ClusterIP
    resources:
      limits:
        cpu: "1"
        memory: "1G"
      requests:
        cpu: "300m"
        memory: "0.5G"
    volumeSpec:
      pvc:
        storageClassName: gp3-mongo-standard-unencrypted
        resources:
          requests:
            storage: 3Gi

  mongos:
    size: 3
    affinity:
      antiAffinityTopologyKey: "kubernetes.io/hostname"
    podDisruptionBudget:
      maxUnavailable: 1
    resources:
      limits:
        cpu: "4"
        memory: "4G"
      requests:
        cpu: "1"
        memory: "1G"
    expose:
      enabled: true
      exposeType: LoadBalancer
      servicePerPod: true
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-type: "external"
        service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip

backup:
  enabled: true
  image:
    repository: percona/percona-backup-mongodb
    tag: 2.9.1
  resources:
    limits:
      cpu: "3"
      memory: "3G"
    requests:
      cpu: "1"
      memory: "1G"
  storages:
    s3-us-east-logical:
      # main: true
      type: s3
      s3:
        bucket: derrived-from-terraform
        credentialsSecret: psmdb-backup-s3
        serverSideEncryption:
          kmsKeyID: derrived-from-terraform
          sseAlgorithm: aws:kms
        region: us-east-2
        prefix: "logical"
        storageClass: INTELLIGENT_TIERING
    s3-us-east-physical:
      main: true
      type: s3
      s3:
        bucket: derrived-from-terraform
        credentialsSecret: psmdb-backup-s3
        serverSideEncryption:
          kmsKeyID: derrived-from-terraform
          sseAlgorithm: aws:kms
        region: us-east-2
        prefix: "physical"
        storageClass: INTELLIGENT_TIERING
  pitr:
    enabled: false
    oplogOnly: false
  tasks:

Happy to provide any additional information if I can. If anyone has any thoughts, it would be greatly appreciated.

Ok, some more information from what I can gather, though I am not a Go expert.

It looks like the process does the following:

  1. Removes the mongos statefulset - makes sense, bring down mongos.
  2. Does some magic in the background, likely pulling data from AWS and prepping it for restore.
  3. Triggers a patch to the replset statefulsets. This appears to be where the problem resides. After the patch the pods come back, but without the backup-agent container. Once the statefulset is back, we can see that it has the init container but nothing for the actual backup-agent container. Our best guess at this point is that the operator waits 5 minutes for all of the statefulsets to be updated; with a multi-replicaset cluster this may take longer than 5 minutes, and it then fails or exits without restoring the statefulsets back to their original state.
2025-07-16T17:04:23.725Z  INFO  Waiting for statefulsets to be ready before restore
.....
 2025-07-16T17:09:53.855Z  INFO  Waiting for statefulsets to be ready before restore
2025-07-16T17:09:58.917Z	ERROR	failed to make restore	{"controller": "psmdbrestore-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBRestore", "PerconaServerMongoDBRestore": {"name":"physical-restore-from-main-1","namespace":"psmdb-dev-reports"}, "namespace": "psmdb-dev-reports", "name": "physical-restore-from-main-1", "reconcileID": "ffaf9619-720f-4e9e-be70-41f5abd497e3", "restore": "physical-restore-from-main-1", "backup": "", "error": "check if pbm agents are ready: get pbm status: command terminated with exit code 1", "errorVerbose": "command terminated with exit code 1\nget pbm status\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodbrestore.(*ReconcilePerconaServerMongoDBRestore).checkIfPBMAgentsReadyForPhysicalRestore.func2\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodbrestore/physical.go:1031\nk8s.io/client-go/util/retry.OnError.func1\n\t/go/pkg/mod/k8s.io/client-go@v0.33.0/util/retry/util.go:51\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection\n\t/go/pkg/mod/k8s.io/apimachinery@v0.33.0/pkg/util/wait/wait.go:150\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\t/go/pkg/mod/k8s.io/apimachinery@v0.33.0/pkg/util/wait/backoff.go:477\nk8s.io/client-go/util/retry.OnError\n\t/go/pkg/mod/k8s.io/client-go@v0.33.0/util/retry/util.go:50\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodbrestore.(*ReconcilePerconaServerMongoDBRestore).checkIfPBMAgentsReadyForPhysicalRestore\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodbrestore/physical.go:1014\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodbrestore.(*ReconcilePerconaServerMongoDBRestore).reconcilePhysicalRestore\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodbrestore/physical.go:130\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodbrestore.(*ReconcilePerconaServerMongoDBRestore).Reconcile\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodbrestore/perconaservermongodbrestore_controller.go:250\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:334\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:294\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:255\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1700\ncheck if pbm agents are ready\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodbrestore.(*ReconcilePerconaServerMongoDBRestore).reconcilePhysicalRestore\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodbrestore/physical.go:132\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodbrestore.(*ReconcilePerconaServerMongoDBRestore).Reconcile\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodbrestore/perconaservermongodbrestore_controller.go:250\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:334\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:294\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:255\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1700"}
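For anyone reproducing this, the wait loop can be watched from the outside with standard kubectl; the statefulset names below follow the <cluster>-<replset> pattern in my cluster, so adjust for yours:

# Rollout progress of the statefulsets the operator is waiting on
kubectl rollout status sts/psmdb-dev-reports-psm-rs0 -n psmdb-dev-reports
kubectl rollout status sts/psmdb-dev-reports-psm-amfam -n psmdb-dev-reports
kubectl rollout status sts/psmdb-dev-reports-psm-cfg -n psmdb-dev-reports

# Ready vs desired replicas across the namespace
kubectl get sts -n psmdb-dev-reports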

I am going to revert to an earlier version of the operator/helm charts, like 1.19.1, and see if it happens there as well, but I need to leave it in the current state for the moment while we analyze.

Ok, I have some more information.

It looks like what is happening is that the CRD patch makes changes to the mongod container by installing pbm, which is needed for the restore. That install goes to /opt/percona/pbm. However, if I try to execute that binary on a node that is currently in a bad state, I get the following GLIBC errors:

[mongodb@psmdb-dev-reports-psm-amfam-0 db]$ /opt/percona/pbm
/opt/percona/pbm: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /opt/percona/pbm)
/opt/percona/pbm: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /opt/percona/pbm)
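To sanity-check whether this is a base image mismatch, the glibc shipped in the running mongod container can be compared with the PBM image the binary comes from; the libc-executable trick and image tags below are just my guess at a quick check, not anything the operator itself does:

# glibc version inside the running mongod container (the pbm binary wants GLIBC_2.32/2.34)
kubectl exec psmdb-dev-reports-psm-amfam-0 -n psmdb-dev-reports -c mongod -- /lib64/libc.so.6

# glibc version in the PBM image the /opt/percona/pbm binary is copied from
docker run --rm --entrypoint /lib64/libc.so.6 percona/percona-backup-mongodb:2.9.1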

My best guess is that it is coming from this function in ./pkg/controller/perconaservermongodbrestore/physical.go:

func getPBMBinaryAndContainerForExec(pod *corev1.Pod) (string, string) {
        // Default: exec the pbm binary the operator drops into the mongod
        // container at /opt/percona/pbm during a physical restore.
        container := "mongod"
        pbmBinary := "/opt/percona/pbm"

        // Prefer the backup-agent sidecar when it is still in the pod spec;
        // there the pbm binary is on the PATH.
        for _, c := range pod.Spec.Containers {
                if c.Name == naming.ContainerBackupAgent {
                        return naming.ContainerBackupAgent, "pbm"
                }
        }

        return container, pbmBinary
}

Any thoughts?

Hi @dclark,

This happens because your PBM docker image and PSMDB docker image have different base images. Starting from v7.0.16, the base image was changed in the PSMDB docker image. Would it be possible for you to use PSMDB >= v7.0.16?
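If it helps, bumping just the image tag in the helm values and upgrading should be enough to test this; the release and chart names below are placeholders for however the chart is installed on your side:

# Move the cluster to a 7.0.16+ PSMDB image so mongod and PBM share a compatible base image
helm upgrade <release-name> percona/psmdb-db -n psmdb-dev-reports -f values.yaml --set image.tag=7.0.18-11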

It is a new reporting DB so I can def try. I may not be able to get to it today, but I'm happy to give it a whirl and see what happens. I will let you know.

Thanks!