PBM for MongoDB in K8s: Restores Failing with Agent > 2.5.0

Hey all,

Hoping someone can help me, as I have exhausted all ideas I can think of. I have Percona Server for MongoDB running in K8s, taking nightly backups. We want to create a reporting DB and restore our backups into that DB. I have the new reporting DB up using all the same passwords, AWS creds, etc., and a logical restore (kubectl apply -f deploy/backup/restore-logical.yaml) seems to work with backup-agent 2.5.0.

However, after a short while the mongos instances start failing because of a tripwire error. Upon research this looked to be a known issue, resolved in 2.9.x or 2.10.0. However, when I try running the restore after upgrading to 2.10.0, I keep getting the following errors:

Error: waiting for start: cluster failed: waiting for start: cluster failed: failed to ensure snapshot file 2025-02-11T00:00:21Z/amfam/metadata.json: get S3 object header: Forbidden: Forbidden
status code: 403, request id: 5WHY4MSZWQCM4PAH, host id: GVS2sxMgU2VsabSwVdydbzYyhQbWPMbzvthYBSdcnvk245G3nkrC7KZDOLItQ7AWz9+u5AwXJxI=

I have tried updating both the user policy and the bucket policy to accommodate this. They both already had s3:GetObject, s3:GetObjectAcl, etc., but I even moved to s3:* just to see if it resolved the issue, and still no luck. I have the KMS key ID, the AWS user info, etc. in secrets, and they match the cluster I am trying to restore from. In fact, I can just downgrade the pbm-agent to 2.5.0 and everything works fine from the restore perspective, but then I have the tripwire error issue.
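For reference, the user policy is roughly this shape (ARNs redacted; this is the pre-s3:* version, and the account ID is a placeholder). One thing I am not sure about: since the bucket uses SSE-KMS, I believe the caller also needs kms:Decrypt on the key for reads, or S3 returns 403 on HeadObject even when all the S3 actions are allowed:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::<my_bucket>"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:GetObjectAcl", "s3:PutObject"],
      "Resource": "arn:aws:s3:::<my_bucket>/*"
    },
    {
      "Effect": "Allow",
      "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
      "Resource": "arn:aws:kms:us-east-2:<account_id>:key/<my_key_id>"
    }
  ]
}
```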

Example restore YAML:

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBRestore
metadata:
  name: logical-restore-from-main-1
  namespace: psmdb-dev-reports
spec:
  clusterName: psmdb-dev-reports-psm
  storageName: s3-us-east-logical
  backupSource:
    destination: s3://<my_bucket>/logical/2025-02-11T00:00:21Z

pbm status

Cluster:
========
amfam:
  - psmdb-dev-reports-psm-amfam-0.psmdb-dev-reports-psm-amfam.psmdb-dev-reports.svc.cluster.local:27017 [P]: pbm-agent [v2.9.1] OK
  - psmdb-dev-reports-psm-amfam-1.psmdb-dev-reports-psm-amfam.psmdb-dev-reports.svc.cluster.local:27017 [S]: pbm-agent [v2.9.1] OK
  - psmdb-dev-reports-psm-amfam-2.psmdb-dev-reports-psm-amfam.psmdb-dev-reports.svc.cluster.local:27017 [S]: pbm-agent [v2.9.1] OK
rs0:
  - psmdb-dev-reports-psm-rs0-0.psmdb-dev-reports-psm-rs0.psmdb-dev-reports.svc.cluster.local:27017 [P]: pbm-agent [v2.9.1] OK
  - psmdb-dev-reports-psm-rs0-1.psmdb-dev-reports-psm-rs0.psmdb-dev-reports.svc.cluster.local:27017 [S]: pbm-agent [v2.9.1] OK
  - psmdb-dev-reports-psm-rs0-2.psmdb-dev-reports-psm-rs0.psmdb-dev-reports.svc.cluster.local:27017 [S]: pbm-agent [v2.9.1] OK
cfg:
  - psmdb-dev-reports-psm-cfg-0.psmdb-dev-reports-psm-cfg.psmdb-dev-reports.svc.cluster.local:27017 [P]: pbm-agent [v2.9.1] OK
  - psmdb-dev-reports-psm-cfg-1.psmdb-dev-reports-psm-cfg.psmdb-dev-reports.svc.cluster.local:27017 [S]: pbm-agent [v2.9.1] OK
  - psmdb-dev-reports-psm-cfg-2.psmdb-dev-reports-psm-cfg.psmdb-dev-reports.svc.cluster.local:27017 [S]: pbm-agent [v2.9.1] OK


PITR incremental backup:
========================
Status [OFF]

Currently running:
==================
(none)

Backups:
========
S3 us-east-2 s3:///<my_bucket>/logical
  Snapshots:
    2025-02-13T00:00:21Z 9.69MB <logical> [restore_to_time: 2025-02-13T00:01:47Z]
    2025-02-11T00:00:21Z 9.65MB <logical> [restore_to_time: 2025-02-11T00:01:37Z]

pbm logs

2025-07-09T13:39:28Z E [amfam/psmdb-dev-reports-psm-amfam-0.psmdb-dev-reports-psm-amfam.psmdb-dev-reports.svc.cluster.local:27017] [restore/2025-07-09T13:39:26.191458116Z] restore: failed to ensure snapshot file 2025-02-11T00:00:21Z/amfam/metadata.json: get S3 object header: Forbidden: Forbidden
	status code: 403, request id: 5WHY4MSZWQCM4PAH, host id: GVS2sxMgU2VsabSwVdydbzYyhQbWPMbzvthYBSdcnvk245G3nkrC7KZDOLItQ7AWz9+u5AwXJxI=
2025-07-09T13:39:28Z E [rs0/psmdb-dev-reports-psm-rs0-0.psmdb-dev-reports-psm-rs0.psmdb-dev-reports.svc.cluster.local:27017] [restore/2025-07-09T13:39:26.191458116Z] restore: waiting for start: cluster failed: failed to ensure snapshot file 2025-02-11T00:00:21Z/amfam/metadata.json: get S3 object header: Forbidden: Forbidden
	status code: 403, request id: 5WHY4MSZWQCM4PAH, host id: GVS2sxMgU2VsabSwVdydbzYyhQbWPMbzvthYBSdcnvk245G3nkrC7KZDOLItQ7AWz9+u5AwXJxI=
2025-07-09T13:39:28Z E [cfg/psmdb-dev-reports-psm-cfg-0.psmdb-dev-reports-psm-cfg.psmdb-dev-reports.svc.cluster.local:27017] [restore/2025-07-09T13:39:26.191458116Z] restore: waiting for start: cluster failed: waiting for start: cluster failed: failed to ensure snapshot file 2025-02-11T00:00:21Z/amfam/metadata.json: get S3 object header: Forbidden: Forbidden
	status code: 403, request id: 5WHY4MSZWQCM4PAH, host id: GVS2sxMgU2VsabSwVdydbzYyhQbWPMbzvthYBSdcnvk245G3nkrC7KZDOLItQ7AWz9+u5AwXJxI=

Any thoughts or help would be greatly appreciated.

Thank You

Hi @dclark,

Can you also try a selective restore with all databases:

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBRestore
spec:
  selective:
    withUsersAndRoles: true
    namespaces:
    - "test1.*"
    - "test2.*"
    - "test3.*"

As for PBM 2.10.0: it started using a newer version of the AWS SDK for Go (v2 instead of v1). But if there were a problem with your S3 or KMS credentials, you'd already see a 403 "storage check failed" in pbm status (or right after running pbm config --force-resync).
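One way to take PBM out of the picture is to replay the failing HeadObject call yourself, with the exact credentials from the restore cluster's secret. A sketch (the bucket/key placeholders match your error message; the key is the storage prefix plus the path from the log):

```
AWS_ACCESS_KEY_ID=<key_from_secret> AWS_SECRET_ACCESS_KEY=<secret_from_secret> \
aws s3api head-object \
  --bucket <my_bucket> \
  --key "logical/2025-02-11T00:00:21Z/amfam/metadata.json" \
  --region us-east-2
```

If that also returns 403, the problem is on the AWS side (policy or KMS). If it succeeds, the agent is not picking up those credentials.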

OK, so when I try doing a selective restore it errors with an unknown field spec.selective:

error when creating "deploy/backup/restore-logical.yaml": PerconaServerMongoDBRestore in version "v1" cannot be handled as a PerconaServerMongoDBRestore: strict decoding error: unknown field "spec.selective"

Through logging in AWS we did find that it is trying to read the bucket as general-eks-node-group. This is not the user defined in the secrets file, which leads me to believe it is not reading the secrets file, or that the info is not getting passed to the job that runs the restore.

I’ve also tried specifying the credentialsSecret and bucket in the s3 section, but same issue. However, when running pbm config on one of the backup nodes, the creds do appear to be in there.

Restore YAML with selective that fails because of spec.selective:

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBRestore
metadata:
  name: logical-restore-from-main-1
  namespace: psmdb-dev-reports
spec:
  selective:
    withUsersAndRoles: true
    namespaces:
    - "<customer_1>_data.*"
    - "<customer_2>_data.*"
  clusterName: psmdb-dev-reports-psm
  storageName: s3-us-east-logical
  backupSource:
    type: logical
    destination: s3://<my_bucket>/logical/2025-07-10T14:36:15Z
    s3:
      credentialsSecret: psmdb-dev-reports-psm-backup-s3
      bucket: <my_bucket>

Restore YAML that fails with the 403 error:

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBRestore
metadata:
  name: logical-restore-from-main-1
  namespace: psmdb-dev-reports
spec:
  clusterName: psmdb-dev-reports-psm
  storageName: s3-us-east-logical
  backupSource:
    type: logical
    destination: s3://<my_bucket>/logical/2025-07-10T14:36:15Z
    s3:
      credentialsSecret: psmdb-dev-reports-psm-backup-s3
      bucket: <my_bucket>

pbm config as run from a shell on one of the nodes:

storage:
  type: s3
  s3:
    provider: aws
    region: us-east-2
    forcePathStyle: true
    bucket: <my_bucket>
    prefix: logical
    credentials:
      access-key-id: '***'
      secret-access-key: '***'
    serverSideEncryption:
      sseAlgorithm: aws:kms
      kmsKeyID: '<my_key_id>'
      sseCustomerAlgorithm: ""
      sseCustomerKey: ""
    maxUploadParts: 10000
    storageClass: INTELLIGENT_TIERING
    insecureSkipTLSVerify: false
pitr:
  enabled: false
  compression: s2
backup:
  oplogSpanMin: 0
  compression: s2
restore: {}

The only other thing of note is that the operator lives in a different namespace than the Mongo cluster. Not sure why that should matter, and again, all seems to work well with 2.5.0, but something in 2.10.0 is preventing me from pulling in the right creds.

As always, any help that can be provided is much appreciated.

Thank You

OK, some more info: when running kubectl exec -it <pod_name> -n <namespace> -- env in the 2.10.0 deployment, I do not see:

AWS_ACCESS_KEY_ID=…
AWS_SECRET_ACCESS_KEY=…
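To double-check the secret side, I am verifying it roughly like this (the secret name and the backup-agent container name are what my cluster uses; the operator stores the keys as AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY):

```
kubectl get secret psmdb-dev-reports-psm-backup-s3 -n psmdb-dev-reports \
  -o jsonpath='{.data.AWS_ACCESS_KEY_ID}' | base64 -d
kubectl get pod <pod_name> -n psmdb-dev-reports \
  -o jsonpath='{.spec.containers[?(@.name=="backup-agent")].env[*].name}'
```

The first command shows what the secret actually holds; the second shows which env vars the backup-agent container was given.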

Looking a bit further, we appear to be using CR version 1.17.0, which may also be causing problems, so I am going to try upgrading that first and see what happens.

Hi, you should not update the backup agent on its own. Always update the operator version instead, as certain versions of the components are not supported together. I suggest updating the operator to the latest version, which will bump the backup agent as well, then retrying.
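Roughly, the sequence would look like this (paths are from the standard deploy bundle; the version numbers below are just examples, so pin them to whatever release you actually target):

```
# update the CRDs first
kubectl apply --server-side -f \
  https://raw.githubusercontent.com/percona/percona-server-mongodb-operator/v1.20.1/deploy/crd.yaml
# update the operator deployment to the matching image
kubectl -n <operator_namespace> set image deployment/percona-server-mongodb-operator \
  percona-server-mongodb-operator=percona/percona-server-mongodb-operator:1.20.1
# then bump spec.crVersion in your cr.yaml and let the operator
# roll out the matching backup-agent image
```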

Thanks, I’ve come to that realization. However, I am now trying to set up a new reporting cluster based on operator 1.20.1 and am hitting all sorts of issues. Is it best to post here or start a new post?

Hi, sorry to hear that. I suggest starting a new post. Also, if you have a reproducible error, you can open a bug directly at jira.percona.com.