Hey all,
So I was having issues with logical restores in K8s, per the post here. I have since gotten helm chart 1.20.1 up and running with the operator and PBM agent 2.10.0, but this did not resolve the issue, per this post. It appears that it is still a problem with a multi-replicaset setup and MongoDB > 6.x.
As a result, I have started looking into physical restores, which will likely be necessary in production anyway since they should be significantly faster. However, I am seeing some very odd behavior when attempting them.
Essentially, I kick off the restore with a cr.yaml file;
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBRestore
metadata:
  name: physical-restore-from-main-1
  namespace: psmdb-dev-reports
spec:
  clusterName: psmdb-dev-reports-psm
  storageName: s3-us-east-physical
  backupSource:
    type: physical
    destination: s3://<my_bucket>/physical/2025-07-14T20:50:22Z
    s3:
      credentialsSecret: psmdb-backup-s3
      bucket: <my_bucket>
Once I kick it off with kubectl apply -f deploy/backup/restore-physical.yaml -n psmdb-dev-reports, it sits for a while with no status change reported by kubectl get psmdb-restore -n psmdb-dev-reports. During this time, neither the operator nor the pbm agent generates any logs. After about 5 minutes the operator kills the mongos instances, as is to be expected.
After this there is still no status change, no further logs from the operator, and still nothing from the pbm agent via pbm logs -f. I assume it is doing the file restore in the background. After about another 5 minutes, the operator starts logging "Waiting for statefulsets to be ready before restore", similar to this;
2025-07-15T18:17:50.726Z INFO Waiting for statefulsets to be ready before restore {"controller": "psmdbrestore-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBRestore", "PerconaServerMongoDBRestore": {"name":"physical-restore-from-main-1","namespace":"psmdb-dev-reports"}, "namespace": "psmdb-dev-reports", "name": "physical-restore-from-main-1", "reconcileID": "7973623e-ac14-4087-b2d5-ada7cdd37e59", "ready": false}
2025-07-15T18:17:55.727Z INFO Waiting for statefulsets to be ready before restore {"controller": "psmdbrestore-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBRestore", "PerconaServerMongoDBRestore": {"name":"physical-restore-from-main-1","namespace":"psmdb-dev-reports"}, "namespace": "psmdb-dev-reports", "name": "physical-restore-from-main-1", "reconcileID": "8322868f-7960-4800-b7a3-6120ef2863c7", "ready": false}
2025-07-15T18:18:53.902Z INFO SmartUpdate apply changes to secondary pod {"controller": "psmdb-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDB", "PerconaServerMongoDB": {"name":"psmdb-dev-reports-psm","namespace":"psmdb-dev-reports"}, "namespace": "psmdb-dev-reports", "name": "psmdb-dev-reports-psm", "reconcileID": "6c946aa9-f1dc-457d-8f2a-65d963d1310e", "statefulset": "psmdb-dev-reports-psm-amfam", "replset": "amfam", "pod": "psmdb-dev-reports-psm-amfam-1"}
2025-07-15T18:19:34.797Z INFO Pod started {"controller": "psmdb-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDB", "PerconaServerMongoDB": {"name":"psmdb-dev-reports-psm","namespace":"psmdb-dev-reports"}, "namespace": "psmdb-dev-reports", "name": "psmdb-dev-reports-psm", "reconcileID": "6c946aa9-f1dc-457d-8f2a-65d963d1310e", "pod": "psmdb-dev-reports-psm-amfam-1"}
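For reference, this is roughly how I've been poking at the restore and PBM state while all of this is going on. The pod name is just an example from my cluster, and the pbm commands obviously only work while a backup-agent container is actually present:

# Watch the restore object for any state change, and dump its status/events
kubectl get psmdb-restore physical-restore-from-main-1 -n psmdb-dev-reports -w
kubectl describe psmdb-restore physical-restore-from-main-1 -n psmdb-dev-reports

# Ask PBM directly what it thinks is happening
kubectl exec psmdb-dev-reports-psm-rs0-0 -c backup-agent -n psmdb-dev-reports -- pbm status
kubectl exec psmdb-dev-reports-psm-rs0-0 -c backup-agent -n psmdb-dev-reports -- pbm logs --tail 50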
At this point it starts restarting all of the pods, which I also think is to be expected. However, when the pods come back online, they are missing the backup-agent container. This results in the restore eventually failing, stating that there are no pbm agents available for the restore. I was following pbm logs right up until the pods were restarted and nothing was ever posted there. I did, however, see a few errors in admin.pbmlog for each node of the RS noting msg: 'mark error during restore: check mongod binary: run: exec: "mongod": executable file not found in $PATH. stderr: '. I'm not sure whether these were related to the restore, though, as I was only able to look after the pods were restarted and mongos came back up and I could connect.
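For what it's worth, this is how I'm checking which containers a pod actually came back with after the restart (using one of the pod names from the operator logs above); backup-agent is simply not in the list:

# List container names and images for one of the restarted replset pods
kubectl get pod psmdb-dev-reports-psm-amfam-1 -n psmdb-dev-reports \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.image}{"\n"}{end}'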
The other really odd thing is that I have tried deleting pods, patching the helm chart, etc. to get the backup-agent container back. The only two things that seem to work are deleting the statefulsets or a complete terragrunt destroy/apply.
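Concretely, the statefulset route looks roughly like this; the rs0/amfam names match what the operator creates in my cluster, and the -cfg one is my assumption for the config replset:

# Delete the statefulsets and let the operator recreate them (the PVCs survive,
# so the pods come back on the existing volumes, this time with backup-agent)
kubectl delete statefulset psmdb-dev-reports-psm-rs0 psmdb-dev-reports-psm-amfam psmdb-dev-reports-psm-cfg \
  -n psmdb-dev-reports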
We are running EKS on AWS, currently v1.32; the Percona helm chart is 1.20.1, and the physical restore was attempted with backup-agent 2.10.0 and 2.9.1. It should also be noted that I have tried numerous variations of the restore yaml, including removing the s3 section, since all of that is already defined in the storage referenced by storageName (a sketch of that variant is below). Also, credentialsSecret: psmdb-backup-s3 is there with the correct AWS secrets.
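For example, the most stripped-down variant I've tried just points at the storage and the backup destination, roughly like this (name bumped so it doesn't collide with the earlier attempt):

kubectl apply -n psmdb-dev-reports -f - <<'EOF'
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBRestore
metadata:
  name: physical-restore-from-main-2
spec:
  clusterName: psmdb-dev-reports-psm
  storageName: s3-us-east-physical
  backupSource:
    type: physical
    destination: s3://<my_bucket>/physical/2025-07-14T20:50:22Z
EOF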
Helm values for the mongodb deploy;
# Cluster DNS Suffix
# clusterServiceDNSSuffix: svc.cluster.local
# clusterServiceDNSMode: "Internal"

finalizers:
  - percona.com/delete-psmdb-pods-in-order

nameOverride: ""
fullnameOverride: ""

crVersion: 1.20.1
pause: false
unmanaged: false

unsafeFlags:
  tls: false
  replsetSize: false
  mongosSize: false
  terminationGracePeriod: false
  backupIfUnhealthy: false

enableVolumeExpansion: true

annotations: {}

multiCluster:
  enabled: false

updateStrategy: SmartUpdate
upgradeOptions:
  versionServiceEndpoint: https://check.percona.com
  apply: disabled
  schedule: "0 2 * * *"
  setFCV: false

image:
  repository: percona/percona-server-mongodb
  tag: 7.0.12-7
  # tag: 7.0.18-11

imagePullPolicy: Always

secrets:
  encryptionKey: psmdb-encryption-key
  users: psmdb-users-secrets

pmm:
  enabled: true
  image:
    repository: percona/pmm-client
    tag: 2.44.1
  serverHost: pmm.override.in.terraform.threatx.io

replsets:
  rs0:
    name: rs0
    size: 3
    affinity:
      antiAffinityTopologyKey: "kubernetes.io/hostname"
    podDisruptionBudget:
      maxUnavailable: 1
    expose:
      enabled: false
      type: ClusterIP
    resources:
      limits:
        cpu: "4"
        memory: "4G"
      requests:
        cpu: "1"
        memory: "1G"
    volumeSpec:
      pvc:
        storageClassName: gp3-mongo-standard-unencrypted
        resources:
          requests:
            storage: 25Gi
  amfam:
    name: amfam
    size: 3
    affinity:
      antiAffinityTopologyKey: "kubernetes.io/hostname"
    podDisruptionBudget:
      maxUnavailable: 1
    expose:
      enabled: false
      type: ClusterIP
    resources:
      limits:
        cpu: "4"
        memory: "4G"
      requests:
        cpu: "1"
        memory: "1G"
    volumeSpec:
      pvc:
        storageClassName: gp3-mongo-standard-unencrypted
        resources:
          requests:
            storage: 25Gi

sharding:
  enabled: true
  balancer:
    enabled: true
  configrs:
    size: 3
    affinity:
      antiAffinityTopologyKey: "kubernetes.io/hostname"
    podDisruptionBudget:
      maxUnavailable: 1
    expose:
      enabled: false
      type: ClusterIP
    resources:
      limits:
        cpu: "1"
        memory: "1G"
      requests:
        cpu: "300m"
        memory: "0.5G"
    volumeSpec:
      pvc:
        storageClassName: gp3-mongo-standard-unencrypted
        resources:
          requests:
            storage: 3Gi
  mongos:
    size: 3
    affinity:
      antiAffinityTopologyKey: "kubernetes.io/hostname"
    podDisruptionBudget:
      maxUnavailable: 1
    resources:
      limits:
        cpu: "4"
        memory: "4G"
      requests:
        cpu: "1"
        memory: "1G"
    expose:
      enabled: true
      exposeType: LoadBalancer
      servicePerPod: true
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-type: "external"
        service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip

backup:
  enabled: true
  image:
    repository: percona/percona-backup-mongodb
    tag: 2.9.1
  resources:
    limits:
      cpu: "3"
      memory: "3G"
    requests:
      cpu: "1"
      memory: "1G"
  storages:
    s3-us-east-logical:
      # main: true
      type: s3
      s3:
        bucket: derrived-from-terraform
        credentialsSecret: psmdb-backup-s3
        serverSideEncryption:
          kmsKeyID: derrived-from-terraform
          sseAlgorithm: aws:kms
        region: us-east-2
        prefix: "logical"
        storageClass: INTELLIGENT_TIERING
    s3-us-east-physical:
      main: true
      type: s3
      s3:
        bucket: derrived-from-terraform
        credentialsSecret: psmdb-backup-s3
        serverSideEncryption:
          kmsKeyID: derrived-from-terraform
          sseAlgorithm: aws:kms
        region: us-east-2
        prefix: "physical"
        storageClass: INTELLIGENT_TIERING
  pitr:
    enabled: false
    oplogOnly: false
  tasks:
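And since I've flipped between PBM 2.9.1 and 2.10.0, this is how I'm double-checking which backup-agent image the pods actually ended up with (only meaningful when the container exists at all):

# Print pod name and backup-agent image for every pod in the namespace
kubectl get pods -n psmdb-dev-reports \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[?(@.name=="backup-agent")].image}{"\n"}{end}'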
Happy to provide any additional information if I can. If anyone has any thoughts, they would be greatly appreciated.