Hi everyone,
Description:
I’m encountering an issue during the physical backup restore process, restoring to the same Percona cluster the backup was taken from.
At some point during the restore, the operator log shows a “FULL CLUSTER CRASH” message. However, the restore resource reports that the process completed successfully and shows a Ready status.
Despite this, all pods in the StatefulSet for my replica set go offline. After approximately 10–15 minutes, new pods begin to appear, reconnect to their PVCs, and the cluster eventually transitions to a Ready state with the restored data in place.
So, while the restore technically works, the process is concerning. Restoring just 1MB of data takes around 20 minutes (15 minutes of downtime full of Reconciler errors), which seems excessive. This raises concerns about how long it would take to restore 100GB or 1TB of data under similar conditions.
Has anyone experienced similar behavior? Is the “FULL CLUSTER CRASH” expected during restore, or is there something misconfigured in my setup?
Any insights or suggestions would be greatly appreciated.
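For reference, this is roughly how I spot the message in the operator logs (a minimal sketch; the deployment name assumes the default operator install in the same namespace):

# follow the operator logs and filter for the crash message
kubectl logs deploy/percona-server-mongodb-operator -n percona-mongodb -f | grep -i "full cluster crash"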
Steps to Reproduce:
- run a simple cluster:
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: test-cluster
  namespace: percona-mongodb
spec:
  clusterServiceDNSMode: Internal
  crVersion: 1.20.0
  image: percona/percona-server-mongodb:7.0.18-11
  secrets:
    users: test-cluster-secrets
    encryptionKey: test-cluster-mongodb-encryption-key
  replsets:
    - name: rs0
      size: 3
      terminationGracePeriodSeconds: 600
      configuration: |
        operationProfiling:
          mode: all
          slowOpThresholdMs: 100
          rateLimit: 10
      affinity:
        antiAffinityTopologyKey: "kubernetes.io/hostname"
      expose:
        enabled: true
        type: ClusterIP
      resources:
        limits:
          cpu: "300m"
          memory: "0.5G"
        requests:
          cpu: "300m"
          memory: "0.5G"
      volumeSpec:
        persistentVolumeClaim:
          resources:
            requests:
              storage: 3Gi
  sharding:
    enabled: false
  backup:
    enabled: true
    image: percona/percona-backup-mongodb:2.9.1
    storages:
      s3-storage:
        type: s3
        s3:
          bucket: percona-mongodb-backups
          region: us-east-1
          prefix: "test-cluster"
    tasks:
      - name: test-cluster-hourly-physical-backup
        enabled: true
        schedule: "0 * * * *"
        keep: 3
        storageName: s3-storage
        compressionType: gzip
        compressionLevel: 6
        type: physical
  pmm:
    enabled: false
    image: percona/pmm-client:2.44.1
- wait for the first available backup (see the status-check sketch after these steps)
- run restore:
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBRestore
metadata:
  name: restore-test-cluster-physical
spec:
  clusterName: test-cluster
  backupName: cron-test-cluster-<backup-name>
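For completeness, a minimal sketch of how I watch the backup and restore objects (assuming the psmdb-backup / psmdb-restore short names registered by the operator, and the names/namespace from the manifests above):

# wait until the scheduled backup reaches the "ready" state
kubectl get psmdb-backup -n percona-mongodb -w

# after applying the restore manifest, watch its state
kubectl get psmdb-restore restore-test-cluster-physical -n percona-mongodb -w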
Version:
Operator: 1.20.1
MongoDB: 7.0.18-11
PBM: 2.9.1
Logs:
Hi, the forum is not the best place to report errors. In any case, I see you mention operator version 1.20.1, but in the CR you are actually using 1.20.0.
We fixed an issue related to the backup/restore process in 1.20.1 which you might be hitting.
Please try 1.20.1, and if you still run into the issue, please open a bug report in Percona Jira.
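For example, something along these lines should bump it (a sketch; adjust the CR name and namespace to match your setup):

# set crVersion to match the deployed operator version
kubectl patch psmdb test-cluster -n percona-mongodb --type merge -p '{"spec":{"crVersion":"1.20.1"}}'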
Hi @Stateros,
Full cluster crash recovery after a physical restore is expected, because the replica set was already initialized but all pods went down after the restore. You’re right, though, that 10 minutes for recovery is not normal, but I can’t say what went wrong without knowing the Kubernetes cluster state after the restore.
Also note that physical restores require some preparation in the mongo pods. This preparation requires a rollout restart, and that takes some time too.
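If it helps, you can watch that restart finish with something like this (a sketch, using the StatefulSet name from your CR):

# wait for the post-restore rollout of the replset StatefulSet to complete
kubectl rollout status statefulset/test-cluster-rs0 -n percona-mongodb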
@Ivan_Groenewold @Ege_Gunes — thanks for your comments!
You were right about the version issue; that was my mistake. I had upgraded the operator but forgot to update the crVersion. I’ve corrected that, taken a new backup, and re-run the restore.
The result is the same: the restore process completes, but immediately afterward all three pods enter a Terminating state and remain there for about 10 minutes. I assume this duration corresponds to the terminationGracePeriodSeconds: 600 setting.
My question is: if the “full cluster crash” is expected, why couldn’t mongod terminate gracefully within the grace period? And how should the grace period relate to the dataset size?
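For context, these are the commands I use to capture state during that window (a sketch; same names and namespace as above). The pod description below was taken while the pod was stuck in Terminating:

# recent events in the namespace, newest last
kubectl get events -n percona-mongodb --sort-by=.lastTimestamp

# mongod logs from a pod while it is shutting down
kubectl logs test-cluster-rs0-0 -c mongod -n percona-mongodb --tail=200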
~ k describe pod test-cluster-rs0-0 -n percona-mongodb
Name: test-cluster-rs0-0
Namespace: percona-mongodb
Priority: 0
Node: ...
Start Time: Fri, 04 Jul 2025 11:19:10 -0400
Labels: app.kubernetes.io/component=mongod
app.kubernetes.io/instance=test-cluster
app.kubernetes.io/managed-by=percona-server-mongodb-operator
app.kubernetes.io/name=percona-server-mongodb
app.kubernetes.io/part-of=percona-server-mongodb
app.kubernetes.io/replset=rs0
apps.kubernetes.io/pod-index=0
controller-revision-hash=test-cluster-rs0-5bb58c446b
statefulset.kubernetes.io/pod-name=test-cluster-rs0-0
Annotations: percona.com/configuration-hash: abc3e579ffb3654cbba28f7432d503d5
percona.com/ssl-hash: 78d4558044a9ca70aa73efa2795c5226
percona.com/ssl-internal-hash: 89d18f7ea0c0f5e3ca1278de1de31de3
Status: Terminating (lasts <invalid>)
Termination Grace Period: 600s
IP: ...
IPs:
IP: ...
Controlled By: StatefulSet/test-cluster-rs0
Init Containers:
mongo-init:
Container ID: containerd://b9206c6e6eab87b8a2a02e151acce1138e788d00cb986535a15eee2d9012a73c
Image: percona/percona-server-mongodb-operator:1.20.1
Image ID: docker.io/percona/percona-server-mongodb-operator@sha256:d09453ce7886818edc1a808afbe600033d5eb6d6110c4e18cfd0e240b86bfb16
Port: <none>
Host Port: <none>
Command:
/init-entrypoint.sh
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 04 Jul 2025 11:19:17 -0400
Finished: Fri, 04 Jul 2025 11:19:17 -0400
Ready: True
Restart Count: 0
Limits:
cpu: 300m
memory: 500M
Requests:
cpu: 300m
memory: 500M
Environment:
AWS_STS_REGIONAL_ENDPOINTS: regional
...
Mounts:
/data/db from mongod-data (rw)
/opt/percona from bin (rw)
/var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qsrxw (ro)
pbm-init:
Container ID: containerd://bbd869da8ec2c34db24226cdc0f896eb66a9176ea5772d292c3d45acf6d27bda
Image: percona/percona-backup-mongodb:2.9.1
Image ID: docker.io/percona/percona-backup-mongodb@sha256:925baa9db7b467d8ec3214d32665eb0fb41e6891d960bf5720a37091ecac43ab
Port: <none>
Host Port: <none>
Command:
bash
-c
install -D /usr/bin/pbm /opt/percona/pbm && install -D /usr/bin/pbm-agent /opt/percona/pbm-agent
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 04 Jul 2025 11:19:18 -0400
Finished: Fri, 04 Jul 2025 11:19:18 -0400
Ready: True
Restart Count: 0
Environment:
AWS_STS_REGIONAL_ENDPOINTS: regional
...
Mounts:
/data/db from mongod-data (rw)
/opt/percona from bin (rw)
/var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qsrxw (ro)
Containers:
mongod:
Container ID: containerd://96f22ad019e2b57ff0b8e4806531fe73965a0ac51277b7d0b6cbdd8ead4e1f08
Image: percona/percona-server-mongodb:7.0.18-11
Image ID: docker.io/percona/percona-server-mongodb@sha256:24377a18737fe71a5f9050811017ea423196f8edfb8af6db68f877397e36719a
Port: 27017/TCP
Host Port: 0/TCP
Command:
/opt/percona/physical-restore-ps-entry.sh
Args:
--bind_ip_all
--auth
--dbpath=/data/db
--port=27017
--replSet=rs0
--storageEngine=wiredTiger
--relaxPermChecks
--sslAllowInvalidCertificates
--clusterAuthMode=x509
--tlsMode=preferTLS
--enableEncryption
--encryptionKeyFile=/etc/mongodb-encryption/encryption-key
--wiredTigerCacheSizeGB=0.25
--wiredTigerIndexPrefixCompression=true
--config=/etc/mongodb-config/mongod.conf
--quiet
State: Running
Started: Fri, 04 Jul 2025 11:19:19 -0400
Ready: True
Restart Count: 0
Limits:
cpu: 300m
memory: 500M
Requests:
cpu: 300m
memory: 500M
Liveness: exec [/opt/percona/mongodb-healthcheck k8s liveness --ssl --sslInsecure --sslCAFile /etc/mongodb-ssl/ca.crt --sslPEMKeyFile /tmp/tls.pem --startupDelaySeconds 7200] delay=60s timeout=10s period=30s #success=1 #failure=4
Readiness: exec [/opt/percona/mongodb-healthcheck k8s readiness --component mongod] delay=10s timeout=2s period=3s #success=1 #failure=8
Environment Variables from:
internal-test-cluster-users Secret Optional: false
Environment:
SERVICE_NAME: test-cluster
NAMESPACE: percona-mongodb
MONGODB_PORT: 27017
MONGODB_REPLSET: rs0
PBM_AGENT_MONGODB_USERNAME: <set to the key 'MONGODB_BACKUP_USER_ESCAPED' in secret 'internal-test-cluster-users'> Optional: false
PBM_AGENT_MONGODB_PASSWORD: <set to the key 'MONGODB_BACKUP_PASSWORD_ESCAPED' in secret 'internal-test-cluster-users'> Optional: false
PBM_AGENT_SIDECAR: true
PBM_AGENT_SIDECAR_SLEEP: 5
POD_NAME: test-cluster-rs0-0 (v1:metadata.name)
PBM_MONGODB_URI: mongodb://$(PBM_AGENT_MONGODB_USERNAME):$(PBM_AGENT_MONGODB_PASSWORD)@$(POD_NAME)
AWS_STS_REGIONAL_ENDPOINTS: regional
...
Mounts:
/data/db from mongod-data (rw)
/etc/mongodb-config from config (rw)
/etc/mongodb-encryption from test-cluster-mongodb-encryption-key (ro)
/etc/mongodb-secrets from test-cluster-mongodb-keyfile (ro)
/etc/mongodb-ssl from ssl (ro)
/etc/mongodb-ssl-internal from ssl-internal (ro)
/etc/pbm/ from pbm-config (ro)
/etc/users-secret from users-secret-file (rw)
/opt/percona from bin (rw)
/var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qsrxw (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
aws-iam-token:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 86400
mongod-data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: mongod-data-test-cluster-rs0-0
ReadOnly: false
test-cluster-mongodb-keyfile:
Type: Secret (a volume populated by a Secret)
SecretName: test-cluster-mongodb-keyfile
Optional: false
bin:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: test-cluster-rs0-mongod
Optional: true
test-cluster-mongodb-encryption-key:
Type: Secret (a volume populated by a Secret)
SecretName: test-cluster-mongodb-encryption-key
Optional: false
ssl:
Type: Secret (a volume populated by a Secret)
SecretName: test-cluster-ssl
Optional: false
ssl-internal:
Type: Secret (a volume populated by a Secret)
SecretName: test-cluster-ssl-internal
Optional: true
users-secret-file:
Type: Secret (a volume populated by a Secret)
SecretName: internal-test-cluster-users
Optional: false
pbm-config:
Type: Secret (a volume populated by a Secret)
SecretName: test-cluster-pbm-config
Optional: false
kube-api-access-qsrxw:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: dp/workload_type=mongodb-amd64
karpenter.k8s.aws/instance-family=m7i
karpenter.k8s.aws/instance-size=2xlarge
Tolerations: dp/workload_type=mongodb-amd64:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8m26s default-scheduler Successfully assigned percona-mongodb/test-cluster-rs0-0 to ip-10-1-83-77.eu-central-1.compute.internal
Normal Pulled 8m19s kubelet Successfully pulled image "percona/percona-server-mongodb-operator:1.20.1" in 60ms (60ms including waiting). Image size: 72178894 bytes.
Normal Created 8m19s kubelet Created container: mongo-init
Normal Started 8m19s kubelet Started container mongo-init
Normal Pulling 8m19s kubelet Pulling image "percona/percona-server-mongodb-operator:1.20.1"
Normal Started 8m18s kubelet Started container pbm-init
Normal Pulling 8m18s kubelet Pulling image "percona/percona-backup-mongodb:2.9.1"
Normal Pulled 8m18s kubelet Successfully pulled image "percona/percona-backup-mongodb:2.9.1" in 25ms (25ms including waiting). Image size: 113132493 bytes.
Normal Created 8m18s kubelet Created container: pbm-init
Normal Pulling 8m17s kubelet Pulling image "percona/percona-server-mongodb:7.0.18-11"
Normal Pulled 8m17s kubelet Successfully pulled image "percona/percona-server-mongodb:7.0.18-11" in 29ms (29ms including waiting). Image size: 273728402 bytes.
Normal Created 8m17s kubelet Created container: mongod
Normal Started 8m17s kubelet Started container mongod
Normal Killing 5m41s kubelet Stopping container mongod
If you think it’s better to create a Jira ticket for this, I’ll be happy to do it.