"FULL CLUSTER CRASH" during restore physical backup

Hi everyone,

Description:

I’m encountering an issue while restoring a physical backup onto the same Percona cluster it was taken from.

At some point during the restore, the operator logs a “FULL CLUSTER CRASH” message. However, the restore resource reports that the process completed successfully and shows a Ready status.

Despite this, all pods in the StatefulSet for my replicaset go offline. After approximately 10–15 minutes, new pods begin to appear, reconnect to their PVCs, and the cluster eventually transitions to a Ready state with the restored data in place.

So, while the restore technically works, the process is concerning. Restoring just 1MB of data takes around 20 minutes (roughly 15 minutes of downtime full of Reconciler errors), which seems excessive. It makes me wonder how long it would take to restore 100GB or 1TB of data under similar conditions.

Has anyone experienced similar behavior? Is the “FULL CLUSTER CRASH” expected during restore, or is there something misconfigured in my setup?

Any insights or suggestions would be greatly appreciated.

Steps to Reproduce:

  1. Run a simple cluster:
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: test-cluster
  namespace: percona-mongodb
spec:
  clusterServiceDNSMode: Internal
  crVersion: 1.20.0
  image: percona/percona-server-mongodb:7.0.18-11
  secrets:
    users: test-cluster-secrets
    encryptionKey: test-cluster-mongodb-encryption-key
  replsets:
    - name: rs0
      size: 3
      terminationGracePeriodSeconds: 600
      configuration: |
        operationProfiling:
          mode: all
          slowOpThresholdMs: 100
          rateLimit: 10
      affinity:
        antiAffinityTopologyKey: "kubernetes.io/hostname"
      expose:
        enabled: true
        type: ClusterIP
      resources:
        limits:
          cpu: "300m"
          memory: "0.5G"
        requests:
          cpu: "300m"
          memory: "0.5G"
      volumeSpec:
        persistentVolumeClaim:
          resources:
            requests:
              storage: 3Gi
  sharding:
    enabled: false

  backup:
    enabled: true
    image: percona/percona-backup-mongodb:2.9.1
    storages:
      s3-storage:
        type: s3
        s3:
          bucket: percona-mongodb-backups
          region: us-east-1
          prefix: "test-cluster"
    tasks:
      - name: test-cluster-hourly-physical-backup
        enabled: true
        schedule: "0 * * * *"
        keep: 3
        storageName: s3-storage
        compressionType: gzip
        compressionLevel: 6
        type: physical
  pmm:
    enabled: false
    image: percona/pmm-client:2.44.1
  2. Wait for the first scheduled backup to complete (or trigger one on demand; see the sketch after the restore manifest).
  3. Run the restore:
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBRestore
metadata:
  name: restore-test-cluster-physical
spec:
  clusterName: test-cluster
  backupName: cron-test-cluster-<backup-name>
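
If you don’t want to wait for the hourly schedule in step 2, an on-demand backup can be requested instead. A sketch, assuming the storage name from the cluster spec above (the resource name is just an illustration):
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBBackup
metadata:
  name: on-demand-test-cluster-physical   # hypothetical name
  namespace: percona-mongodb
spec:
  clusterName: test-cluster
  storageName: s3-storage
  type: physical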

Version:

Operator: 1.20.1
MongoDB: 7.0.18-11
PBM: 2.9.1

Logs:

Hi, the forum is not the best place to report errors. In any case, I see you mention operator version 1.20.1, but in the CR you are actually using 1.20.0.
We fixed an issue related to the backup/restore process in 1.20.1 which you might be hitting.
Please try 1.20.1, and if you still run into the issue, please open a bug report in the Percona JIRA.
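For reference, the relevant excerpt of the cluster CR to update once the operator Deployment itself is running 1.20.1 (a sketch, not a complete manifest):
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: test-cluster
  namespace: percona-mongodb
spec:
  crVersion: 1.20.1   # align with the deployed operator version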


Hi @Stateros,

Full cluster crash recovery after a physical restore is expected, because the replset was already initialized but all pods went down after the restore. You’re right that 10 minutes for recovery is not normal, but I can’t say what went wrong without knowing the Kubernetes cluster state after the restore.

Also note that physical restores require some preparation in the mongo pods. This preparation requires a rollout restart, which takes some time too.


@Ivan_Groenewold @Ege_Gunes — thanks for your comments!

You were right about the version issue — that was my mistake. I had upgraded the operator but forgot to update the crVersion. I’ve corrected that, taken a new backup, and re-ran the restore.

The result is the same: the restore process completes, but immediately afterward, all three pods enter a Terminating state and remain there for about 10 minutes. I assume this duration corresponds to the terminationGracePeriodSeconds: 600 setting.

My question is: if the “Full cluster crash” is expected, why couldn’t mongod terminate gracefully within the grace period? And how should the grace period relate to the dataset size?
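
For context, the only related knob I’m aware of is the per-replset grace period. A sketch of the relevant excerpt of the cluster spec, with a purely illustrative value I haven’t validated:
spec:
  replsets:
    - name: rs0
      # assumption: larger datasets may need a longer shutdown window after a physical restore
      terminationGracePeriodSeconds: 1800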

~ k describe pod test-cluster-rs0-0 -n percona-mongodb

Name:                      test-cluster-rs0-0
Namespace:                 percona-mongodb
Priority:                  0
Node:                      ...
Start Time:                Fri, 04 Jul 2025 11:19:10 -0400
Labels:                    app.kubernetes.io/component=mongod
                           app.kubernetes.io/instance=test-cluster
                           app.kubernetes.io/managed-by=percona-server-mongodb-operator
                           app.kubernetes.io/name=percona-server-mongodb
                           app.kubernetes.io/part-of=percona-server-mongodb
                           app.kubernetes.io/replset=rs0
                           apps.kubernetes.io/pod-index=0
                           controller-revision-hash=test-cluster-rs0-5bb58c446b
                           statefulset.kubernetes.io/pod-name=test-cluster-rs0-0
Annotations:               percona.com/configuration-hash: abc3e579ffb3654cbba28f7432d503d5
                           percona.com/ssl-hash: 78d4558044a9ca70aa73efa2795c5226
                           percona.com/ssl-internal-hash: 89d18f7ea0c0f5e3ca1278de1de31de3
Status:                    Terminating (lasts <invalid>)
Termination Grace Period:  600s
IP:                        ...
IPs:
  IP:           ...
Controlled By:  StatefulSet/test-cluster-rs0
Init Containers:
  mongo-init:
    Container ID:  containerd://b9206c6e6eab87b8a2a02e151acce1138e788d00cb986535a15eee2d9012a73c
    Image:         percona/percona-server-mongodb-operator:1.20.1
    Image ID:      docker.io/percona/percona-server-mongodb-operator@sha256:d09453ce7886818edc1a808afbe600033d5eb6d6110c4e18cfd0e240b86bfb16
    Port:          <none>
    Host Port:     <none>
    Command:
      /init-entrypoint.sh
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 04 Jul 2025 11:19:17 -0400
      Finished:     Fri, 04 Jul 2025 11:19:17 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     300m
      memory:  500M
    Requests:
      cpu:     300m
      memory:  500M
    Environment:
      AWS_STS_REGIONAL_ENDPOINTS:   regional
      ...
    Mounts:
      /data/db from mongod-data (rw)
      /opt/percona from bin (rw)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qsrxw (ro)
  pbm-init:
    Container ID:  containerd://bbd869da8ec2c34db24226cdc0f896eb66a9176ea5772d292c3d45acf6d27bda
    Image:         percona/percona-backup-mongodb:2.9.1
    Image ID:      docker.io/percona/percona-backup-mongodb@sha256:925baa9db7b467d8ec3214d32665eb0fb41e6891d960bf5720a37091ecac43ab
    Port:          <none>
    Host Port:     <none>
    Command:
      bash
      -c
      install -D /usr/bin/pbm /opt/percona/pbm && install -D /usr/bin/pbm-agent /opt/percona/pbm-agent
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 04 Jul 2025 11:19:18 -0400
      Finished:     Fri, 04 Jul 2025 11:19:18 -0400
    Ready:          True
    Restart Count:  0
    Environment:
      AWS_STS_REGIONAL_ENDPOINTS:   regional
      ...
    Mounts:
      /data/db from mongod-data (rw)
      /opt/percona from bin (rw)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qsrxw (ro)
Containers:
  mongod:
    Container ID:  containerd://96f22ad019e2b57ff0b8e4806531fe73965a0ac51277b7d0b6cbdd8ead4e1f08
    Image:         percona/percona-server-mongodb:7.0.18-11
    Image ID:      docker.io/percona/percona-server-mongodb@sha256:24377a18737fe71a5f9050811017ea423196f8edfb8af6db68f877397e36719a
    Port:          27017/TCP
    Host Port:     0/TCP
    Command:
      /opt/percona/physical-restore-ps-entry.sh
    Args:
      --bind_ip_all
      --auth
      --dbpath=/data/db
      --port=27017
      --replSet=rs0
      --storageEngine=wiredTiger
      --relaxPermChecks
      --sslAllowInvalidCertificates
      --clusterAuthMode=x509
      --tlsMode=preferTLS
      --enableEncryption
      --encryptionKeyFile=/etc/mongodb-encryption/encryption-key
      --wiredTigerCacheSizeGB=0.25
      --wiredTigerIndexPrefixCompression=true
      --config=/etc/mongodb-config/mongod.conf
      --quiet
    State:          Running
      Started:      Fri, 04 Jul 2025 11:19:19 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     300m
      memory:  500M
    Requests:
      cpu:      300m
      memory:   500M
    Liveness:   exec [/opt/percona/mongodb-healthcheck k8s liveness --ssl --sslInsecure --sslCAFile /etc/mongodb-ssl/ca.crt --sslPEMKeyFile /tmp/tls.pem --startupDelaySeconds 7200] delay=60s timeout=10s period=30s #success=1 #failure=4
    Readiness:  exec [/opt/percona/mongodb-healthcheck k8s readiness --component mongod] delay=10s timeout=2s period=3s #success=1 #failure=8
    Environment Variables from:
      internal-test-cluster-users  Secret  Optional: false
    Environment:
      SERVICE_NAME:                 test-cluster
      NAMESPACE:                    percona-mongodb
      MONGODB_PORT:                 27017
      MONGODB_REPLSET:              rs0
      PBM_AGENT_MONGODB_USERNAME:   <set to the key 'MONGODB_BACKUP_USER_ESCAPED' in secret 'internal-test-cluster-users'>      Optional: false
      PBM_AGENT_MONGODB_PASSWORD:   <set to the key 'MONGODB_BACKUP_PASSWORD_ESCAPED' in secret 'internal-test-cluster-users'>  Optional: false
      PBM_AGENT_SIDECAR:            true
      PBM_AGENT_SIDECAR_SLEEP:      5
      POD_NAME:                     test-cluster-rs0-0 (v1:metadata.name)
      PBM_MONGODB_URI:              mongodb://$(PBM_AGENT_MONGODB_USERNAME):$(PBM_AGENT_MONGODB_PASSWORD)@$(POD_NAME)
      AWS_STS_REGIONAL_ENDPOINTS:   regional
      ...
    Mounts:
      /data/db from mongod-data (rw)
      /etc/mongodb-config from config (rw)
      /etc/mongodb-encryption from test-cluster-mongodb-encryption-key (ro)
      /etc/mongodb-secrets from test-cluster-mongodb-keyfile (ro)
      /etc/mongodb-ssl from ssl (ro)
      /etc/mongodb-ssl-internal from ssl-internal (ro)
      /etc/pbm/ from pbm-config (ro)
      /etc/users-secret from users-secret-file (rw)
      /opt/percona from bin (rw)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qsrxw (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       True
  ContainersReady             True
  PodScheduled                True
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  mongod-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  mongod-data-test-cluster-rs0-0
    ReadOnly:   false
  test-cluster-mongodb-keyfile:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  test-cluster-mongodb-keyfile
    Optional:    false
  bin:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      test-cluster-rs0-mongod
    Optional:  true
  test-cluster-mongodb-encryption-key:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  test-cluster-mongodb-encryption-key
    Optional:    false
  ssl:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  test-cluster-ssl
    Optional:    false
  ssl-internal:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  test-cluster-ssl-internal
    Optional:    true
  users-secret-file:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  internal-test-cluster-users
    Optional:    false
  pbm-config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  test-cluster-pbm-config
    Optional:    false
  kube-api-access-qsrxw:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              dp/workload_type=mongodb-amd64
                             karpenter.k8s.aws/instance-family=m7i
                             karpenter.k8s.aws/instance-size=2xlarge
Tolerations:                 dp/workload_type=mongodb-amd64:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  8m26s  default-scheduler  Successfully assigned percona-mongodb/test-cluster-rs0-0 to ip-10-1-83-77.eu-central-1.compute.internal
  Normal  Pulled     8m19s  kubelet            Successfully pulled image "percona/percona-server-mongodb-operator:1.20.1" in 60ms (60ms including waiting). Image size: 72178894 bytes.
  Normal  Created    8m19s  kubelet            Created container: mongo-init
  Normal  Started    8m19s  kubelet            Started container mongo-init
  Normal  Pulling    8m19s  kubelet            Pulling image "percona/percona-server-mongodb-operator:1.20.1"
  Normal  Started    8m18s  kubelet            Started container pbm-init
  Normal  Pulling    8m18s  kubelet            Pulling image "percona/percona-backup-mongodb:2.9.1"
  Normal  Pulled     8m18s  kubelet            Successfully pulled image "percona/percona-backup-mongodb:2.9.1" in 25ms (25ms including waiting). Image size: 113132493 bytes.
  Normal  Created    8m18s  kubelet            Created container: pbm-init
  Normal  Pulling    8m17s  kubelet            Pulling image "percona/percona-server-mongodb:7.0.18-11"
  Normal  Pulled     8m17s  kubelet            Successfully pulled image "percona/percona-server-mongodb:7.0.18-11" in 29ms (29ms including waiting). Image size: 273728402 bytes.
  Normal  Created    8m17s  kubelet            Created container: mongod
  Normal  Started    8m17s  kubelet            Started container mongod
  Normal  Killing    5m41s  kubelet            Stopping container mongod

If you think it’s better to create a Jira ticket for this, I’ll be happy to do it.