Physical restore never finishes

Hi everyone

Here is the configuration of my test cluster for backup/restore/update testing:

Kubernetes: 1.29.10
Helm chart psmdb-operator: 1.18.0 (1.19.0 could not be deployed)
Helm chart psmdb-db: 1.18.0 (1.19.0 could not be deployed)
Mongod: 6.0.19-16-multi
Operator: 1.18.0
PBM: 2.7.0-multi (2.8.0-multi fails with an index error during restore)

In total, 9 pods for 1 replica set: 1 operator, 3 cfg, 3 rs0, 2 mongos routers.

So I want to do a complete restore of my MongoDB databases.
For that, I want to restore from an existing full physical backup of 22 GiB.

But when I launch a restore with a Kubernetes manifest, it works at the beginning and then never finishes: I left it running for 24 hours and the restore status is still running!

Here is the Kubernetes manifest file that I used:

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBRestore
metadata:
  name: xxxxx-restaure-physical-full-test-1
spec:
  clusterName: "psmdb-db-test"
  backupSource:
    type: "physical"
    destination: "s3://xxxxx-mongo-backup-physical-test/2025-02-04T05:30:21Z"
    s3:
      credentialsSecret: "pbm-mongodb-xxxxxx-xx"
      region: "fr-par"
      bucket: "xxxxx-mongo-backup-physical-test"
      endpointUrl: "https://s3.fr-par.xxx.xxxx"
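
I applied it with kubectl like this (restore.yaml is simply the local file name I gave to the manifest above):

kubectl apply -f restore.yaml -n percona-mongodb-test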

And here is what happens after applying the manifest:

  1. The restore goes to status waiting
  2. The PBM containers in each pod stop
  3. Both mongos routers also stop
  4. The restore goes to status running
  5. The restoration begins… (I follow the status with the command shown below)
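
Concretely, something like this (psmdb-restore is the short name of the PerconaServerMongoDBRestore resource; the full resource name works as well):

kubectl get psmdb-restore -n percona-mongodb-test -w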

Here is the kubectl describe output for my restore:

Name:         xxxxx-restaure-physical-full-test-1
Namespace:    percona-mongodb-test
Labels:       <none>
Annotations:  <none>
API Version:  psmdb.percona.com/v1
Kind:         PerconaServerMongoDBRestore
Metadata:
  Creation Timestamp:  2025-02-04T16:53:06Z
  Generation:          1
  Resource Version:    41116260131
  UID:                 ed9749e0-6850-4b6f-b757-f81043750ed3
Spec:
  Backup Source:
    Destination:  s3://xxxxx-mongo-backup-physical-test/2025-02-04T05:30:21Z
    s3:
      Bucket:              xxxxx-mongo-backup-physical-test
      Credentials Secret:  pbm-mongodb-xxxxxx-xx
      Endpoint URL:        https://s3.fr-par.xxx.xxxx
      Region:              fr-par
    Type:                  physical
  Cluster Name:            psmdb-db-xxxxx
Status:
  Pbm Name:  2025-02-04T17:08:15.142876273Z
  State:     running
Events:      <none>

And now the psmdb-operator log:

2025-02-05T09:31:11.911Z	DEBUG	PBM restore status	{"controller": "psmdbrestore-controller", "object": {"name":"xxxxx-restaure-physical-full-test-1","namespace":"percona-mongodb-test"}, "namespace": "percona-mongodb-test", "name": "xxxxx-restaure-physical-full-test-1", "reconcileID": "b9491d27-98f2-4fdc-a57d-4e1bcc98c0d7", "status": {"type":"physical","opid":"","name":"2025-02-04T17:08:15.142876273Z","replsets":[{"name":"cfg","start_ts":0,"status":"done","last_transition_ts":1738689045,"first_write_ts":{"T":0,"I":0},"last_write_ts":{"T":0,"I":0},"node":"","conditions":null},{"name":"rs0","start_ts":0,"status":"down","last_transition_ts":1738688969,"first_write_ts":{"T":0,"I":0},"last_write_ts":{"T":0,"I":0},"node":"","conditions":null}],"compression":"","store":{"type":""},"size":0,"mongodb_version":"","fcv":"","start_ts":0,"last_transition_ts":1738688924,"first_write_ts":{"T":0,"I":0},"last_write_ts":{"T":0,"I":0},"hb":{"T":0,"I":0},"status":"running","conditions":null,"n":null,"pbm_version":"","balancer":""}}

Can you please help me understand what the problem is, and how I can stop the physical restore cleanly so the cluster returns to a normal operating state?

Thank you

Hi @fabien.hannecart,

More than 24 hours for 22GB doesn’t sound normal to me. What’s the status of your pods? If you exec into one of the mongod containers and run ps faux, do you see mongod and pbm-agent processes? Also, there’s a /tmp/pbm-agent.log in mongod containers during physical restore, that might include helpful information to debug.
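
For example, something along these lines (adjust the pod name and namespace to yours; the mongod container is the one to check):

kubectl exec -it psmdb-db-test-rs0-0 -c mongod -n percona-mongodb-test -- ps faux
kubectl exec -it psmdb-db-test-rs0-0 -c mongod -n percona-mongodb-test -- tail -n 100 /tmp/pbm-agent.log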

Hi @Ege_Gunes

Thanks for your help.

I rebuilt a new test cluster, reimported the data with mongorestore, then took a new backup and tried a new physical restore.

Then I followed your recommendations, and in addition I enabled DEBUG logs on the operator.
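
In case it is useful, this is roughly how I enabled the DEBUG logs, assuming the psmdb-operator chart exposes a logLevel value (psmdb-operator is just my Helm release name):

helm upgrade psmdb-operator percona/psmdb-operator --version 1.18.0 -n percona-mongodb-test --reuse-values --set logLevel=DEBUG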

Here is what I observe. I will start with the operator logs; we see in particular this, with "status": "down" for rs0:

      {
        "name": "cfg",
        "start_ts": 0,
        "status": "done",
        "last_transition_ts": 1739870407,
        "first_write_ts": { "T": 0, "I": 0 },
        "last_write_ts": { "T": 0, "I": 0 },
        "node": "",
        "conditions": null
      },
      {
        "name": "rs0",
        "start_ts": 0,
        "status": "down",
        "last_transition_ts": 1739870339,
        "first_write_ts": { "T": 0, "I": 0 },
        "last_write_ts": { "T": 0, "I": 0 },
        "node": "",
        "conditions": null
      }
    ],

Then, on the cfg pods, if I run ps faux after more than an hour of operation, I get:

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
mongodb     5292  0.0  0.0  12256  3200 pts/0    Ss   09:19   0:00 bash
mongodb    11325  0.0  0.0  44792  3456 pts/0    R+   09:39   0:00  \_ ps faux
mongodb        1  0.0  0.0  12000  2432 ?        Ss   09:03   0:00 /bin/bash /opt/percona/physical-restore-ps-entry.sh --bind_ip_all --auth --dbpath=/data/db --port=27017 --replSet=cfg --storageEngine=wiredTi
mongodb     4906  0.0  0.0   4496  1408 ?        S    09:19   0:00 tail -n +1 -f /tmp/pbm-agent.log

In the pbm-agent log file, on one of them (cfg-0), we see:

2025-02-18T09:21:48.000+0000 I change stream was closed
2025-02-18T09:21:48.000+0000 D [restore/2025-02-18T09:17:43.919307039Z] hearbeats stopped
2025-02-18T09:21:48.000+0000 D [restore/2025-02-18T09:17:43.919307039Z] uploading ".pbm.restore/2025-02-18T09:17:43.919307039Z/rs.cfg/log/psmdb-db-devops-cfg-0.psmdb-db-devops-cfg.percona-mongodb-test.svc.cluster.local:27017.0.log" [size hint: -1 (unknown); part size: 268435456 (256.00MB)]
2025/02/18 09:21:48 Exit: <nil>
2025-02-18T09:21:48.000+0000 D [agentCheckup] deleting agent status

And the other cfg nodes show:

2025-02-18T09:21:48.000+0000 D [restore/2025-02-18T09:17:43.919307039Z] hearbeats stopped
2025-02-18T09:21:48.000+0000 D [restore/2025-02-18T09:17:43.919307039Z] uploading ".pbm.restore/2025-02-18T09:17:43.919307039Z/rs.cfg/log/psmdb-db-devops-cfg-1.psmdb-db-devops-cfg.percona-mongodb-test.svc.cluster.local:27017.0.log" [size hint: -1 (unknown); part size: 268435456 (256.00MB)]
2025-02-18T09:21:48.000+0000 D [agentCheckup] deleting agent status
2025-02-18T09:21:48.000+0000 E [pitr] init: get conf: get: server selection error: server selection timeout, current topology: { Type: ReplicaSetNoPrimary, Servers: [{ Addr: psmdb-db-devops-cfg-0.psmdb-db-devops-cfg.percona-mongodb-test.svc.cluster.local:27017, Type: Unknown, Last error: dial tcp 100.64.101.202:27017: connect: connection refused }, { Addr: psmdb-db-devops-cfg-1.psmdb-db-devops-cfg.percona-mongodb-test.svc.cluster.local:27017, Type: Unknown, Last error: dial tcp 100.65.172.8:27017: connect: connection refused }, { Addr: psmdb-db-devops-cfg-2.psmdb-db-devops-cfg.percona-mongodb-test.svc.cluster.local:27017, Type: Unknown, Last error: dial tcp 100.65.118.138:27017: connect: connection refused }, ] }
2025/02/18 09:21:48 Exit: <nil>

For rs0, the logs do not show anything in particular, and ps faux shows:

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
mongodb     3345  0.0  0.0  12132  3072 pts/0    Ss   09:16   0:00 bash
mongodb    17499  0.0  0.0  44792  3328 pts/0    R+   09:50   0:00  \_ ps faux
mongodb        1  0.0  0.0  12000  2688 ?        Ss   09:06   0:00 /bin/bash /opt/percona/physical-restore-ps-entry.sh --bind_ip_all --auth --dbpath=/data/db --port=27017 --replSet=rs0 --storageEngine=wiredTiger --relaxPermChecks --sslAllowInvalidCertificates --clusterAuthMode=x509 --tlsMode=preferTLS --shardsvr --enableEncryption --encryptionKeyFile=/etc/mongodb-encryption/encryption-key --wiredTigerCacheSizeGB=3.69 --wiredTigerIndexPrefixCompression=true --quiet
mongodb        8  0.2  0.3 2818288 54324 ?       Sl   09:06   0:06 /opt/percona/pbm-agent
mongodb     4078  0.0  0.0   4496  1408 ?        S    09:18   0:00 tail -n +1 -f /tmp/pbm-agent.log

And the logs:

2025-02-18T09:06:48.000+0000 I pbm-agent:
Version:   2.7.0
Platform:  linux/amd64
GitCommit: 957ac501f939b1b784fcdcf61dcc5b9bf171e0b9
GitBranch: release-2.7.0
BuildTime: 2024-10-02_12:53_UTC
GoVersion: go1.22.8
2025-02-18T09:06:48.000+0000 I starting PITR routine
2025-02-18T09:06:48.000+0000 I node: rs0/psmdb-db-devops-rs0-0.psmdb-db-devops-rs0.percona-mongodb-test.svc.cluster.local:27017
2025-02-18T09:06:48.000+0000 I conn level ReadConcern: majority; WriteConcern: majority
2025-02-18T09:06:48.000+0000 I listening for the commands
2025-02-18T09:06:59.000+0000 I got command resync <ts: 1739869619>, opid: 67b44db3435bc9c1c3b9059e
2025-02-18T09:06:59.000+0000 I got epoch {1739869618 2}
2025-02-18T09:06:59.000+0000 I [resync] not a member of the leader rs
2025-02-18T09:17:44.000+0000 I got command restore [name: 2025-02-18T09:17:43.919307039Z, snapshot: 2025-02-14T19:30:21Z] <ts: 1739870263>, opid: 67b45037f9aebd361e7d6655
2025-02-18T09:17:44.000+0000 I got epoch {1739870264 3}
2025-02-18T09:17:44.000+0000 I [restore/2025-02-18T09:17:43.919307039Z] backup: 2025-02-14T19:30:21Z
2025-02-18T09:17:44.000+0000 I [restore/2025-02-18T09:17:43.919307039Z] recovery started
2025-02-18T09:17:44.000+0000 D [restore/2025-02-18T09:17:43.919307039Z] port: 28086
2025-02-18T09:17:44.000+0000 D [restore/2025-02-18T09:17:43.919307039Z] uploading ".pbm.restore/2025-02-18T09:17:43.919307039Z/rs.rs0/node.psmdb-db-devops-rs0-0.psmdb-db-devops-rs0.percona-mongodb-test.svc.cluster.local:27017.hb" [size hint: 10 (10.00B); part size: 268435456 (256.00MB)]
2025-02-18T09:17:44.000+0000 D [restore/2025-02-18T09:17:43.919307039Z] uploading ".pbm.restore/2025-02-18T09:17:43.919307039Z/rs.rs0/rs.hb" [size hint: 10 (10.00B); part size: 268435456 (256.00MB)]
2025-02-18T09:17:44.000+0000 D [restore/2025-02-18T09:17:43.919307039Z] uploading ".pbm.restore/2025-02-18T09:17:43.919307039Z/cluster.hb" [size hint: 10 (10.00B); part size: 268435456 (256.00MB)]
2025-02-18T09:17:45.000+0000 D [restore/2025-02-18T09:17:43.919307039Z] mongod binary: mongod, version: v6.0.19-16
2025-02-18T09:17:45.000+0000 I [restore/2025-02-18T09:17:43.919307039Z] moving to state starting
2025-02-18T09:17:45.000+0000 D [restore/2025-02-18T09:17:43.919307039Z] uploading ".pbm.restore/2025-02-18T09:17:43.919307039Z/rs.rs0/node.psmdb-db-devops-rs0-0.psmdb-db-devops-rs0.percona-mongodb-test.svc.cluster.local:27017.starting" [size hint: 10 (10.00B); part size: 268435456 (256.00MB)]
2025-02-18T09:17:45.000+0000 I [restore/2025-02-18T09:17:43.919307039Z] waiting for cluster
2025-02-18T09:18:00.000+0000 D [restore/2025-02-18T09:17:43.919307039Z] converged to state starting
2025-02-18T09:18:00.000+0000 D [restore/2025-02-18T09:17:43.919307039Z] starting
2025-02-18T09:18:00.000+0000 I [restore/2025-02-18T09:17:43.919307039Z] moving to state running
2025-02-18T09:18:00.000+0000 D [restore/2025-02-18T09:17:43.919307039Z] uploading ".pbm.restore/2025-02-18T09:17:43.919307039Z/rs.rs0/node.psmdb-db-devops-rs0-0.psmdb-db-devops-rs0.percona-mongodb-test.svc.cluster.local:27017.running" [size hint: 10 (10.00B); part size: 268435456 (256.00MB)]
2025-02-18T09:18:00.000+0000 I [restore/2025-02-18T09:17:43.919307039Z] waiting for cluster
2025-02-18T09:18:15.000+0000 D [restore/2025-02-18T09:17:43.919307039Z] converged to state running
2025-02-18T09:18:15.000+0000 I [restore/2025-02-18T09:17:43.919307039Z] send to stopAgent chan
2025-02-18T09:18:16.000+0000 D [restore/2025-02-18T09:17:43.919307039Z] stop agents heartbeats
2025-02-18T09:18:16.000+0000 I [restore/2025-02-18T09:17:43.919307039Z] stopping mongod and flushing old data
2025-02-18T09:18:16.000+0000 D [restore/2025-02-18T09:17:43.919307039Z] shutdown server
2025-02-18T09:18:32.000+0000 D [restore/2025-02-18T09:17:43.919307039Z] waiting for the node to shutdown
2025-02-18T09:18:33.000+0000 D [restore/2025-02-18T09:17:43.919307039Z] remove old data
2025-02-18T09:18:33.000+0000 D [restore/2025-02-18T09:17:43.919307039Z] remove /data/db/collection-10635--4076834933549043333.wt
2025-02-18T09:18:33.000+0000 D [restore/2025-02-18T09:17:43.919307039Z] remove /data/db/index-13512--4076834933549043333.wt
2025-02-18T09:18:33.000+0000 D [restore/2025-02-18T09:17:43.919307039Z] remove /data/db/index-12439--4076834933549043333.wt
2025-02-18T09:18:33.000+0000 D [restore/2025-02-18T09:17:43.919307039Z] remove /data/db/index-10777--4076834933549043333.wt
2025-02-18T09:18:33.000+0000 D [restore/2025-02-18T09:17:43.919307039Z] remove /data/db/index-15148--4076834933549043333.wt
......
...

I am not including the complete logs because after this point it is just more remove lines, etc.

I still cannot understand or solve this problem! If you have any leads, I am interested. Do you think the problem comes from my actions, or is it a PBM problem that I cannot solve?