PBM restore takes long time

Team,

I’m doing a selective restore: restoring a single collection (6 GB) from a FULL backup (2.5 TB).
It has been running for 11+ hours (still progressing).

Is this expected behaviour? Please help me fix this.

command: pbm restore 2023-05-09T14:56:30Z --ns=ctr.trademarks.gecn 2>&1
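
For reference, this is how I am following progress (a minimal example; the flags are taken from the pbm 2.1 help, and the event value should match the restore name that pbm itself reports, so adjust both to your environment):

pbm status
pbm logs --tail=100 --severity=D --event=restore/2023-05-16T04:30:23.527718212Z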

[mongod@ip-10-80-11-253 ~]$ pbm status
Cluster:

shard1ReplSet:
  • shard1ReplSet/10.80.11.253:27038 [P]: pbm-agent v2.1.0 OK
configReplSet:
  • configReplSet/10.80.11.253:27039 [P]: pbm-agent v2.1.0 OK
shard3ReplSet:
  • shard3ReplSet/10.80.10.159:27038 [P]: pbm-agent v2.1.0 OK
shard2ReplSet:
  • shard2ReplSet/10.80.10.186:27038 [P]: pbm-agent v2.1.0 OK

PITR incremental backup:

Status [OFF]

Currently running:

Snapshot restore “2023-05-09T14:56:30Z”, started at 2023-05-16T04:30:23Z. Status: oplog restore. [op id: 646306df173cdb6959b600ac]

Backups:

S3 us-east-1 s3://xxxxxxxxxx/pbm/backup/
Snapshots:
2023-05-12T07:58:25Z 92.68KB <logical, selective> [restore_to_time: 2023-05-12T07:58:33Z]
2023-05-12T07:49:04Z 6.53GB <logical, selective> [restore_to_time: 2023-05-12T07:54:17Z]
2023-05-12T06:56:45Z 75.56KB <logical, selective> [restore_to_time: 2023-05-12T06:56:52Z]
2023-05-11T14:17:04Z 752.69GB <logical, selective> [restore_to_time: 2023-05-11T17:33:36Z]
2023-05-09T14:56:30Z 2.68TB [restore_to_time: 2023-05-10T01:41:15Z]


[mongod@ip-10-80-11-253 ~]$ pbm logs
2023-05-16T04:30:23Z I [shard1ReplSet/10.80.11.253:27038] got epoch {1684211244 3}
2023-05-16T04:30:23Z I [shard1ReplSet/10.80.11.253:27038] [restore/2023-05-16T04:30:23.527718212Z] backup: 2023-05-09T14:56:30Z
2023-05-16T04:30:23Z I [shard1ReplSet/10.80.11.253:27038] [restore/2023-05-16T04:30:23.527718212Z] restore started
2023-05-16T04:30:24Z I [configReplSet/10.80.11.253:27039] [restore/2023-05-16T04:30:23.527718212Z] moving to state running
2023-05-16T04:30:24Z I [shard2ReplSet/10.80.10.186:27038] [restore/2023-05-16T04:30:23.527718212Z] moving to state running
2023-05-16T04:30:24Z I [shard1ReplSet/10.80.11.253:27038] [restore/2023-05-16T04:30:23.527718212Z] moving to state running
2023-05-16T04:30:27Z I [configReplSet/10.80.11.253:27039] [restore/2023-05-16T04:30:23.527718212Z] moving to state dumpDone
2023-05-16T04:32:47Z I [shard3ReplSet/10.80.10.159:27038] [restore/2023-05-16T04:30:23.527718212Z] moving to state dumpDone
2023-05-16T04:32:52Z I [shard2ReplSet/10.80.10.186:27038] [restore/2023-05-16T04:30:23.527718212Z] moving to state dumpDone
2023-05-16T04:32:55Z I [shard3ReplSet/10.80.10.159:27038] [restore/2023-05-16T04:30:23.527718212Z] starting oplog replay
2023-05-16T04:32:56Z I [shard1ReplSet/10.80.11.253:27038] [restore/2023-05-16T04:30:23.527718212Z] moving to state dumpDone
2023-05-16T04:32:57Z I [shard2ReplSet/10.80.10.186:27038] [restore/2023-05-16T04:30:23.527718212Z] starting oplog replay
2023-05-16T04:32:57Z I [shard1ReplSet/10.80.11.253:27038] [restore/2023-05-16T04:30:23.527718212Z] starting oplog replay
2023-05-16T04:32:58Z I [configReplSet/10.80.11.253:27039] [restore/2023-05-16T04:30:23.527718212Z] starting oplog replay
2023-05-16T04:33:53Z I [shard3ReplSet/10.80.10.159:27038] [restore/2023-05-16T04:30:23.527718212Z] oplog replay finished on {1683682875 266}
2023-05-16T04:33:53Z I [shard3ReplSet/10.80.10.159:27038] [restore/2023-05-16T04:30:23.527718212Z] restore finished successfully
2023-05-16T04:35:58Z I [shard2ReplSet/10.80.10.186:27038] [restore/2023-05-16T04:30:23.527718212Z] oplog replay finished on {1683682875 261}
2023-05-16T04:35:58Z I [shard2ReplSet/10.80.10.186:27038] [restore/2023-05-16T04:30:23.527718212Z] restore finished successfully
2023-05-16T04:40:36Z I [shard1ReplSet/10.80.11.253:27038] [restore/2023-05-16T04:30:23.527718212Z] oplog replay finished on {1683682875 267}
2023-05-16T04:40:36Z I [shard1ReplSet/10.80.11.253:27038] [restore/2023-05-16T04:30:23.527718212Z] restore finished successfully

Hi, I don’t see any problems in the logs you provided. Does the data from the restored collection look complete? You might want to open a bug in our jira.percona.com so the dev team can take a look.
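
For example, a quick document count against the target namespace through mongos would show whether the data actually arrived (host and port below are placeholders, and the namespace is the one from your --ns option):

mongosh "mongodb://<mongos-host>:<mongos-port>" --eval 'db.getSiblingDB("ctr").getCollection("trademarks.gecn").countDocuments({})'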

Hi @Ivan_Groenewold

No, the restore of the specific collection did not complete.
I can see from the logs that the data shards finished their restore in about 5 minutes, but the config server is stuck on the oplog replay.
That restore had been running for 24+ hours and I had to kill it.
I have already opened a bug: [PBM-1120] PBM restore takes long time - Percona JIRA
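
In case it is useful for whoever picks up the ticket, the per-replica-set state of the restore can also be read from PBM’s own metadata on the config server. The collection and field names below are assumptions based on what I see with PBM 2.1 and may differ between versions:

mongosh "mongodb://10.80.11.253:27039" --eval 'db.getSiblingDB("admin").pbmRestores.find({name: "2023-05-16T04:30:23.527718212Z"}, {name: 1, status: 1, "replsets.name": 1, "replsets.status": 1}).toArray()'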

Hi @Dmytro_Zghoba,

Can you please guide me on how to proceed with this?

I am unable to restore a specific collection/DB from the full backup, and this affects our prod environment.
As per the logs, the PBM agent has finished, but it did not update the restore status.
The status has shown “Status: oplog restore” for the past 3 days, and nothing further is revealed in the logs.
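
To gather more detail in the meantime, I am pulling debug-level logs for the config-server agent only and checking its service logs on that host (substitute the name of the restore that is currently stuck; the systemd unit name is an assumption based on a default package install):

pbm logs --tail=200 --severity=D --node=configReplSet/10.80.11.253:27039 --event=restore/2023-05-16T04:30:23.527718212Z
sudo journalctl -u pbm-agent --since "3 days ago"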