Selective restore not working

I have tested the scenarios below:

  1. FULL restore to a new cluster - works fine
  2. Restore a selective database/collection from a full cluster backup - Fails
pbm restore 2024-06-29T09:20:02Z  --ns=ctr["trademarks.gecn"]  2>&1 
[mongod@ip-10-80-11-188 ~]$ pbm status
Cluster:
========
shard3ReplSet:
  - shard3ReplSet/10.80.11.188:27038 [P]: pbm-agent v2.5.0 OK
shard1ReplSet:
  - shard1ReplSet/10.80.11.0:27038 [P]: pbm-agent v2.5.0 OK
configReplSet:
  - configReplSet/10.80.11.0:27039 [P]: pbm-agent v2.5.0 OK
shard2ReplSet:
  - shard2ReplSet/10.80.11.40:27038 [P]: pbm-agent v2.5.0 OK


PITR incremental backup:
========================
Status [OFF]

Currently running:
==================
Snapshot restore "2024-06-29T09:20:02Z", started at 2024-07-03T07:25:28Z. Status: oplog restore. [op id: 6684fce7fbcb73ddba72fb13]

Backups:
========
S3 us-east-1 s3://cm-mongo-db-shared-prod-va/percona/backup/
  Snapshots:
    2024-06-29T09:20:02Z 1.18TB <logical> [restore_to_time: 2024-06-29T16:09:00Z]
  PITR chunks [20.33GB]:
    2024-06-29T16:09:01Z - 2024-07-02T07:02:01Z
[mongod@ip-10-80-11-188 ~]$ pbm logs
2024-07-03T07:25:28Z I [shard3ReplSet/10.80.11.188:27038] [restore/2024-07-03T07:25:27.70086972Z] backup: 2024-06-29T09:20:02Z
2024-07-03T07:25:28Z I [shard3ReplSet/10.80.11.188:27038] [restore/2024-07-03T07:25:27.70086972Z] recovery started
2024-07-03T07:25:28Z I [shard2ReplSet/10.80.11.40:27038] [restore/2024-07-03T07:25:27.70086972Z] moving to state running
2024-07-03T07:25:28Z I [shard1ReplSet/10.80.11.0:27038] [restore/2024-07-03T07:25:27.70086972Z] moving to state running
2024-07-03T07:25:29Z I [configReplSet/10.80.11.0:27039] [restore/2024-07-03T07:25:27.70086972Z] moving to state running
2024-07-03T07:25:29Z I [shard3ReplSet/10.80.11.188:27038] [restore/2024-07-03T07:25:27.70086972Z] moving to state running
2024-07-03T07:25:30Z I [shard2ReplSet/10.80.11.40:27038] [restore/2024-07-03T07:25:27.70086972Z] moving to state dumpDone
2024-07-03T07:25:30Z I [shard3ReplSet/10.80.11.188:27038] [restore/2024-07-03T07:25:27.70086972Z] moving to state dumpDone
2024-07-03T07:25:31Z I [shard1ReplSet/10.80.11.0:27038] [restore/2024-07-03T07:25:27.70086972Z] moving to state dumpDone
2024-07-03T07:25:31Z I [configReplSet/10.80.11.0:27039] [restore/2024-07-03T07:25:27.70086972Z] moving to state dumpDone
2024-07-03T07:25:33Z I [shard1ReplSet/10.80.11.0:27038] [restore/2024-07-03T07:25:27.70086972Z] starting oplog replay
2024-07-03T07:25:33Z I [configReplSet/10.80.11.0:27039] [restore/2024-07-03T07:25:27.70086972Z] starting oplog replay
2024-07-03T07:25:33Z I [shard2ReplSet/10.80.11.40:27038] [restore/2024-07-03T07:25:27.70086972Z] starting oplog replay
2024-07-03T07:25:33Z I [shard3ReplSet/10.80.11.188:27038] [restore/2024-07-03T07:25:27.70086972Z] starting oplog replay
2024-07-03T07:25:46Z I [shard3ReplSet/10.80.11.188:27038] [restore/2024-07-03T07:25:27.70086972Z] oplog replay finished on {1719677340 4}
2024-07-03T07:25:47Z I [shard3ReplSet/10.80.11.188:27038] [restore/2024-07-03T07:25:27.70086972Z] recovery successfully finished
2024-07-03T07:26:24Z I [shard2ReplSet/10.80.11.40:27038] [restore/2024-07-03T07:25:27.70086972Z] oplog replay finished on {1719677340 6}
2024-07-03T07:26:27Z I [shard2ReplSet/10.80.11.40:27038] [restore/2024-07-03T07:25:27.70086972Z] recovery successfully finished
2024-07-03T07:27:25Z I [shard1ReplSet/10.80.11.0:27038] [restore/2024-07-03T07:25:27.70086972Z] oplog replay finished on {1719677340 6}
2024-07-03T07:27:41Z I [shard1ReplSet/10.80.11.0:27038] [restore/2024-07-03T07:25:27.70086972Z] recovery successfully finished

Looks like it's hung on the oplog replay of configReplSet. Please advise.
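(For reference, the `{1719677340 6}` values in the "oplog replay finished" lines are BSON timestamps of the form `{seconds increment}`; converting the seconds part shows the three shards that finished replayed exactly up to the snapshot's `restore_to_time`:)

```python
from datetime import datetime, timezone

# pbm logs print the oplog position as {seconds increment};
# the first field is a Unix timestamp in seconds.
ts_seconds = 1719677340
print(datetime.fromtimestamp(ts_seconds, tz=timezone.utc).isoformat())
# 2024-06-29T16:09:00+00:00, i.e. the backup's restore_to_time
```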

@aranjith0

I see one Jira report, [PBM-1120] - Percona JIRA, against an older PBM (2.1), where a selective restore took a long time to finish in a sharded environment, although it does not appear to be reproducible.

By the way, how big is this collection “trademarks.gecn”?

Can you please share the backup/restore details from the commands below?

pbm describe-backup 2024-06-29T09:20:02Z 
pbm describe-restore 2024-06-29T09:20:02Z

On your config node, you can also check the restore status for that backup:

configReplSet> db.getSiblingDB('admin').pbmRestores.find({name: "2024-06-29T09:20:02Z"})

Did that selective restore complete successfully, is it still stuck, or have you killed the process?

Did you check the shard, config, and mongos/router status and logs around that time? Was any heavy workload or multi-collection transaction running? If so, please try running the same restore during a low-traffic/off-hours period.
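As a sketch (run in mongosh against each primary; the 60-second `secs_running` threshold is an arbitrary example), long-running operations or open transactions can be spotted with `db.currentOp()`:

```
// List active operations running longer than 60 seconds;
// stuck transactions or heavy writes would show up here.
db.currentOp({ active: true, secs_running: { $gt: 60 } }).inprog.forEach(
  op => printjson({ opid: op.opid, ns: op.ns, secs_running: op.secs_running })
);
```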

Did you observe such problems with other collections or selective backups as well, or only with this collection (“trademarks.gecn”)?

Hi @anil.joshi ,
The selective restore was still stuck, and I have killed the process.

Please see my comments below:

How big is this collection “trademarks.gecn”? → It's less than 1 GB.
Did you check the shard, config, mongos/router status and logs around that time?
→ Yes, no abnormal messages were found.
Was any heavy workload or multi-collection transaction running? → No, because I am restoring to a new cluster.
Did you observe such problems with other collections or selective backups as well, or only with this collection (“trademarks.gecn”)?
→ With all collections, even ones smaller in size (KBs). Whenever I restore a selective collection or database from a full cluster backup, I face this issue.

Please schedule a 15-minute call anytime; I will show you the exact scenario I am facing. I appreciate your help.

@aranjith0

Please schedule a 15-minute call anytime; I will show you the exact scenario I am facing.

I am afraid we can't directly offer call services; however, you can opt for our subscription or other services (Services - Percona), where we can provide better support for such requests.

By any chance, have you got the details I asked for previously?

pbm describe-backup 2024-06-29T09:20:02Z 
pbm describe-restore 2024-06-29T09:20:02Z
configReplSet> db.getSiblingDB('admin').pbmRestores.find({name: "2024-06-29T09:20:02Z"})

Would it be possible for you to collect the information below as well, from the primary node?

mongo  --eval "db.getSiblingDB('admin').pbmLock.find().forEach(function(f){printjson(f)})" > pbmLock.meta;
mongo  --eval "db.getSiblingDB('admin').pbmLockOp.find().forEach(function(f){printjson(f)})" > pbmLockOp.meta;
mongo --eval "db.getSiblingDB('admin').pbmOpLog.find().forEach(function(f){printjson(f)})" > pbmOpLog.meta;
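If it helps, the three exports above can be collected in one loop (same commands, just iterated; adjust the `mongo`/`mongosh` binary and connection string to your setup):

```
for coll in pbmLock pbmLockOp pbmOpLog; do
  mongo --quiet --eval "db.getSiblingDB('admin').${coll}.find().forEach(function(f){printjson(f)})" > "${coll}.meta"
done
```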