Selective restore not working

I have tested the below scenarios:

  1. FULL restore to a new cluster - works fine
  2. Restore a selective database/collection from a full cluster backup - Fails
pbm restore 2024-06-29T09:20:02Z  --ns=ctr["trademarks.gecn"]  2>&1 
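(For reference, my understanding of the documented --ns syntax is a plain <db>.<collection> namespace; treating that as an assumption on my part, the equivalent command would be:)

pbm restore 2024-06-29T09:20:02Z --ns="trademarks.gecn" 2>&1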
[mongod@ip-10-80-11-188 ~]$ pbm status
Cluster:
========
shard3ReplSet:
  - shard3ReplSet/10.80.11.188:27038 [P]: pbm-agent v2.5.0 OK
shard1ReplSet:
  - shard1ReplSet/10.80.11.0:27038 [P]: pbm-agent v2.5.0 OK
configReplSet:
  - configReplSet/10.80.11.0:27039 [P]: pbm-agent v2.5.0 OK
shard2ReplSet:
  - shard2ReplSet/10.80.11.40:27038 [P]: pbm-agent v2.5.0 OK


PITR incremental backup:
========================
Status [OFF]

Currently running:
==================
Snapshot restore "2024-06-29T09:20:02Z", started at 2024-07-03T07:25:28Z. Status: oplog restore. [op id: 6684fce7fbcb73ddba72fb13]

Backups:
========
S3 us-east-1 s3://cm-mongo-db-shared-prod-va/percona/backup/
  Snapshots:
    2024-06-29T09:20:02Z 1.18TB <logical> [restore_to_time: 2024-06-29T16:09:00Z]
  PITR chunks [20.33GB]:
    2024-06-29T16:09:01Z - 2024-07-02T07:02:01Z
[mongod@ip-10-80-11-188 ~]$ pbm logs
2024-07-03T07:25:28Z I [shard3ReplSet/10.80.11.188:27038] [restore/2024-07-03T07:25:27.70086972Z] backup: 2024-06-29T09:20:02Z
2024-07-03T07:25:28Z I [shard3ReplSet/10.80.11.188:27038] [restore/2024-07-03T07:25:27.70086972Z] recovery started
2024-07-03T07:25:28Z I [shard2ReplSet/10.80.11.40:27038] [restore/2024-07-03T07:25:27.70086972Z] moving to state running
2024-07-03T07:25:28Z I [shard1ReplSet/10.80.11.0:27038] [restore/2024-07-03T07:25:27.70086972Z] moving to state running
2024-07-03T07:25:29Z I [configReplSet/10.80.11.0:27039] [restore/2024-07-03T07:25:27.70086972Z] moving to state running
2024-07-03T07:25:29Z I [shard3ReplSet/10.80.11.188:27038] [restore/2024-07-03T07:25:27.70086972Z] moving to state running
2024-07-03T07:25:30Z I [shard2ReplSet/10.80.11.40:27038] [restore/2024-07-03T07:25:27.70086972Z] moving to state dumpDone
2024-07-03T07:25:30Z I [shard3ReplSet/10.80.11.188:27038] [restore/2024-07-03T07:25:27.70086972Z] moving to state dumpDone
2024-07-03T07:25:31Z I [shard1ReplSet/10.80.11.0:27038] [restore/2024-07-03T07:25:27.70086972Z] moving to state dumpDone
2024-07-03T07:25:31Z I [configReplSet/10.80.11.0:27039] [restore/2024-07-03T07:25:27.70086972Z] moving to state dumpDone
2024-07-03T07:25:33Z I [shard1ReplSet/10.80.11.0:27038] [restore/2024-07-03T07:25:27.70086972Z] starting oplog replay
2024-07-03T07:25:33Z I [configReplSet/10.80.11.0:27039] [restore/2024-07-03T07:25:27.70086972Z] starting oplog replay
2024-07-03T07:25:33Z I [shard2ReplSet/10.80.11.40:27038] [restore/2024-07-03T07:25:27.70086972Z] starting oplog replay
2024-07-03T07:25:33Z I [shard3ReplSet/10.80.11.188:27038] [restore/2024-07-03T07:25:27.70086972Z] starting oplog replay
2024-07-03T07:25:46Z I [shard3ReplSet/10.80.11.188:27038] [restore/2024-07-03T07:25:27.70086972Z] oplog replay finished on {1719677340 4}
2024-07-03T07:25:47Z I [shard3ReplSet/10.80.11.188:27038] [restore/2024-07-03T07:25:27.70086972Z] recovery successfully finished
2024-07-03T07:26:24Z I [shard2ReplSet/10.80.11.40:27038] [restore/2024-07-03T07:25:27.70086972Z] oplog replay finished on {1719677340 6}
2024-07-03T07:26:27Z I [shard2ReplSet/10.80.11.40:27038] [restore/2024-07-03T07:25:27.70086972Z] recovery successfully finished
2024-07-03T07:27:25Z I [shard1ReplSet/10.80.11.0:27038] [restore/2024-07-03T07:25:27.70086972Z] oplog replay finished on {1719677340 6}
2024-07-03T07:27:41Z I [shard1ReplSet/10.80.11.0:27038] [restore/2024-07-03T07:25:27.70086972Z] recovery successfully finished

Looks like it's hung on the oplog replay of configReplSet. Please advise.
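For what it's worth, a simple way to watch only the config replica set's entries while it sits in that state is a plain grep over the pbm logs output shown above (drop --tail if your build does not have that flag):

pbm logs --tail 200 | grep configReplSet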

@aranjith0

I see one Jira report - [PBM-1120] - Percona JIRA - against an older PBM version [2.1] where a selective restore took a long time to finish in a sharded environment, although it seems to be not reproducible.

By the way, how big is this collection “trademarks.gecn”?

Can you please confirm the backup/restore details using the commands below?

pbm describe-backup 2024-06-29T09:20:02Z 
pbm describe-restore 2024-06-29T09:20:02Z
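A small caveat (my assumption, based on how restore names appear in your logs above): pbm describe-restore expects the restore name rather than the backup name, so if the second command returns nothing, try it with the name from your log lines, e.g.

pbm describe-restore 2024-07-03T07:25:27.70086972Z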

You can also check the restore status for that backup on your config node.

configReplSet> db.getSiblingDB('admin').pbmRestores.find({name: "2024-06-29T09:20:02Z"})

Did that selective restore complete successfully, is it still stuck, or have you killed the process?

Did you check the shard, config, and mongos/router status and logs around that time? Were any heavy workloads or multi-collection transactions running? If so, please try running the same restore during a low-traffic/off-hours period.

Have you observed such problems with other collections or selective backups as well, or only with this collection [“trademarks.gecn”]?

Hi @anil.joshi,
The selective restore was still stuck, and I have killed the process.

Please see my comments below:

How big is this collection “trademarks.gecn”? → It's less than 1 GB.
Did you check the shard, config, mongos/router status and logs around that time?
→ Yes, no abnormal messages found.
Were any heavy workloads / multi-collection transactions running? → No, because I am restoring to a new cluster.
Have you observed such problems with other collections or selective backups as well, or only with this collection [“trademarks.gecn”]?
→ On all collections, even ones smaller in size (KBs). Whenever I restore a selective collection or DB from a full cluster backup, I face this issue.

Please schedule a 15-minute call anytime; I will show you the exact scenario I am facing. Appreciate your help.

@aranjith0

Please schedule a 15-minute call anytime; I will show you the exact scenario I am facing.

I am afraid we can't directly offer call services; however, you can opt for our subscription or other services - Services - Percona - where we can provide better support for such requests.

By any chance, did you get the details I asked for previously?

pbm describe-backup 2024-06-29T09:20:02Z 
pbm describe-restore 2024-06-29T09:20:02Z
configReplSet> db.getSiblingDB('admin').pbmRestores.find({name: "2024-06-29T09:20:02Z"})

Would it also be possible for you to get the below information from the primary node?

mongo  --eval "db.getSiblingDB('admin').pbmLock.find().forEach(function(f){printjson(f)})" > pbmLock.meta;
mongo  --eval "db.getSiblingDB('admin').pbmLockOp.find().forEach(function(f){printjson(f)})" > pbmLockOp.meta;
mongo --eval "db.getSiblingDB('admin').pbmOpLog.find().forEach(function(f){printjson(f)})" > pbmOpLog.meta;

Hello everyone!

I'm allowing myself to reply in this thread because I am also concerned by the subject of selective restore!
I bring two pieces of information that may become issues for Percona, and I end with my own problem, for which I am still looking for a solution!

Context: I am on a psmdb-db / psmdb-operator (MongoDB) cluster deployed on Kubernetes 1.29.10 with the Percona Helm chart 1.17.1, MongoDB 6.0.9-7 and PBM 2.5.0: 1 replica set x 3 instances, 3x cfg, 2 mongos, 1 Percona operator.

  1. First, note that I did not manage to run a restore using a PerconaServerMongoDBRestore resource file!

My file:

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBRestore
metadata:
  name: restore-test-selectif-1
spec:
  clusterName: "psmdb-db-devops-tests"
  backupName: "cron-psmdb-db-devops-tests-20250121051000-jm9mc"
  selective:
    withUsersAndRoles: true
    namespaces:
    - "mydb_xxxxxxxx-xxxx-xxxx-xxxx-8c0aed483482.*"
  # backupSource:
  #   type: "logical"
  #   destination: "s3://data-backups-tests/2025-01-21T05:10:21Z"
  #   s3:
  #     credentialsSecret: "pbm-mongodb-backup-s3"
  #     region: "fr-par"
  #     bucket: "data-backups-staging"
  #     endpointUrl: "https://s3.fr-par.scw.cloud"

The error:

kubectl apply -f pbm-restaure-selectif.yaml
Error from server (BadRequest): error when creating "pbm-restaure-selectif.yaml": PerconaServerMongoDBRestore in version "v1" cannot be handled as a PerconaServerMongoDBRestore: strict decoding error: unknown field "spec.selective"
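One check that might narrow this down (only a guess on my part that the installed CRD simply predates the selective field) is to look for the field in the CRD itself:

kubectl get crd perconaservermongodbrestores.psmdb.percona.com -o yaml | grep -A 5 selective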

So I did the restore with the following command:

pbm restore 2025-01-21T05:10:21Z --ns=mydb_xxxxxxxx-xxxx-xxxx-xxxx-8c0aed483482.* --with-users-and-roles

Maybe an issue should be opened to update the documentation, or to fix this if it is a bug!?

  2. Then, since I could not get the file to work, I did the restore manually with the PBM CLI. When I launched the restore I wanted to follow it with the command “pbm describe-restore”, but I did not find the name of the restore in “pbm status”; I found it in the logs!

In pbm status I see:

Currently running:
==================
Snapshot restore "2025-01-21T05:10:21Z", started at 2025-01-21T11:13:53Z. Status: oplog restore. [op id: 678f81708e508d6f9c1400b0]

In the logs I find the answer:

... got command restore [name: 2025-01-21T11:13:52.595974732Z, snapshot: 2025-01-21T05:10:21Z] <ts: 1737458032>

My command to read the status is then:

pbm describe-restore 2025-01-21T11:13:52.595974732Z
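For anyone else hunting for that name, a plain grep over the log line quoted above does the trick (adjust the pattern if your log format differs):

pbm logs | grep 'got command restore'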

Maybe an issue should be opened to add the name of the restore to the output of “pbm status”.

  3. Finally, the restore was done by the CLI: I saw my DB being restored, but it was still displayed as “in progress” in “pbm status”, and it is impossible to do another restore or backup until this operation is finalized! So my DB has been restored! I can see it with “show dbs” and I can navigate in it! But it appears as unfinished in pbm!

As it is a test cluster and not the production cluster, I have left it as is while looking for a clean solution (see the quick lock check sketched just below).
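To see whether a stale lock is what keeps the operation “in progress”, I imagine I could inspect the PBM lock collections on the config replica set, along the lines of the commands suggested earlier in this thread (a sketch, not a verified fix):

configReplSet> db.getSiblingDB('admin').pbmLock.find()
configReplSet> db.getSiblingDB('admin').pbmLockOp.find()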

Can you help me, please?