Percona oplog-replay is not working after physical base snapshot restore was completed

Hi,

I had 1 MongoDB instance with 3 replica set members, and I had set up the latest pbm-agent on that instance. (My MongoDB is running in VMs, self-managed, community edition.)

Below are the versions on the source MongoDB:

  1. pbm-agent version - 2.12.0

  2. pbm version - pbm cli - 2.12.0

  3. mongod --version - db version v7.0.11-6

  4. mongosh --version - 2.2.10

The target MongoDB where I want to restore the backup has the below versions:

  1. pbm-agent version - 2.12.0

  2. pbm version - pbm cli - 2.12.0

  3. mongod --version - db version v7.0.11-1

  4. mongosh --version - 2.2.10

I ran a full physical backup with PITR enabled on the source, around 160 GB with gzip compression, using the below config:

[vikas@XXXtest ~]$ pbm config
storage:
  type: gcs
  gcs:
    bucket: XXXXX_bkp
    prefix: pbm/stg_v2
    credentials:
      hmacAccessKey: '***'
      hmacSecret: '***'
    chunkSize: 10485760
    retryer:
      backoffInitial: 1s
      backoffMax: 30s
      backoffMultiplier: 2
      maxAttempts: 5
      chunkRetryDeadline: 32s
pitr:
  enabled: false
  compression: gzip
backup:
  oplogSpanMin: 0
  compression: gzip
restore: {}

After the backup completed, pbm status shows the following:

pbm status
Cluster:
========
rs2:
  - XXX03.asiaXXc.c.prod.internal:27017 [P]: pbm-agent [v2.12.0] OK
  - XXX02.asiaXXb.c.prod.internal:27017 [S]: pbm-agent [v2.12.0] OK
  - XXX01.asiaXXa.c.prod.internal:27017 [S]: pbm-agent [v2.12.0] OK


PITR incremental backup:
========================
Status [ON]
Running members: rs2/XXX01.asiaXXa.c.prod.internal:27017; 

Currently running:
==================
(none)

Backups:
========
GCS  gs://XXXX_stg_bkp/pbm/stg_v2
  Snapshots:
    2025-12-11T07:04:59Z 160.63GB <physical> success [restore_to_time: 2025-12-11T07:05:02]
  PITR chunks [160.83MB]:
    2025-12-11T07:05:03 - 2025-12-15T06:30:15

On my new MongoDB instance, I configured the storage with the same GCP creds. Then I ran pbm config --force-resync, and pbm status gave me the same output as above.
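For reference, the resync on the target looked roughly like this (the config file path below is a placeholder for my actual one):

# apply the same storage config on the target, then resync the backup metadata
pbm config --file /etc/pbm/pbm-config.yaml
pbm config --force-resync
pbm status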

Then I ran and tested the restore with the below 2 scenarios:

  1. pbm restore to the exact physical snapshot name - pbm restore 2025-12-11T07:04:59Z
  2. pbm restore to the PITR chunk range using --time - pbm restore --time 2025-12-15T06:30:15

With the 1st option I was able to restore up to 11th Dec, but when I then tried to replay the oplog with pbm oplog-replay --start=2025-12-11T07:05:03 --end=2025-12-15T06:30:15, it did not work.
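The sequence for scenario 1 was roughly as follows:

# scenario 1: restore the base physical snapshot, then replay the oplog on top of it
pbm restore 2025-12-11T07:04:59Z
# (restarted mongod and pbm-agent on all nodes after the physical restore finished)
pbm oplog-replay --start=2025-12-11T07:05:03 --end=2025-12-15T06:30:15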

Logs:

2025-12-11T06:41:35Z E [rs2/ZZZ03.asia-XXXc.c.prod.internal:27017] [agentCheckup] check storage connection: storage check failed with: file stat: list with parts for mw file stat op: fetching pbm file parts base for .pbm.init: get S3 object header: operation error S3: HeadObject, https response error StatusCode: 403, RequestID: , HostID: , api error Forbidden: Forbidden
...

There are many more log lines similar to the above.

I used the same creds for both backup and restore, and the backup worked, so it is definitely not a permissions issue.

My bucket structure at location XXX_stg_bkp/pbm/stg_v2 looks as below:

The path XXX_stg_bkp/pbm/stg_v2/pbmPitr/rs2/20251215 has the below content:

20251215000015-1.20251215001015-1.oplog.gz
Dec 15, 2025, 5:40:15 AM
20251215001015-1.20251215002015-1.oplog.gz
Dec 15, 2025, 5:50:15 AM
20251215002015-1.20251215003015-1.oplog.gz
Dec 15, 2025, 6:00:15 AM
20251215003015-1.20251215004015-1.oplog.gz
Dec 15, 2025, 6:10:15 AM

I also tried different scenarios, such as the 2nd one - pbm restore to the PITR chunk range using --time (pbm restore --time 2025-12-15T06:30:15). I got similar errors, plus the below error which seems new:

2025-12-15T06:26:02Z W [rs2/XXX06-test.asiaXXXc.c.prod.internal:27017] [replay/2025-12-15T06:23:31.57268086Z] failed to download chunk 8388608-16777215
2025-12-15T06:26:04Z E [rs2/XXX06-test.asia-XXXc.c.prod.internal:27017] [replay/2025-12-15T06:23:31.57268086Z] oplog replay: reply oplog: replay chunk 1765436702.1765449737: apply oplog for chunk: applying an entry: op: {"Timestamp":{"T":1765440413,"I":5},"Term":10,"Hash":null,"Version":2,"Operation":"i","Namespace":"stg_common_acs.gatemappings","Object":[{"Key":"_id","Value":"693a7b9df9ee0cb4b05156cc"},{"Key":"createdBy","Value":"eventsync"},{"Key":"updatedBy","Value":"eventsync"},{"Key":"gateId","Value":"693a7797f9ee0cb4b0514fee"},{"Key":"gateGroupId","Value":"693969eac9049198065e6310"},{"Key":"createdDate","Value":"2025-12-11T08:06:53.331Z"},{"Key":"updatedDate","Value":"2025-12-11T08:06:53.331Z"}],"Query":[{"Key":"_id","Value":"693a7b9df9ee0cb4b05156cc"}],"UI":{"Subtype":4,"Data":"jvdFFGH0flMjMQ=="},"LSID":"SXxwsdwdwbizsP+KDqbJUFdWsASXUl/Zvdfdsf0sdwdwdwfHdfsdfoXPEA","TxnNumber":1194,"PrevOpTime":"HAAwfsdfdsfAAwdwddAABJ0AP//////////AA==","MultiOpType":null} | merr <nil>: applyOps: (DuplicateKey) E11000 duplicate key error collection: stg_common_acs.gatemappings index: gateId_1_gateGroupId_1 dup key: { gateId: ObjectId('693a7sdsdsdsdb4b0514fee'), gateGroupId: ObjectId('693961212126310') }
{"ts":1765779812,"s":3,"rs":"rs2","node":"XXXX05-test.asia-XXXb.c.prod.internal:27017","e":"replay","eobj":"2025-12-15T06:23:31.57268086Z","ep":{"T":1765779777,"I":1},"opid":"693fa9634dd170ef13363b31","msg":"time: 2025-12-11T07:05:03Z-2025-12-12T05:40:00Z"}
{"ts":1765779812,"s":3,"rs":"rs2","node":"XXX05-test.asiaXXXb.c.prod.internal:27017","e":"replay","eobj":"2025-12-15T06:23:31.57268086Z","ep":{"T":1765779777,"I":1},"opid":"693fa9634dd170ef13363b31","msg":"node in not suitable for restore"}

I have explored various options, scenarios, etc. No luck.

I can't restore using PITR, either directly or by restoring the base snapshot first and then replaying the oplog.

Hi, the message

"msg":"node in not suitable for restore"

is expected on non-primary servers during the restore. Oplog apply happens only on the primary. Can you check PBM agent logs on the remaining nodes?

Also, regarding your test scenario, applying 5 days of oplog is probably going to take too long, since the apply is single-threaded. Typically you should aim to apply no more than a few hours of oplog. You can use incremental backups to reduce the gap.
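For checking the other nodes, something along these lines should show the replay-related agent entries per node (flag names as per pbm logs --help; adjust the node address to your own):

# pull recent replay-related agent log entries from a specific node
pbm logs --tail 200 --event replay --node rs2/XXX03.asiaXXc.c.prod.internal:27017 --severity D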


I have 2 replica set nodes in MongoDB, and one of them is the primary. Both nodes show the same log message:
"msg":"node in not suitable for restore"

Also, can you share some insights on the below error?

2025-12-15T06:26:04Z E [rs2/XXX06-test.asia-XXXc.c.prod.internal:27017] [replay/2025-12-15T06:23:31.57268086Z] oplog replay: reply oplog: replay chunk 1765436702.1765449737: apply oplog for chunk: applying an entry: op: {"Timestamp":{"T":1765440413,"I":5},"Term":10,"Hash":null,"Version":2,"Operation":"i","Namespace":"stg_common_acs.gatemappings","Object":[{"Key":"_id","Value":"693a7b9df9ee0cb4b05156cc"},{"Key":"createdBy","Value":"eventsync"},{"Key":"updatedBy","Value":"eventsync"},{"Key":"gateId","Value":"693a7797f9ee0cb4b0514fee"},{"Key":"gateGroupId","Value":"693969eac9049198065e6310"},{"Key":"createdDate","Value":"2025-12-11T08:06:53.331Z"},{"Key":"updatedDate","Value":"2025-12-11T08:06:53.331Z"}],"Query":[{"Key":"_id","Value":"693a7b9df9ee0cb4b05156cc"}],"UI":{"Subtype":4,"Data":"jvdFFGH0flMjMQ=="},"LSID":"SXxwsdwdwbizsP+KDqbJUFdWsASXUl/Zvdfdsf0sdwdwdwfHdfsdfoXPEA","TxnNumber":1194,"PrevOpTime":"HAAwfsdfdsfAAwdwddAABJ0AP//////////AA==","MultiOpType":null} | merr <nil>: applyOps: (DuplicateKey) E11000 duplicate key error collection: stg_common_acs.gatemappings index: gateId_1_gateGroupId_1 dup key: { gateId: ObjectId('693a7sdsdsdsdb4b0514fee'), gateGroupId: ObjectId('693961212126310') }

Action items I performed that resulted in these errors are:

  1. Restore the physical backup
  2. Restart the mongod & pbm-agent services
  3. Get inside the mongosh shell to initiate the replica set & add the rs member (see the sketch after this list)
  4. pbm config --force-resync to get the latest state
  5. Then run pbm oplog-replay --start=<> --end=<>
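Roughly, step 3 looked like this (the host names below are placeholders, not my real ones):

# initiate the replica set on the first restored node, then add the second member
mongosh --quiet --eval 'rs.initiate({_id: "rs2", members: [{_id: 0, host: "node01:27017"}]})'
mongosh --quiet --eval 'rs.add("node02:27017")'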

There were no DB modifications after the physical restore.

Also, both nodes show the below error as well.

Hi @vikas.bahuguna ,

As Ivan said, this message indicates that the node is not suitable for writes, i.e. it is not the PRIMARY. Are you sure that all pbm-agents report the same? In that case, have you checked whether a PRIMARY was available at that time? We need to look at all the pbm-agents' logs to analyze the issue properly.

From code:

	if !nodeInfo.IsPrimary {
		l.Info("node in not suitable for restore")
		return
	}
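To confirm which member is PRIMARY while the replay runs, a quick check like the below on each node can help (a sketch; adjust the connection details to your setup):

# is this node currently the writable primary?
mongosh --quiet --eval 'db.hello().isWritablePrimary'
# full replica set view: member names and states
mongosh --quiet --eval 'rs.status().members.forEach(m => print(m.name, m.stateStr))'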

applyOps: (DuplicateKey) E11000 duplicate key error collection: stg_common_acs.gatemappings index: gateId_1_gateGroupId_1 dup key: { gateId: ObjectId('693a7sdsdsdsdb4b0514fee'), gateGroupId: ObjectId('693961212126310') }

The above error shows a duplicate key error while applying the oplog. You can check whether "gateId_1_gateGroupId_1" is a unique index and whether the record already exists before the restore. Also, you need to specify the --start and --end times appropriately here, which is where I suspect the problem lies. You can check the last ts applied from the oplog.rs collection to determine the proper --start time. Refer here. Also, you need to make sure that PITR is not enabled after restoring the backup and before running oplog-replay in the new environment.
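For example, something like the below on the restored primary can help pick the proper --start and verify the index and PITR state (a sketch; the database and collection names are taken from your error message):

# last applied oplog entry on the restored node; its ts is a candidate for --start
mongosh --quiet --eval 'db.getSiblingDB("local").getCollection("oplog.rs").find().sort({$natural: -1}).limit(1).forEach(d => printjson(d.ts))'
# check whether gateId_1_gateGroupId_1 is a unique index on the target collection
mongosh --quiet --eval 'db.getSiblingDB("stg_common_acs").gatemappings.getIndexes().forEach(i => printjson(i))'
# make sure PITR is disabled before running oplog-replay
pbm config --set pitr.enabled=false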

Regards,

Vinodh Guruji