Percona MongoDB restore error: Failed to apply operation due to missing collection

Hello everyone. I have a problem that I just can't solve, and I would really appreciate your help.
I have a MongoDB sharded cluster which I back up with Percona Backup for MongoDB (logical backups).
I'm trying to restore the data to a clean, freshly prepared MongoDB sharded cluster with the same replica set assignments but different server names, and I keep getting the following error over and over again:
'reply oplog: replay chunk 1708916666.1708916678: apply oplog for chunk: applying
  an entry: op: {"Timestamp":{"T":1708916672,"I":48},"Term":1,"Hash":null,"Version":2,"Operation":"i","Namespace":"config.actionlog","Object":[{"Key":"_id","Value":"zst-mongodbcfg-1:27017-2024-02-26T03:04:32.719+00:00-65dbffc06235b23d81fe4e16"},{"Key":"server","Value":"zst-mongodbcfg-1:27017"},{"Key":"shard","Value":"config"},{"Key":"clientAddr","Value":"10.212.3.138:60684"},{"Key":"time","Value":"2024-02-26T03:04:32.719Z"},{"Key":"what","Value":"balancer.stop"},{"Key":"ns","Value":""},{"Key":"details","Value":[]}],"Query":[{"Key":"_id","Value":"zst-mongodbcfg-1:27017-2024-02-26T03:04:32.719+00:00-65dbffc06235b23d81fe4e16"}],"UI":{"Subtype":4,"Data":"QhhVsGwmROm1I685D3uU7A=="},"LSID":null,"TxnNumber":null,"PrevOpTime":null}
  | merr <nil>: applyOps: (NamespaceNotFound) Failed to apply operation due to missing
  collection (421855b0-6c26-44e9-b523-af390f7b94ec): { ts: Timestamp(1708916672, 48),
  t: 1, v: 2, op: "i", ns: "config.actionlog", o: { _id: "zst-mongodbcfg-1:27017-2024-02-26T03:04:32.719+00:00-65dbffc06235b23d81fe4e16",
  server: "zst-mongodbcfg-1:27017", shard: "config", clientAddr: "10.212.3.138:60684",
  time: new Date(1708916672719), what: "balancer.stop", ns: "", details: {} }, o2:
  { _id: "zst-mongodbcfg-1:27017-2024-02-26T03:04:32.719+00:00-65dbffc06235b23d81fe4e16"
  }, ui: UUID("421855b0-6c26-44e9-b523-af390f7b94ec"), h: 0, wall: new Date(0) }'

I have a clean target cluster with the replica sets already created. The restore succeeds on the two shard replica sets, but the config server (CFG) replica set keeps failing with this error!

name: "2024-02-26T17:52:51.481892178Z"
opid: 65dccff3317093a07994d1be
backup: "2024-02-26T03:04:31Z"
type: logical
status: error
error: 'reply oplog: replay chunk 1708916666.1708916678: apply oplog for chunk: applying
  an entry: op: {"Timestamp":{"T":1708916672,"I":48},"Term":1,"Hash":null,"Version":2,"Operation":"i","Namespace":"config.actionlog","Object":[{"Key":"_id","Value":"zst-mongodbcfg-1:27017-2024-02-26T03:04:32.719+00:00-65dbffc06235b23d81fe4e16"},{"Key":"server","Value":"zst-mongodbcfg-1:27017"},{"Key":"shard","Value":"config"},{"Key":"clientAddr","Value":"10.212.3.138:60684"},{"Key":"time","Value":"2024-02-26T03:04:32.719Z"},{"Key":"what","Value":"balancer.stop"},{"Key":"ns","Value":""},{"Key":"details","Value":[]}],"Query":[{"Key":"_id","Value":"zst-mongodbcfg-1:27017-2024-02-26T03:04:32.719+00:00-65dbffc06235b23d81fe4e16"}],"UI":{"Subtype":4,"Data":"QhhVsGwmROm1I685D3uU7A=="},"LSID":null,"TxnNumber":null,"PrevOpTime":null}
  | merr <nil>: applyOps: (NamespaceNotFound) Failed to apply operation due to missing
  collection (421855b0-6c26-44e9-b523-af390f7b94ec): { ts: Timestamp(1708916672, 48),
  t: 1, v: 2, op: "i", ns: "config.actionlog", o: { _id: "zst-mongodbcfg-1:27017-2024-02-26T03:04:32.719+00:00-65dbffc06235b23d81fe4e16",
  server: "zst-mongodbcfg-1:27017", shard: "config", clientAddr: "10.212.3.138:60684",
  time: new Date(1708916672719), what: "balancer.stop", ns: "", details: {} }, o2:
  { _id: "zst-mongodbcfg-1:27017-2024-02-26T03:04:32.719+00:00-65dbffc06235b23d81fe4e16"
  }, ui: UUID("421855b0-6c26-44e9-b523-af390f7b94ec"), h: 0, wall: new Date(0) }'
last_transition_time: "2024-02-26T17:53:06Z"
replsets:
- name: modbcfg
  status: error
  last_transition_time: "2024-02-26T17:53:06Z"
  error: ""
- name: multi-sb
  status: done
  last_transition_time: "2024-02-26T17:53:07Z"
- name: multi-sa
  status: done
  last_transition_time: "2024-02-26T17:53:06Z"

@Dmytro_Zghoba I saw you helped in some other threads. Maybe you can help me. This is really critical for me =(

Hello Maksim, could you please share the pbm restore command you are running, with any additional flags/options? Are you using the PITR feature? Please check this link for more details.

We need to understand how you are running pbm restore first so we can reproduce this error.

This may be related to your oplog records containing a balancerStop command, an exception that may have to be filtered by the restore process.

We will bring this problem to our engineers' attention.

Thank you.

Hello @arlindo.neto
I really hope for your help. I start the restore with the command:
pbm restore 2024-02-17T16:55:49Z --mongodb-uri "mongodb://localhost:27017/"
Also here is my config:
storage:
  type: s3
  s3:
    region: us-east-1
    bucket: backup-zst
    prefix: data/pbm/backup
    credentials:
      access-key-id: *
      secret-access-key: *
    serverSideEncryption:
      sseAlgorithm: aws:kms
      kmsKeyID: *

I tried creating this collection manually and inserting the missing data into it, but it did not help.

hi @Maksim_Galaktionov ,

We have fixed it for the coming release (~ in a week).
This collection does not impact data consistency.

For now, try to run sh.startBalancer(). If the collection is missing, mongod creates it with the proper options.

If you see the error again, check whether it is the same namespace. Please keep us posted.
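Something like this should work as a pre-restore check (a sketch; $MONGOS_URI is a placeholder for your mongos connection string):

# Start the balancer so mongod creates config.actionlog with the proper options
mongosh $MONGOS_URI --quiet --eval "sh.startBalancer()"
# Verify the collection now exists before starting the restore
mongosh $MONGOS_URI --quiet --eval "db.getSiblingDB('config').getCollectionInfos({ 'name': 'actionlog' })"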

@Dmytro_Zghoba

Unfortunately this didn't help me. I can see that the collection has been created and that the balancer start entry appears in it, but during the restore I get the same error again:

Feb 29 09:16:22 zst-mirror-mongodbcfg-1 pbm-agent[12531]: 2024-02-29T09:16:22.000+0000 I [restore/2024-02-29T09:16:16.674337124Z] restoring users and roles
Feb 29 09:16:22 zst-mirror-mongodbcfg-1 pbm-agent[12531]: 2024-02-29T09:16:22.000+0000 I [restore/2024-02-29T09:16:16.674337124Z] moving to state dumpDone
Feb 29 09:16:43 zst-mirror-mongodbcfg-1 pbm-agent[12531]: 2024-02-29T09:16:43.000+0000 I [restore/2024-02-29T09:16:16.674337124Z] starting oplog replay
Feb 29 09:16:43 zst-mirror-mongodbcfg-1 pbm-agent[12531]: 2024-02-29T09:16:43.000+0000 D [restore/2024-02-29T09:16:16.674337124Z] + applying {modbcfg 2024-02-29T03:04:33Z/modbcfg/local.oplog.rs.bson.gz gzip {1709175870 1} {1709175879 4} 0}
Feb 29 09:16:43 zst-mirror-mongodbcfg-1 pbm-agent[12531]: 2024-02-29T09:16:43.000+0000 E [restore/2024-02-29T09:16:16.674337124Z] restore: reply oplog: replay chunk 1709175870.1709175879: apply oplog for chunk: applying an entry: op: {"Timestamp":{"T":1709175874,"I":35},"Term":1,"Hash":null,"Version":2,"Operation":"i","Namespace":"config.actionlog","Object":[{"Key":"_id","Value":"zst-mongodbcfg-1:27017-2024-02-29T03:04:34.642+00:00-65dff4426235b23d81d585d6"},{"Key":"server","Value":"zst-mongodbcfg-1:27017"},{"Key":"shard","Value":"config"},{"Key":"clientAddr","Value":"10.212.4.9:38816"},{"Key":"time","Value":"2024-02-29T03:04:34.642Z"},{"Key":"what","Value":"balancer.stop"},{"Key":"ns","Value":""},{"Key":"details","Value":[]}],"Query":[{"Key":"_id","Value":"zst-mongodbcfg-1:27017-2024-02-29T03:04:34.642+00:00-65dff4426235b23d81d585d6"}],"UI":{"Subtype":4,"Data":"QhhVsGwmROm1I685D3uU7A=="},"LSID":null,"TxnNumber":null,"PrevOpTime":null} | merr <nil>: applyOps: (NamespaceNotFound) Failed to apply operation due to missing collection (421855b0-6c26-44e9-b523-af390f7b94ec): { ts: Timestamp(1709175874, 35), t: 1, v: 2, op: "i", ns: "config.actionlog", o: { _id: "zst-mongodbcfg-1:27017-2024-02-29T03:04:34.642+00:00-65dff4426235b23d81d585d6", server: "zst-mongodbcfg-1:27017", shard: "config", clientAddr: "10.212.4.9:38816", time: new Date(1709175874642), what: "balancer.stop", ns: "", details: {} }, o2: { _id: "zst-mongodbcfg-1:27017-2024-02-29T03:04:34.642+00:00-65dff4426235b23d81d585d6" }, ui: UUID("421855b0-6c26-44e9-b523-af390f7b94ec"), h: 0, wall: new Date(0) }

@Dmytro_Zghoba
Can I assume that my restore was successful and that this error will simply be fixed in the next release?

This collection does not impact data consistency.

No, it is incorrect to think so. All oplog ops should be applied to be sure all data is consistent at the "restore_to" time. Ops can be interleaved between the inserts into config.actionlog and writes to other collections, such as config.chunks (which tells on which shards to find the requested docs for sharded collections).

I do not understand what you mean. Can I somehow exclude this collection from the restore? I didn't find this information in the documentation before writing here =(
Maybe you have an example of how this can be done?

As of today, it is not possible to explicitly exclude database(s)/collection(s).

The only reason for such an error, as far as I know, is a missing collection.

Could you please take a look at the result of mongosh $MONGOS_URI --quiet --eval "db.getSiblingDB('config').getCollectionInfos({ 'name': 'actionlog' })" just before the restore.

It should look like:

[
  {
    name: 'actionlog',
    type: 'collection',
    options: { capped: true, size: 20971520 },
    info: {
      readOnly: false,
      uuid: UUID('d9ff26d2-d356-4eb0-be60-647b8d8754e2')
    },
    idIndex: { v: 2, key: { _id: 1 }, name: '_id_' }
  }
]

OK, give me 10 minutes please. I'll clean everything up and launch a new restore.

@Dmytro_Zghoba

Here is the list of collections before the restore:
show collections

changelog
chunks
collections
databases
image_collection
lockpings
locks
migrations
mongos
reshardingOperations
shards
tags
transactions
version
system.indexBuilds
system.preimages

Here is the list of collections after running the command sh.startBalancer()
sh.startBalancer()

{
  ok: 1,
  '$clusterTime': {
    clusterTime: Timestamp({ t: 1709200441, i: 5 }),
    signature: {
      hash: Binary(Buffer.from("0000000000000000000000000000000000000000", "hex"), 0),
      keyId: Long("0")
    }
  },
  operationTime: Timestamp({ t: 1709200441, i: 5 })
}

show collections

actionlog
changelog
chunks
collections
databases
image_collection
lockpings
locks
migrations
mongos
reshardingOperations
settings
shards
tags
transactions
version
system.indexBuilds
system.preimages

Here is the output for the collection:

db.getSiblingDB('config').getCollectionInfos({ 'name': 'actionlog' })
[
  {
    name: 'actionlog',
    type: 'collection',
    options: { capped: true, size: 20971520 },
    info: {
      readOnly: false,
      uuid: new UUID("7860260f-2c24-4c5e-97ff-77965e1dcefd")
    },
    idIndex: { v: 2, key: { _id: 1 }, name: '_id_' }
  }
]

There is one more nuance that may help in understanding the problem: I don't see this collection in the S3 backup. Is that fine?

It is OK. We do not dump this collection, but we do apply changes to it during the restore. In the coming release, we will ignore the changes for this collection during oplog replay.

To understand it, I will explain at a high level what PBM does.

PBM logical backup consists of two phases:

  1. Data dump: PBM scans and saves each document into backup files for each collection. Some collections are skipped (like config.actionlog) because they do not impact the actual data.
  2. Oplog dump: PBM does not lock writes to the cluster. A document can be inserted, updated, or deleted at any time, so a document can be updated or deleted right after we read it. These writes are held in the oplog.rs collection for replication to secondaries. PBM keeps all of these changes (including writes to config.actionlog) so it knows what was changed between the start and the end of the scan.

Restore also has two phases:

  1. PBM restores all saved collections. config.actionlog is not restored (it was not saved).
  2. PBM replays the changes (applies the document changes) made between the start of the backup and the restore_to time to produce a consistent view of the data as of that time. PBM applies the writes to the config.actionlog collection, but it should not.
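If you want to see which backed-up oplog entries touch this collection, you can pull the oplog chunk from the storage and inspect it. A rough sketch (the bucket/prefix come from your config and the chunk path from your pbm-agent log above; it assumes the AWS CLI and the bsondump tool from mongodb-database-tools are available):

# Download and unpack the backed-up oplog chunk for the config replset
aws s3 cp s3://backup-zst/data/pbm/backup/2024-02-29T03:04:33Z/modbcfg/local.oplog.rs.bson.gz .
gunzip local.oplog.rs.bson.gz
# Dump the oplog entries as JSON and keep only the ones for config.actionlog
bsondump local.oplog.rs.bson | grep '"config.actionlog"'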

OK, in general the scheme is clear, but it's still not clear how I can solve the problem =( Do I need to wait for the new release, or are there any other workarounds right now?
Am I really the only one with this problem? =)

@Dmytro_Zghoba Sorry to keep asking, but is there perhaps some workaround that would solve my problem?

The NamespaceNotFound error is returned by mongod. If you see the namespace but mongod does not see it during the restore, I can imagine only one case: they are different replsets, and the error comes from a shard.

For example, you may indeed have the collection on the configsvr, but be restoring the configsvr data onto a shard.
The shard should not have the namespace (it is only used on the configsvr).

PBM uses replset names to decide which replset's data from the backup is restored to which running replset. Shard names do not affect this logic. This behavior can be changed with the "replset remapping" flag or environment variable.
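For example (a sketch; <backup_name>, <running_rs>, and <backup_rs> are placeholders for your backup name and replset names):

pbm restore <backup_name> --replset-remapping="<running_rs>=<backup_rs>"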

This is interesting.
The replica sets are named the same; otherwise I should have gotten an error like:

Restore on replicaset "multi-cfg" in state: error: extra/unknown replica set found in the backup: modbcfg

I have now tried deploying a cluster with a different replica set name for the config server, using replset name remapping, and I got the same error again:

merr <nil>: applyOps: (NamespaceNotFound) Failed to apply operation due to missing collection

I still don't understand the second part:

For example, you may indeed have the collection on the configsvr, but be restoring the configsvr data onto a shard.
The shard should not have the namespace (it is only used on the configsvr).

Perhaps I did something wrong during the backup? Could you give me more details on how I can find out?

It seems to me that the solution, or at least the understanding of the problem, lies somewhere here.

In short, make sure you restore the configsvr backup to the configsvr.

Let's imagine you have a backup made from the following replsets: "a" (configsvr), "b", "c".
Now you are running a cluster with "b" (configsvr), "c", "d".
PBM will not allow you to run the restore because replset "a" is missing.

You can try to quickly fix the issue by using --replset-remapping="d=a" (meaning "d" ← "a", like port mapping). But in the end, it will not restore the data correctly:

  • "b" is the configsvr in the target cluster, but PBM will restore the shard data from backup "b" onto it. (shard → configsvr)
  • "a" in the backup is from the configsvr, but PBM will restore it to shard "d". (configsvr → shard)

In this case, an error about config.actionlog can show up on shard "d".

The right mapping would be --replset-remapping="b=a,d=b", which means:

  • on the running configsvr "b", restore the data from the backed-up configsvr "a"
  • on the running shard "d", restore the data from "b" (because "b" is used as the configsvr now)

*You can even use a mapping like "b=a,c=b,d=c".
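Put together as a single command for this example (a sketch; <backup_name> is a placeholder for your backup name):

pbm restore <backup_name> --replset-remapping="b=a,d=b"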

OK, I understand. But that cannot be the cause here. I create a completely identical configuration with the same replica set names as in production (modbcfg, multi-sa, multi-sb).
And if I instead change the name of the config server replica set (to multi-cfg) and do a remap (--replset-remapping="multi-cfg=modbcfg"), the error remains the same. Either way, I make sure the CFG replica set is restored to the CFG replica set.