Error when creating a backup in a sharded MongoDB cluster using PBM

Hello.
We are facing an issue where PBM stops working intermittently during backup initiation. In the logs, we are encountering the following error:

2024-03-12T21:00:54Z E [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] init meta: write exception: write errors: [E11000 duplicate key error collection: admin.pbmBackups index: name_1 dup key: { name: "2024-03-12T21:00:53Z" }]
2024-03-12T21:01:27Z I [server0/192.168.0.14:27017] [backup/2024-03-12T21:00:53Z] mark RS as error waiting for start: no backup meta, looks like a leader failed to start:
2024-03-12T21:01:27Z E [server0/192.168.0.14:27017] [backup/2024-03-12T21:00:53Z] backup: waiting for start: no backup meta, looks like a leader failed to start
2024-03-12T21:01:27Z I [server2/192.168.0.28:27017] [backup/2024-03-12T21:00:53Z] mark RS as error waiting for start: no backup meta, looks like a leader failed to start:
2024-03-12T21:01:27Z E [server2/192.168.0.28:27017] [backup/2024-03-12T21:00:53Z] backup: waiting for start: no backup meta, looks like a leader failed to start

Configuration: 3 shards, each with a primary and 2 secondaries, plus 3 config servers.
Percona Backup for MongoDB version 2.3.1

Hi @Illia_Reshetnykov
Welcome to the community!
Looking at the error, the namespace admin.pbmBackups already contains a document with the backup name 2024-03-12T21:00:53Z, and PBM tried to insert another one with the same name. Since the name index is unique, duplicate records should not exist there:

rs1 [direct: primary] admin> db.pbmBackups.getIndexes()
[
  {
    v: 2,
    key: { name: 1 },
    name: 'name_1',
    unique: true,
    sparse: true
  },

Ideally, no duplicate record can be inserted because the index is unique, but could you please check whether the admin.pbmBackups namespace was ever modified manually and duplicate records were added to it?
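
A quick way to check for such duplicates is to group by name (a hedged mongosh sketch; run it against the config server replica set, where PBM keeps its metadata):

// List any backup names that appear more than once in admin.pbmBackups
use admin
db.pbmBackups.aggregate([
  { $group: { _id: "$name", count: { $sum: 1 } } },
  { $match: { count: { $gt: 1 } } }
])
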
We would also like to understand how you ran into this scenario. If you can reproduce it, please share the PBM logs collected with the command below:

pbm logs -e backup/backup_name -s D -t 0

Also, we recommend upgrading PBM to the latest version (v2.4.0 at the time of writing) to get the latest fixes and features.
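
If you install PBM from the Percona repositories, the upgrade is roughly as follows (a sketch for Debian/Ubuntu; the repository component and package names are assumed from the standard PBM install docs, so adjust for your distribution):

# Enable the PBM repository and upgrade the package on every node running pbm-agent
sudo percona-release enable pbm release
sudo apt update && sudo apt install --only-upgrade percona-backup-mongodb
# Restart the agent so the new binary is picked up
sudo systemctl restart pbm-agent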

Thanks,
Mukesh

Hi @Mukesh_Kumar,
Thank you for your response.
We have never manually altered the collections in pbm.
We initiate a daily backup by running the "pbm backup" command over ssh on server-config. The backup may succeed many times in a row, but occasionally it fails with the error mentioned earlier. When we then run "pbm status", we get the following message: "Error: get job status: get backup information: not found." After restarting pbm-agent on all servers, "pbm status" displays the cluster and backup statuses correctly again, and a subsequent "pbm backup" starts the backup process normally.

root@server1 /root# pbm logs -e backup/2024-03-12T21:00:53Z -s D -t 0
2024-03-12T21:00:53Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] init backup meta
2024-03-12T21:00:53Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] nomination list for server2: [[192.168.0.28:27017 192.168.0.26:27017] [192.168.0.27:27017]]
2024-03-12T21:00:53Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] nomination list for server-config: [[192.168.0.16:27017 192.168.0.17:27017] [192.168.0.15:27017]]
2024-03-12T21:00:53Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] nomination list for server0: [[192.168.0.11:27017 192.168.0.14:27017] [192.168.0.12:27017]]
2024-03-12T21:00:53Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] nomination list for server1: [[192.168.0.18:27017 192.168.0.13:27017] [192.168.0.19:27017]]
2024-03-12T21:00:53Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] nomination server-config, set candidates [192.168.0.16:27017 192.168.0.17:27017]
2024-03-12T21:00:53Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] nomination server2, set candidates [192.168.0.28:27017 192.168.0.26:27017]
2024-03-12T21:00:53Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] nomination server1, set candidates [192.168.0.18:27017 192.168.0.13:27017]
2024-03-12T21:00:53Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] nomination server0, set candidates [192.168.0.11:27017 192.168.0.14:27017]
2024-03-12T21:00:53Z I [server1/192.168.0.13:27017] [backup/2024-03-12T21:00:53Z] backup started
2024-03-12T21:00:53Z D [server1/192.168.0.18:27017] [backup/2024-03-12T21:00:53Z] skip: lock not acquired
2024-03-12T21:00:53Z D [server-config/192.168.0.16:27017] [backup/2024-03-12T21:00:53Z] get lock: another operation is running: Resync storage '65f0c28591e525e3e25d37cb'
2024-03-12T21:00:53Z D [server-config/192.168.0.16:27017] [backup/2024-03-12T21:00:53Z] skip: lock not acquired
2024-03-12T21:00:54Z D [server1/192.168.0.19:27017] [backup/2024-03-12T21:00:53Z] skip after nomination, probably started by another node
2024-03-12T21:00:54Z D [server0/192.168.0.11:27017] [backup/2024-03-12T21:00:53Z] skip: lock not acquired
2024-03-12T21:00:54Z I [server0/192.168.0.14:27017] [backup/2024-03-12T21:00:53Z] backup started
2024-03-12T21:00:54Z D [server0/192.168.0.12:27017] [backup/2024-03-12T21:00:53Z] skip after nomination, probably started by another node
2024-03-12T21:00:54Z I [server2/192.168.0.28:27017] [backup/2024-03-12T21:00:53Z] backup started
2024-03-12T21:00:54Z D [server2/192.168.0.26:27017] [backup/2024-03-12T21:00:53Z] skip: lock not acquired
2024-03-12T21:00:54Z E [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] init meta: write exception: write errors: [E11000 duplicate key error collection: admin.pbmBackups index: name_1 dup key: { name: "2024-03-12T21:00:53Z" }]
2024-03-12T21:00:54Z D [server2/192.168.0.27:27017] [backup/2024-03-12T21:00:53Z] skip after nomination, probably started by another node
2024-03-12T21:00:54Z D [server2/192.168.0.27:27017] [backup/2024-03-12T21:00:53Z] skip after nomination, probably started by another node
2024-03-12T21:00:54Z D [server1/192.168.0.13:27017] [backup/2024-03-12T21:00:53Z] skip after nomination, probably started by another node
2024-03-12T21:00:54Z D [server1/192.168.0.18:27017] [backup/2024-03-12T21:00:53Z] skip after nomination, probably started by another node
2024-03-12T21:00:54Z D [server0/192.168.0.12:27017] [backup/2024-03-12T21:00:53Z] skip after nomination, probably started by another node
2024-03-12T21:00:54Z D [server-config/192.168.0.16:27017] [backup/2024-03-12T21:00:53Z] get lock: another operation is running: Resync storage '65f0c28591e525e3e25d37cb'
2024-03-12T21:00:54Z D [server-config/192.168.0.16:27017] [backup/2024-03-12T21:00:53Z] skip: lock not acquired
2024-03-12T21:00:58Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] nomination server-config, set candidates [192.168.0.15:27017]
2024-03-12T21:00:58Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] nomination server1, set candidates [192.168.0.19:27017]
2024-03-12T21:00:58Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] nomination server2, set candidates [192.168.0.27:27017]
2024-03-12T21:00:58Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] nomination server0, set candidates [192.168.0.12:27017]
2024-03-12T21:01:08Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] nomination timeout
2024-03-12T21:01:08Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] skip after nomination, probably started by another node
2024-03-12T21:01:09Z D [server1/192.168.0.19:27017] [backup/2024-03-12T21:00:53Z] nomination timeout
2024-03-12T21:01:09Z D [server0/192.168.0.14:27017] [backup/2024-03-12T21:00:53Z] nomination timeout
2024-03-12T21:01:09Z D [server0/192.168.0.11:27017] [backup/2024-03-12T21:00:53Z] nomination timeout
2024-03-12T21:01:09Z D [server1/192.168.0.19:27017] [backup/2024-03-12T21:00:53Z] skip after nomination, probably started by another node
2024-03-12T21:01:09Z D [server0/192.168.0.14:27017] [backup/2024-03-12T21:00:53Z] skip after nomination, probably started by another node
2024-03-12T21:01:09Z D [server0/192.168.0.11:27017] [backup/2024-03-12T21:00:53Z] skip after nomination, probably started by another node
2024-03-12T21:01:27Z I [server0/192.168.0.14:27017] [backup/2024-03-12T21:00:53Z] mark RS as error waiting for start: no backup meta, looks like a leader failed to start:
2024-03-12T21:01:27Z E [server0/192.168.0.14:27017] [backup/2024-03-12T21:00:53Z] backup: waiting for start: no backup meta, looks like a leader failed to start
2024-03-12T21:01:27Z D [server0/192.168.0.14:27017] [backup/2024-03-12T21:00:53Z] releasing lock
2024-03-12T21:01:27Z I [server2/192.168.0.28:27017] [backup/2024-03-12T21:00:53Z] mark RS as error waiting for start: no backup meta, looks like a leader failed to start:
2024-03-12T21:01:27Z E [server2/192.168.0.28:27017] [backup/2024-03-12T21:00:53Z] backup: waiting for start: no backup meta, looks like a leader failed to start
2024-03-12T21:01:27Z D [server2/192.168.0.28:27017] [backup/2024-03-12T21:00:53Z] releasing lock
2024-03-12T21:01:29Z D [server-config/192.168.0.17:27017] [backup/2024-03-12T21:00:53Z] nomination timeout
2024-03-12T21:01:29Z D [server-config/192.168.0.17:27017] [backup/2024-03-12T21:00:53Z] skip after nomination, probably started by another node

Is it safe to manually clear collections associated with pbm?

Hi @Illia_Reshetnykov
Glad to know that PBM is working fine for you now!
I can't say for certain what went wrong, but it looks like the pbmBackups collection already contained a record with the backup name 2024-03-12T21:00:53Z when PBM tried to create another one with the same name. This race condition can occur when a backup is started by one pbm-agent, but that agent is killed (for some reason) right after inserting the record into the collection, and another pbm-agent then tries to create a backup with the same name. This all happens within milliseconds, and because the backup name only has second-level resolution (YYYY-MM-DDTHH:MM:SSZ), the second attempt cannot choose a different name, which leads to the error:

2024-03-12T21:00:54Z E [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] init meta: write exception: write errors: [E11000 duplicate key error collection: admin.pbmBackups index: name_1 dup key: { name: "2024-03-12T21:00:53Z" }]

Now, regarding cleaning up the collections associated with PBM: we recommend not altering or removing these collections manually. They hold important information about pbm-agents, the PBM configuration, and backups (logical/physical/oplog/PITR chunks), so it is always advisable to let PBM manage its own collections. Keeping them intact also helps with troubleshooting in some circumstances.
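
If you ever do need to remove stale or broken backup entries, it is safer to go through the PBM CLI so that the metadata and the files in the remote storage stay consistent. A hedged sketch (flag names as per the PBM 2.x docs; check pbm help on your version):

# Delete a single backup by its name
pbm delete-backup 2024-03-12T21:00:53Z
# Or delete everything older than a given point in time
pbm delete-backup --older-than 2024-03-01
# PBM 2.x also provides a cleanup command that prunes backups and unneeded oplog chunks
pbm cleanup --older-than 2024-03-01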

Thanks,
Mukesh

Hello @Mukesh_Kumar
Could you explain: we run the "pbm backup" command on only one server, so how can a pbm-agent located on another server try to start making a backup?

We have also updated pbm-agent to the latest version and will check whether the error occurs again.

Hi @Illia_Reshetnykov
Regarding
Could you explain: we run the "pbm backup" command on only one server, so how can a pbm-agent located on another server try to start making a backup?
No, that is not quite how it works.
pbm backup is a PBM CLI command, while actually taking the backup is done by the pbm-agent processes, which is a separate thing.
I would suggest going through the PBM architecture documentation to see how it works and what role pbm-agent plays; that will give you a better picture.
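
In short, the pbm CLI only enqueues a command in PBM's control collections on the config server replica set; every pbm-agent watches that queue, and the agents then decide among themselves which node actually performs the backup. As a rough illustration (assuming the default control collection admin.pbmCmd; the exact document layout can differ between PBM versions), you can peek at the queue with mongosh:

// Show the most recent commands enqueued by the PBM CLI
use admin
db.pbmCmd.find().sort({ $natural: -1 }).limit(3)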

Thanks,
Mukesh

Hi @Mukesh_Kumar
Today all the agents crashed again. We start the backup process like this:
First, we set the PBM configuration, specifying parameters such as the storage type, endpoint URL, region, bucket, access key ID, secret access key, retry settings, maximum number of upload parts, and upload part size:
pbm config --set storage.type=s3 --set storage.s3.endpointUrl="*********" --set storage.s3.region=us-cetral-1 --set storage.s3.bucket=********* --set storage.s3.credentials.access-key-id=************ --set storage.s3.credentials.secret-access-key=*********** --set storage.s3.retryer.numMaxRetries=10 --set storage.s3.maxUploadParts=10000 --set storage.s3.prefix="" --set storage.s3.uploadPartSize=100
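
(For reference, the same settings can also be applied from a YAML file, which is easier to review and keep in version control. A sketch with placeholder values, assuming the storage.s3.* key layout used above:)

cat > /tmp/pbm_config.yaml <<'EOF'
storage:
  type: s3
  s3:
    region: <region>
    bucket: <bucket-name>
    prefix: ""
    endpointUrl: <endpoint-url>
    credentials:
      access-key-id: <access-key-id>
      secret-access-key: <secret-access-key>
    maxUploadParts: 10000
    uploadPartSize: 100
    retryer:
      numMaxRetries: 10
EOF
pbm config --file /tmp/pbm_config.yaml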

We then ran the PBM backup command with a mechanism to retry running the backup if necessary:

pbm backup || (sleep 60; pbm backup)

We observed the following entries in the pbm-agent log:

root@config0 /home/ireshetnykov # pbm logs -e backup/2024-03-28T21:01:19Z -s D -t 0
2024-03-28T21:01:19Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] init backup meta
2024-03-28T21:01:19Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] nomination list for server1: [[192.168.0.18:27017 192.168.0.13:27017] [192.168.0.19:27017]]
2024-03-28T21:01:19Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] nomination list for server-config: [[192.168.0.16:27017 192.168.0.15:27017] [192.168.0.17:27017]]
2024-03-28T21:01:19Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] nomination list for server0: [[192.168.0.11:27017 192.168.0.14:27017] [192.168.0.12:27017]]
2024-03-28T21:01:19Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] nomination list for server2: [[192.168.0.28:27017 192.168.0.26:27017] [192.168.0.27:27017]]
2024-03-28T21:01:19Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] nomination server1, set candidates [192.168.0.18:27017 192.168.0.13:27017]
2024-03-28T21:01:19Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] nomination server-config, set candidates [192.168.0.16:27017 192.168.0.15:27017]
2024-03-28T21:01:19Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] nomination server0, set candidates [192.168.0.11:27017 192.168.0.14:27017]
2024-03-28T21:01:19Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] nomination server2, set candidates [192.168.0.28:27017 192.168.0.26:27017]
2024-03-28T21:01:20Z D [server0/192.168.0.11:27017] [backup/2024-03-28T21:01:19Z] skip: lock not acquired
2024-03-28T21:01:20Z D [server2/192.168.0.28:27017] [backup/2024-03-28T21:01:19Z] skip: lock not acquired
2024-03-28T21:01:20Z D [server-config/192.168.0.15:27017] [backup/2024-03-28T21:01:19Z] get lock: another operation is running: Resync storage '6605da9e4d597296b962f018'
2024-03-28T21:01:20Z D [server1/192.168.0.13:27017] [backup/2024-03-28T21:01:19Z] skip: lock not acquired
2024-03-28T21:01:20Z I [server0/192.168.0.14:27017] [backup/2024-03-28T21:01:19Z] backup started
2024-03-28T21:01:20Z I [server2/192.168.0.26:27017] [backup/2024-03-28T21:01:19Z] backup started
2024-03-28T21:01:20Z D [server-config/192.168.0.15:27017] [backup/2024-03-28T21:01:19Z] skip: lock not acquired
2024-03-28T21:01:20Z I [server1/192.168.0.18:27017] [backup/2024-03-28T21:01:19Z] backup started
2024-03-28T21:01:20Z D [server2/192.168.0.27:27017] [backup/2024-03-28T21:01:19Z] skip after nomination, probably started by another node
2024-03-28T21:01:20Z D [server1/192.168.0.19:27017] [backup/2024-03-28T21:01:19Z] skip after nomination, probably started by another node
2024-03-28T21:01:20Z D [server0/192.168.0.12:27017] [backup/2024-03-28T21:01:19Z] skip after nomination, probably started by another node
2024-03-28T21:01:24Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] nomination server1, set candidates [192.168.0.19:27017]
2024-03-28T21:01:24Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] nomination server-config, set candidates [192.168.0.17:27017]
2024-03-28T21:01:24Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] nomination server0, set candidates [192.168.0.12:27017]
2024-03-28T21:01:24Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] nomination server2, set candidates [192.168.0.27:27017]
2024-03-28T21:01:44Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] nomination timeout
2024-03-28T21:01:44Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] skip after nomination, probably started by another node
2024-03-28T21:02:00Z D [server-config/192.168.0.16:27017] [backup/2024-03-28T21:01:19Z] nomination timeout
2024-03-28T21:02:00Z D [server-config/192.168.0.16:27017] [backup/2024-03-28T21:01:19Z] skip after nomination, probably started by another node

After a while, when we tried to check the status of the backup, the "pbm status" command returned the error message "Error: get job status: get backup information: not found." Restarting all the agents in the cluster fixed the status display, but the backup itself was never made. Could you give some insight into what could be causing this behavior?

Hi @Illia_Reshetnykov
Regarding the command below: there is no need to start another backup within 60 seconds. Are you sure your backup will complete within 60 seconds? If not, that retry does not make much sense.

We then ran the PBM backup command with a mechanism to retry running the backup if necessary:

pbm backup || (sleep 60; pbm backup)

As you mentioned that the agents crashed again, could you share the pbm-agent logs along with your pbm config and the PBM version?
Also, we could not see any issue in the pbm logs above. It looks like a backup was already in progress and, from the logs, it seems to have been running fine.

Before starting a backup, check pbm status to make sure all the agents are up and no other backup or operation is running.
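
A minimal pre-flight sketch along those lines (it assumes the "Currently running" section of pbm status prints "(none)" when the cluster is idle; adjust the greps to the output of your PBM version):

#!/bin/bash
# Abort the backup if pbm status fails or another PBM operation is still running
status=$(pbm status) || { echo "pbm status failed, agents may be down" >&2; exit 1; }
if ! echo "$status" | grep -A2 -i 'Currently running' | grep -q '(none)'; then
  echo "another PBM operation (e.g. a resync) is still running, retry later" >&2
  exit 1
fi
pbm backup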

Thanks,
Mukesh

Hello, @Mukesh_Kumar

During testing, we found that the problem occurs when the pbm backup command is run immediately after pbm config. We believe the reason is that the resync triggered by pbm config has not had time to finish while pbm backup is trying to insert a new backup entry. The workaround was to add a sleep 300 after pbm config. In two weeks of running with this configuration, pbm-agent has not failed once.
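
(As a possible refinement, instead of a fixed sleep 300 one could poll pbm status until the resync has finished; a rough sketch, assuming a running resync is reported in the pbm status output while it is in progress:)

# Wait up to ~10 minutes for the resync triggered by "pbm config" to clear, then back up
for i in $(seq 1 60); do
  pbm status | grep -qi 'resync' || break
  sleep 10
done
pbm backup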

Hi @Illia_Reshetnykov
No, that should not be the problem; we have already tested this scenario ourselves.

However, we would ask you to raise this issue in a new forum topic, including the debug logs, the PBM version, and so on.

We will look into it further.

Thanks,
Mukesh