Hello.
We are facing an issue where PBM stops working intermittently during backup initiation. In the logs, we are encountering the following error:
2024-03-12T21:00:54Z E [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] init meta: write exception: write errors: [E11000 duplicate key error collection: admin.pbmBackups index: name_1 dup key: { name: "2024-03-12T21:00:53Z" }]
2024-03-12T21:01:27Z I [server0/192.168.0.14:27017]
[backup/2024-03-12T21:00:53Z] mark RS as error waiting for start: no backup meta, looks like a leader failed to start
:
2024-03-12T21:01:27Z E [server0/192.168.0.14:27017] [backup/2024-03-12T21:00:53Z] backup: waiting for start: no backup meta, looks like a leader failed to start
2024-03-12T21:01:27Z I [server2/192.168.0.28:27017] [backup/2024-03-12T21:00:53Z] mark RS as error waiting for start: no backup meta, looks like a leader failed to start
:
2024-03-12T21:01:27Z E [server2/192.168.0.28:27017] [backup/2024-03-12T21:00:53Z] backup: waiting for start: no backup meta, looks like a leader failed to start
Configuration: 3 shards, each with one primary and two secondaries, plus 3 config servers.
Percona Backup for MongoDB version 2.3.1
Hi @Illia_Reshetnykov
Welcome to the community!
Upon reviewing the error, we can see that the namespace admin.pbmBackups has a duplicate key for the backup name 2024-03-12T21:00:53Z. Since the name key is unique, it shouldn't contain duplicate records.
rs1 [direct: primary] admin> db.pbmBackups.getIndexes()
[
{
v: 2,
key: { name: 1 },
name: 'name_1',
unique: true,
sparse: true
},
Ideally, no duplicate record should ever be inserted because the index is unique, but could you please check whether any manual changes were made to the admin.pbmBackups namespace that added duplicate records to it?
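If it helps, one quick way to look for duplicates is an aggregation over admin.pbmBackups. The following is only a sketch: the aggregation groups documents by name and reports any name that appears more than once (the connection string is yours to fill in; run it with mongosh against the config server replica set primary).

```shell
# Sketch: group admin.pbmBackups documents by backup name and keep
# only names that occur more than once (there should be none).
dup_check='
  db.getSiblingDB("admin").pbmBackups.aggregate([
    { $group: { _id: "$name", count: { $sum: 1 } } },
    { $match: { count: { $gt: 1 } } }
  ])
'
# Usage against a live cluster (hypothetical connection string):
#   mongosh "mongodb://<csrs-primary>:27017" --eval "$dup_check"
printf '%s\n' "$dup_check"
```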
We would also like to know how you ran into this scenario. If you can reproduce it, please share the pbm logs collected with:
pbm logs -e backup/backup_name -s D -t 0
Also, we recommend upgrading PBM to the latest version (v2.4.0) to get the latest fixes and features.
Thanks,
Mukesh
Hi @Mukesh_Kumar,
Thank you for your response.
We have never manually altered the collections in pbm.
We initiate a daily backup by running the pbm backup command over SSH on server-config. The backup may succeed many times in a row, but occasionally it fails with the error mentioned earlier. When we then run pbm status, we get the following message: "Error: get job status: get backup information: not found." After restarting pbm-agent on all servers, pbm status properly displays the cluster and backup statuses, and the pbm backup command then starts the backup process correctly.
root@server1 /root# pbm logs -e backup/2024-03-12T21:00:53Z -s D -t 0
2024-03-12T21:00:53Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] init backup meta
2024-03-12T21:00:53Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] nomination list for server2: [[192.168.0.28:27017 192.168.0.26:27017] [192.168.0.27:27017]]
2024-03-12T21:00:53Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] nomination list for server-config: [[192.168.0.16:27017 192.168.0.17:27017] [192.168.0.15:27017]]
2024-03-12T21:00:53Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] nomination list for server0: [[192.168.0.11:27017 192.168.0.14:27017] [192.168.0.12:27017]]
2024-03-12T21:00:53Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] nomination list for server1: [[192.168.0.18:27017 192.168.0.13:27017] [192.168.0.19:27017]]
2024-03-12T21:00:53Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] nomination server-config, set candidates [192.168.0.16:27017 192.168.0.17:27017]
2024-03-12T21:00:53Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] nomination server2, set candidates [192.168.0.28:27017 192.168.0.26:27017]
2024-03-12T21:00:53Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] nomination server1, set candidates [192.168.0.18:27017 192.168.0.13:27017]
2024-03-12T21:00:53Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] nomination server0, set candidates [192.168.0.11:27017 192.168.0.14:27017]
2024-03-12T21:00:53Z I [server1/192.168.0.13:27017] [backup/2024-03-12T21:00:53Z] backup started
2024-03-12T21:00:53Z D [server1/192.168.0.18:27017] [backup/2024-03-12T21:00:53Z] skip: lock not acquired
2024-03-12T21:00:53Z D [server-config/192.168.0.16:27017] [backup/2024-03-12T21:00:53Z] get lock: another operation is running: Resync storage '65f0c28591e525e3e25d37cb'
2024-03-12T21:00:53Z D [server-config/192.168.0.16:27017] [backup/2024-03-12T21:00:53Z] skip: lock not acquired
2024-03-12T21:00:54Z D [server1/192.168.0.19:27017] [backup/2024-03-12T21:00:53Z] skip after nomination, probably started by another node
2024-03-12T21:00:54Z D [server0/192.168.0.11:27017] [backup/2024-03-12T21:00:53Z] skip: lock not acquired
2024-03-12T21:00:54Z I [server0/192.168.0.14:27017] [backup/2024-03-12T21:00:53Z] backup started
2024-03-12T21:00:54Z D [server0/192.168.0.12:27017] [backup/2024-03-12T21:00:53Z] skip after nomination, probably started by another node
2024-03-12T21:00:54Z I [server2/192.168.0.28:27017] [backup/2024-03-12T21:00:53Z] backup started
2024-03-12T21:00:54Z D [server2/192.168.0.26:27017] [backup/2024-03-12T21:00:53Z] skip: lock not acquired
2024-03-12T21:00:54Z E [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] init meta: write exception: write errors: [E11000 duplicate key error collection: admin.pbmBackups index: name_1 dup key: { name: "2024-03-12T21:00:53Z" }]
2024-03-12T21:00:54Z D [server2/192.168.0.27:27017] [backup/2024-03-12T21:00:53Z] skip after nomination, probably started by another node
2024-03-12T21:00:54Z D [server2/192.168.0.27:27017] [backup/2024-03-12T21:00:53Z] skip after nomination, probably started by another node
2024-03-12T21:00:54Z D [server1/192.168.0.13:27017] [backup/2024-03-12T21:00:53Z] skip after nomination, probably started by another node
2024-03-12T21:00:54Z D [server1/192.168.0.18:27017] [backup/2024-03-12T21:00:53Z] skip after nomination, probably started by another node
2024-03-12T21:00:54Z D [server0/192.168.0.12:27017] [backup/2024-03-12T21:00:53Z] skip after nomination, probably started by another node
2024-03-12T21:00:54Z D [server-config/192.168.0.16:27017] [backup/2024-03-12T21:00:53Z] get lock: another operation is running: Resync storage '65f0c28591e525e3e25d37cb'
2024-03-12T21:00:54Z D [server-config/192.168.0.16:27017] [backup/2024-03-12T21:00:53Z] skip: lock not acquired
2024-03-12T21:00:58Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] nomination server-config, set candidates [192.168.0.15:27017]
2024-03-12T21:00:58Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] nomination server1, set candidates [192.168.0.19:27017]
2024-03-12T21:00:58Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] nomination server2, set candidates [192.168.0.27:27017]
2024-03-12T21:00:58Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] nomination server0, set candidates [192.168.0.12:27017]
2024-03-12T21:01:08Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] nomination timeout
2024-03-12T21:01:08Z D [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] skip after nomination, probably started by another node
2024-03-12T21:01:09Z D [server1/192.168.0.19:27017] [backup/2024-03-12T21:00:53Z] nomination timeout
2024-03-12T21:01:09Z D [server0/192.168.0.14:27017] [backup/2024-03-12T21:00:53Z] nomination timeout
2024-03-12T21:01:09Z D [server0/192.168.0.11:27017] [backup/2024-03-12T21:00:53Z] nomination timeout
2024-03-12T21:01:09Z D [server1/192.168.0.19:27017] [backup/2024-03-12T21:00:53Z] skip after nomination, probably started by another node
2024-03-12T21:01:09Z D [server0/192.168.0.14:27017] [backup/2024-03-12T21:00:53Z] skip after nomination, probably started by another node
2024-03-12T21:01:09Z D [server0/192.168.0.11:27017] [backup/2024-03-12T21:00:53Z] skip after nomination, probably started by another node
2024-03-12T21:01:27Z I [server0/192.168.0.14:27017] [backup/2024-03-12T21:00:53Z] mark RS as error waiting for start: no backup meta, looks like a leader failed to start
:
2024-03-12T21:01:27Z E [server0/192.168.0.14:27017] [backup/2024-03-12T21:00:53Z] backup: waiting for start: no backup meta, looks like a leader failed to start
2024-03-12T21:01:27Z D [server0/192.168.0.14:27017] [backup/2024-03-12T21:00:53Z] releasing lock
2024-03-12T21:01:27Z I [server2/192.168.0.28:27017] [backup/2024-03-12T21:00:53Z] mark RS as error waiting for start: no backup meta, looks like a leader failed to start
:
2024-03-12T21:01:27Z E [server2/192.168.0.28:27017] [backup/2024-03-12T21:00:53Z] backup: waiting for start: no backup meta, looks like a leader failed to start
2024-03-12T21:01:27Z D [server2/192.168.0.28:27017] [backup/2024-03-12T21:00:53Z] releasing lock
2024-03-12T21:01:29Z D [server-config/192.168.0.17:27017] [backup/2024-03-12T21:00:53Z] nomination timeout
2024-03-12T21:01:29Z D [server-config/192.168.0.17:27017] [backup/2024-03-12T21:00:53Z] skip after nomination, probably started by another node
Is it safe to manually clear collections associated with pbm?
Hi @Illia_Reshetnykov
Glad to know that now, pbm is working fine!
I can't confirm what went wrong, but it looks like the pbmBackups collection already had a record with the backup name 2024-03-12T21:00:53Z when another record with the same name was inserted. This race condition may occur when a backup is started by one pbm-agent that is killed (for some reason) just after adding the record to the collection, and another pbm-agent then tries to create a backup with the same name. This likely happened within a few milliseconds, and since the backup name only has second-level resolution (YYYY-MM-DDTHH:MM:SSZ), the retry could not create a backup with a different name, which caused the error:
2024-03-12T21:00:54Z E [server-config/192.168.0.15:27017] [backup/2024-03-12T21:00:53Z] init meta: write exception: write errors: [E11000 duplicate key error collection: admin.pbmBackups index: name_1 dup key: { name: "2024-03-12T21:00:53Z" }]
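The second-level resolution can be illustrated with a small shell sketch. This is not PBM's actual code, only the same date format seen in the logs: any two name generations that land in the same second produce the identical string, which is exactly what the unique name_1 index rejects.

```shell
# Backup names use the UTC start time truncated to whole seconds,
# in the same YYYY-MM-DDTHH:MM:SSZ format seen in the logs above.
backup_name() { date -u +%Y-%m-%dT%H:%M:%SZ; }

a=$(backup_name)
b=$(backup_name)   # a near-immediate retry
if [ "$a" = "$b" ]; then
  echo "same second -> same name: $a"
else
  echo "crossed a second boundary: $a vs $b"
fi
```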
Now, regarding cleaning up the collections associated with PBM: we recommend not altering or removing these collections manually. They contain important information about pbm-agents, PBM configuration, and backup-related data such as logical/physical/oplog/PITR chunks, so it is always advisable to let PBM manage its own collections. They also help with troubleshooting in some circumstances.
Thanks,
Mukesh
Hello @Mukesh_Kumar
Tell me: we run the "pbm backup" command on only one server, so how can a pbm-agent located on another server try to start a backup?
We have also updated the pbm-agent to the latest version and will monitor whether the error recurs.
Hi @Illia_Reshetnykov
Regarding:
Tell me: we run the "pbm backup" command on only one server, so how can a pbm-agent located on another server try to start a backup?
That's not quite how it works. "pbm backup" is a PBM CLI command that only requests a backup; the actual backup work is done by the pbm-agent processes, which is a different thing. I would suggest going through the PBM architecture documentation to see how the CLI and pbm-agent interact; you will then have a better idea of it.
Thanks,
Mukesh
Hi @Mukesh_Kumar
Today all the agents failed again. We start the backup process like this:
First, we run the PBM configuration command, specifying parameters such as the storage type, endpoint URL, region, bucket, access key ID, secret access key, retry settings, maximum number of upload parts, and upload part size:
pbm config --set storage.type=s3 --set storage.s3.endpointUrl="*********" --set storage.s3.region=us-central-1 --set storage.s3.bucket=********* --set storage.s3.credentials.access-key-id=************ --set storage.s3.credentials.secret-access-key=*********** --set storage.s3.retryer.numMaxRetries=10 --set storage.s3.maxUploadParts=10000 --set storage.s3.prefix="" --set storage.s3.uploadPartSize=100
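As a side note, the same settings can also be kept in a YAML file and applied in one go with pbm config --file. Below is only a sketch with placeholder values (endpoint, bucket, and keys are hypothetical); the key names mirror the --set paths above.

```shell
# Write the PBM storage configuration to a file (placeholder values).
cat > pbm_config.yaml <<'EOF'
storage:
  type: s3
  s3:
    endpointUrl: "https://example-endpoint"   # placeholder
    region: us-central-1
    bucket: my-bucket                         # placeholder
    prefix: ""
    credentials:
      access-key-id: "XXXX"                   # placeholder
      secret-access-key: "XXXX"               # placeholder
    retryer:
      numMaxRetries: 10
    maxUploadParts: 10000
    uploadPartSize: 100
EOF
# Apply it (requires a live cluster):
#   pbm config --file pbm_config.yaml
```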
We then ran the pbm backup command with a simple retry in case the first attempt fails:
pbm backup || (sleep 60; pbm backup)
We observed the following entries in the pbm-agent log:
root@config0 /home/ireshetnykov # pbm logs -e backup/2024-03-28T21:01:19Z -s D -t 0
2024-03-28T21:01:19Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] init backup meta
2024-03-28T21:01:19Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] nomination list for server1: [[192.168.0.18:27017 192.168.0.13:27017] [192.168.0.19:27017]]
2024-03-28T21:01:19Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] nomination list for server-config: [[192.168.0.16:27017 192.168.0.15:27017] [192.168.0.17:27017]]
2024-03-28T21:01:19Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] nomination list for server0: [[192.168.0.11:27017 192.168.0.14:27017] [192.168.0.12:27017]]
2024-03-28T21:01:19Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] nomination list for server2: [[192.168.0.28:27017 192.168.0.26:27017] [192.168.0.27:27017]]
2024-03-28T21:01:19Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] nomination server1, set candidates [192.168.0.18:27017 192.168.0.13:27017]
2024-03-28T21:01:19Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] nomination server-config, set candidates [192.168.0.16:27017 192.168.0.15:27017]
2024-03-28T21:01:19Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] nomination server0, set candidates [192.168.0.11:27017 192.168.0.14:27017]
2024-03-28T21:01:19Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] nomination server2, set candidates [192.168.0.28:27017 192.168.0.26:27017]
2024-03-28T21:01:20Z D [server0/192.168.0.11:27017] [backup/2024-03-28T21:01:19Z] skip: lock not acquired
2024-03-28T21:01:20Z D [server2/192.168.0.28:27017] [backup/2024-03-28T21:01:19Z] skip: lock not acquired
2024-03-28T21:01:20Z D [server-config/192.168.0.15:27017] [backup/2024-03-28T21:01:19Z] get lock: another operation is running: Resync storage '6605da9e4d597296b962f018'
2024-03-28T21:01:20Z D [server1/192.168.0.13:27017] [backup/2024-03-28T21:01:19Z] skip: lock not acquired
2024-03-28T21:01:20Z I [server0/192.168.0.14:27017] [backup/2024-03-28T21:01:19Z] backup started
2024-03-28T21:01:20Z I [server2/192.168.0.26:27017] [backup/2024-03-28T21:01:19Z] backup started
2024-03-28T21:01:20Z D [server-config/192.168.0.15:27017] [backup/2024-03-28T21:01:19Z] skip: lock not acquired
2024-03-28T21:01:20Z I [server1/192.168.0.18:27017] [backup/2024-03-28T21:01:19Z] backup started
2024-03-28T21:01:20Z D [server2/192.168.0.27:27017] [backup/2024-03-28T21:01:19Z] skip after nomination, probably started by another node
2024-03-28T21:01:20Z D [server1/192.168.0.19:27017] [backup/2024-03-28T21:01:19Z] skip after nomination, probably started by another node
2024-03-28T21:01:20Z D [server0/192.168.0.12:27017] [backup/2024-03-28T21:01:19Z] skip after nomination, probably started by another node
2024-03-28T21:01:24Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] nomination server1, set candidates [192.168.0.19:27017]
2024-03-28T21:01:24Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] nomination server-config, set candidates [192.168.0.17:27017]
2024-03-28T21:01:24Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] nomination server0, set candidates [192.168.0.12:27017]
2024-03-28T21:01:24Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] nomination server2, set candidates [192.168.0.27:27017]
2024-03-28T21:01:44Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] nomination timeout
2024-03-28T21:01:44Z D [server-config/192.168.0.17:27017] [backup/2024-03-28T21:01:19Z] skip after nomination, probably started by another node
2024-03-28T21:02:00Z D [server-config/192.168.0.16:27017] [backup/2024-03-28T21:01:19Z] nomination timeout
2024-03-28T21:02:00Z D [server-config/192.168.0.16:27017] [backup/2024-03-28T21:01:19Z] skip after nomination, probably started by another node
After a while, when trying to check the status of the backup, the pbm status command returned the error message: "Error: get job status: get backup information: not found." Restarting all agents in the cluster fixed the status display, but the backup was never made. Could you give some insight into what could be causing this behavior?
Hi @Illia_Reshetnykov
Regarding the retry below: there is no need to start another backup after only 60 seconds. Are you sure your backup will complete within 60 seconds? If not, the retry doesn't make sense.
We then ran the pbm backup command with a simple retry in case the first attempt fails:
pbm backup || (sleep 60; pbm backup)
As you mentioned that the agents failed again, can you share the pbm-agent logs along with your pbm config and version?
Also, from the pbm logs above we couldn't see any issue: it looks like another backup was in progress, and from the logs it seems it was running fine.
Before starting a backup, run pbm status to make sure all the agents are up and no other backup is in progress.
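That pre-flight check can be scripted. The following is only a sketch, assuming that "pbm status -s running" prints "(none)" when no operation is in progress; verify the exact output against your own cluster before relying on it.

```shell
# Poll PBM until no backup/resync operation is reported, giving up
# after max_tries attempts. The "(none)" marker is an assumption;
# check what `pbm status -s running` prints on your cluster.
wait_until_idle() {
  max_tries=${1:-30}
  tries=0
  while [ "$tries" -lt "$max_tries" ]; do
    if pbm status -s running 2>/dev/null | grep -q '(none)'; then
      return 0
    fi
    tries=$((tries + 1))
    sleep 10
  done
  return 1
}

# Usage (requires a live cluster):
#   wait_until_idle && pbm backup
```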
Thanks,
Mukesh
Hello, @Mukesh_Kumar
During testing, we found that the problem occurs when the pbm backup command is run immediately after pbm config. We believe the reason is that pbm config does not have time to complete the storage resync while pbm backup is trying to insert a new backup entry. Our workaround was to add a sleep 300 after pbm config. In two weeks of operation with this configuration, pbm-agent has not failed once.
Hi @Illia_Reshetnykov
No, it shouldn't be: we have already tested that scenario, and it should not cause this problem.
However, we would ask you to raise the issue described above in a separate topic with debug logs, the pbm version, etc.
We will look into it further.
Thanks,
Mukesh
Here’s a simple script that incorporates your solution:
# Configure pbm
pbm config --set <configuration-parameters>
# Wait for 300 seconds to allow pbm config to complete resync
sleep 300
# Run pbm backup
pbm backup --description "Scheduled backup after config resync"
This script sets a delay of 300 seconds (5 minutes) between the configuration and the backup command, which should mitigate the issue you were facing.
Hi @beredis
You don't need to run pbm config again and again. It only needs to be configured once; reconfigure only when changes are required.
Thanks,
Mukesh