Backup Error: get: context deadline exceeded

Hello, I’ve got some problem.

my cluster is made up of 3 nodes (MongoDB 7.0.8 Community - PSA topology) in replset.
I installed pbm-agent (v2.5.0) on the two data nodes.

I also have a dump server that exports a path to the data nodes via NFS, on this I installed and configured the pbm-cli (same version of the agents).
I wrote a simple bash script to perform logical backups and put it in crontab.

Everything seems to work fine but sometimes, randomly, the backup fails with this error: “Error: get backup metadata: get: context deadline exceeded”.

this is my pbm_config.yaml:

pitr:
enabled: false
oplogSpanMin: 10
oplogOnly: false
compression: s2
storage:
type: filesystem
filesystem:
path: /mongodb/backup
backups:
compression: s2
timeouts:
startingStatus: 60

I only added later the timeout.startingStatus part (which should concern physical backups) in order to try to solve the problem without success.
What am I doing wrong?

Thank you for your help, cheers

Giuseppe

@uff.DBA

“Error: get backup metadata: get: context deadline exceeded” .

Can you please gather a few details from your PBM environment to review ?

shell> pbm logs
shell> journalctl -u pbm-agent.service
shell> pbm describe-backup backup_name ### the concerned backup

startingStatus: 60

This should be fine. PBM will wait for this much time to start physical backups on all shards. Increasing this value helps when it takes longer than usual to open the **$backupCursor**.

Do you have the kernel logs [/var/log/messages or dMesg etc] for a review ?

Can you also please share the PBM backup/status information ?

shell> pbm status
shell> pbm list

Was the backend MongoDB/Mongos nodes running fine around the period and no such networking or other fluctuations oberved ?

Do you have any logging of that script where we can check the complete logs ? Can you please share the same along with the backup command details ?

Regards,
Anil

Hi Anil,
thank you for your kind answer.
I will try to provide you with all the details you requested.

First of all I am trying to do a logical backup, so if I correctly understood how the “startingStatus: 60” configuration works this should be irrelevant because it only applies in the case of physical backups, can you confirm this?

I checked /var/log/messages and only on one of the two data nodes where the pbm-agent is running do I find this:

pbm-agent[3059366]: 2024-07-15T23:00:02.000+0200 I got command cleanup <ts: 1721077201>
pbm-agent[3059366]: 2024-07-15T23:00:02.000+0200 I got epoch {1721077201 1}
pbm-agent[3059366]: 2024-07-15T23:00:02.000+0200 D [cleanup] skip: lock not acquired
pbm-agent[3059366]: 2024-07-15T23:00:02.000+0200 I got epoch {1721077201 1}
pbm-agent[3059366]: 2024-07-15T23:00:02.000+0200 D [cleanup] skip: lock not acquired
pbm-agent[3059366]: 2024-07-15T23:00:02.000+0200 I got command backup [name: 2024-07-15T21:00:01Z, compression: s2 (level: default)] <ts: 1721077201>
pbm-agent[3059366]: 2024-07-15T23:00:02.000+0200 I got command backup [name: 2024-07-15T21:00:01Z, compression: s2 (level: default)] <ts: 1721077201>
pbm-agent[3059366]: 2024-07-15T23:00:02.000+0200 I got epoch {1721077201 1}
pbm-agent[3059366]: 2024-07-15T23:00:02.000+0200 I got epoch {1721077201 1}
pbm-agent[3059366]: 2024-07-15T23:00:02.000+0200 D [backup/2024-07-15T21:00:01Z] init backup meta
pbm-agent[3059366]: 2024-07-15T23:00:02.000+0200 D [backup/2024-07-15T21:00:01Z] init backup meta
pbm-agent[3059366]: 2024-07-15T23:00:02.000+0200 D [backup/2024-07-15T21:00:01Z] nomination list for MongoDBReplset1: [[data_node1:27101] [data_node2:27102]]
pbm-agent[3059366]: 2024-07-15T23:00:02.000+0200 D [backup/2024-07-15T21:00:01Z] nomination list for MongoDBReplset1: [[data_node1:27101] [data_node2:27102]]
pbm-agent[3059366]: 2024-07-15T23:00:02.000+0200 D [backup/2024-07-15T21:00:01Z] nomination MongoDBReplset1, set candidates [data_node1:27101]
pbm-agent[3059366]: 2024-07-15T23:00:02.000+0200 D [backup/2024-07-15T21:00:01Z] nomination MongoDBReplset1, set candidates [data_node1:27101]
pbm-agent[3059366]: 2024-07-15T23:00:07.000+0200 D [backup/2024-07-15T21:00:01Z] nomination MongoDBReplset1, set candidates [data_node2:27102]
pbm-agent[3059366]: 2024-07-15T23:00:07.000+0200 D [backup/2024-07-15T21:00:01Z] nomination MongoDBReplset1, set candidates [data_node2:27102]
pbm-agent[3059366]: 2024-07-15T23:00:27.000+0200 D [backup/2024-07-15T21:00:01Z] nomination timeout
pbm-agent[3059366]: 2024-07-15T23:00:27.000+0200 D [backup/2024-07-15T21:00:01Z] nomination timeout
pbm-agent[3059366]: 2024-07-15T23:00:27.000+0200 D [backup/2024-07-15T21:00:01Z] skip after nomination, probably started by another node
pbm-agent[3059366]: 2024-07-15T23:00:27.000+0200 D [backup/2024-07-15T21:00:01Z] skip after nomination, probably started by another node

it seems to be trying to find a candidate but failing. Mongo logs contain nothing relevant.
I paste below the other required logs:

pbm status

Cluster:
========
MongoDBReplset1:
  - MongoDBReplset1/data_node1:27101 [S]: pbm-agent v2.5.0 OK
  - MongoDBReplset1/data_node2:27102 [P]: pbm-agent v2.5.0 OK
  - MongoDBReplset1/arbiter_node:27110 [!Arbiter]: arbiter node is not supported


PITR incremental backup:
========================
Status [OFF]

Currently running:
==================
(none)

Backups:
========
FS  /dump/mongo_backup
  (none)

pbm list

Backup snapshots:

pbm logs

2024-07-13T21:00:02Z I [MongoDBReplset1/data_node2:27102] got epoch {1720904401 1}
2024-07-13T21:00:02Z I [MongoDBReplset1/data_node1:27101] got command backup [name: 2024-07-13T21:00:01Z, compression: s2 (level: default)] <ts: 1720904401>
2024-07-13T21:00:02Z I [MongoDBReplset1/data_node1:27101] got epoch {1720904401 1}
2024-07-14T21:00:02Z I [MongoDBReplset1/data_node1:27101] got command backup [name: 2024-07-14T21:00:02Z, compression: s2 (level: default)] <ts: 1720990802>
2024-07-14T21:00:02Z I [MongoDBReplset1/data_node1:27101] got epoch {1720990801 2}
2024-07-14T21:00:02Z I [MongoDBReplset1/data_node2:27102] got command backup [name: 2024-07-14T21:00:02Z, compression: s2 (level: default)] <ts: 1720990802>
2024-07-14T21:00:02Z I [MongoDBReplset1/data_node2:27102] got epoch {1720990801 2}
2024-07-14T21:00:03Z I [MongoDBReplset1/data_node1:27101] [backup/2024-07-14T21:00:02Z] backup started
2024-07-14T21:05:21Z I [MongoDBReplset1/data_node1:27101] [backup/2024-07-14T21:00:02Z] mongodump finished, waiting for the oplog
2024-07-14T21:05:26Z I [MongoDBReplset1/data_node1:27101] [backup/2024-07-14T21:00:02Z] created chunk 2024-07-14T21:00:03 - 2024-07-14T21:05:23
2024-07-14T21:05:26Z I [MongoDBReplset1/data_node1:27101] [backup/2024-07-14T21:00:02Z] dropping tmp collections
2024-07-14T21:05:28Z I [MongoDBReplset1/data_node1:27101] [backup/2024-07-14T21:00:02Z] backup finished
2024-07-15T21:00:02Z I [MongoDBReplset1/data_node1:27101] got command cleanup <ts: 1721077201>
2024-07-15T21:00:02Z I [MongoDBReplset1/data_node1:27101] got epoch {1721077201 1}
2024-07-15T21:00:02Z I [MongoDBReplset1/data_node2:27102] got command cleanup <ts: 1721077201>
2024-07-15T21:00:02Z I [MongoDBReplset1/data_node2:27102] got epoch {1721077201 1}
2024-07-15T21:00:02Z I [MongoDBReplset1/data_node2:27102] got command backup [name: 2024-07-15T21:00:01Z, compression: s2 (level: default)] <ts: 1721077201>
2024-07-15T21:00:02Z I [MongoDBReplset1/data_node2:27102] got epoch {1721077201 1}
2024-07-15T21:00:02Z I [MongoDBReplset1/data_node1:27101] got command backup [name: 2024-07-15T21:00:01Z, compression: s2 (level: default)] <ts: 1721077201>
2024-07-15T21:00:02Z I [MongoDBReplset1/data_node1:27101] got epoch {1721077201 1}

journalctl -u pbm-agent.service

Jul 15 23:00:02 srvmongodata1 pbm-agent[1315]: 2024-07-15T23:00:02.000+0200 I got command cleanup <ts: 1721077201>
Jul 15 23:00:02 srvmongodata1 pbm-agent[1315]: 2024-07-15T23:00:02.000+0200 I got epoch {1721077201 1}
Jul 15 23:00:02 srvmongodata1 pbm-agent[1315]: 2024-07-15T23:00:02.000+0200 D [cleanup] got physical restores list: 0
Jul 15 23:00:02 srvmongodata1 pbm-agent[1315]: 2024-07-15T23:00:02.000+0200 D [cleanup] got backups list: 0
Jul 15 23:00:02 srvmongodata1 pbm-agent[1315]: 2024-07-15T23:00:02.000+0200 I got command backup [name: 2024-07-15T21:00:01Z, compression: s2 (level: default)] <ts: 1721077201>
Jul 15 23:00:02 srvmongodata1 pbm-agent[1315]: 2024-07-15T23:00:02.000+0200 I got epoch {1721077201 1}
Jul 15 23:00:17 srvmongodata1 pbm-agent[1315]: 2024-07-15T23:00:17.000+0200 D [backup/2024-07-15T21:00:01Z] nomination timeout
Jul 15 23:00:17 srvmongodata1 pbm-agent[1315]: 2024-07-15T23:00:17.000+0200 D [backup/2024-07-15T21:00:01Z] skip after nomination, probably started by another node

Jul 15 23:00:02 srvmongodata2 pbm-agent[3059366]: 2024-07-15T23:00:02.000+0200 I got command cleanup <ts: 1721077201>
Jul 15 23:00:02 srvmongodata2 pbm-agent[3059366]: 2024-07-15T23:00:02.000+0200 I got epoch {1721077201 1}
Jul 15 23:00:02 srvmongodata2 pbm-agent[3059366]: 2024-07-15T23:00:02.000+0200 D [cleanup] skip: lock not acquired
Jul 15 23:00:02 srvmongodata2 pbm-agent[3059366]: 2024-07-15T23:00:02.000+0200 I got command backup [name: 2024-07-15T21:00:01Z, compression: s2 (level: default)] <ts: 1721077201>
Jul 15 23:00:02 srvmongodata2 pbm-agent[3059366]: 2024-07-15T23:00:02.000+0200 I got epoch {1721077201 1}
Jul 15 23:00:02 srvmongodata2 pbm-agent[3059366]: 2024-07-15T23:00:02.000+0200 D [backup/2024-07-15T21:00:01Z] init backup meta
Jul 15 23:00:02 srvmongodata2 pbm-agent[3059366]: 2024-07-15T23:00:02.000+0200 D [backup/2024-07-15T21:00:01Z] nomination list for MongoDBReplset1: [[data_node1:27101] [data_node2:27102>
Jul 15 23:00:02 srvmongodata2 pbm-agent[3059366]: 2024-07-15T23:00:02.000+0200 D [backup/2024-07-15T21:00:01Z] nomination MongoDBReplset1, set candidates [data_node1:27101]
Jul 15 23:00:07 srvmongodata2 pbm-agent[3059366]: 2024-07-15T23:00:07.000+0200 D [backup/2024-07-15T21:00:01Z] nomination MongoDBReplset1, set candidates [data_node2:27102]
Jul 15 23:00:27 srvmongodata2 pbm-agent[3059366]: 2024-07-15T23:00:27.000+0200 D [backup/2024-07-15T21:00:01Z] nomination timeout
Jul 15 23:00:27 srvmongodata2 pbm-agent[3059366]: 2024-07-15T23:00:27.000+0200 D [backup/2024-07-15T21:00:01Z] skip after nomination, probably started by another node

I’m not sure how pbm decides which node is the best candidate to perform the backup, does it use mongo’s election mechanism? Is it possible that I also have to install the pbm-agent on the arbiter node otherwise it won’t be able to nominate a candidate?

Thank you again for your patience and help
Regards,

Giuseppe

@uff.DBA

Thanks for sharing the details,

Jul 15 23:00:17 srvmongodata1 pbm-agent[1315]: 2024-07-15T23:00:17.000+0200 D [backup/2024-07-15T21:00:01Z] nomination timeout
Jul 15 23:00:17 srvmongodata1 pbm-agent[1315]: 2024-07-15T23:00:17.000+0200 D [backup/2024-07-15T21:00:01Z] skip after nomination, probably started by another node

In one of reported bugs - [PBM-1316] - Percona JIRA I observed some similar error resemblance “nomination timeout” . The affected version is not the same, its v2.4.1 however still there are chances it could occurrs.

By any chance are you deleting some old PBM backups when your backup is running ? Did you check if nothing wrong with the NFS, any detachment etc around that time ?

pbm-agent[3059366]: 2024-07-15T23:00:02.000+0200 D [cleanup] skip: lock not acquired
Jul 15 23:00:02 srvmongodata2 pbm-agent[3059366]: 2024-07-15T23:00:02.000+0200 I got command cleanup <ts: 1721077201>

From the above logs it appears some cleaning activity was running ? Can you please confirm that ?

Can you please share the kernel logs [/var/log/messages OR Dmesg] in order to check any OS side thingy as well ?

I’m not sure how pbm decides which node is the best candidate to perform the backup, does it use mongo’s election mechanism?

Basically PBM backup, by default, will use the secondary nodes for backup based on random election criteria, and in case no secondaries respond, then the backup will be initiated on the primary.

You can control the election behavior by defining a priority for Mongo nodes in the configuration file [/etc/pbm_config.yaml].

backup:
  priority:
    localhost:27019: 2.5
    localhost:28021: 2.5

If several nodes have the same priority, the one among them is randomly elected to make a backup.

Is it possible that I also have to install the pbm-agent on the arbiter node otherwise it won’t be able to nominate a candidate?

This should not be a problem. The arbiter node doesn’t contain any data bearing node. Moreover , as I mentioned above you can also control the backup node priority via PBM configuration directly.

Bye the way, how frequently this issue is triggering and exactly when did you start observing the issue first ?

Did you check if Mongo data nodes were fine around the issue period and there is no failover/switchover happen there. Still, do you have the Mongo logs containing the issue timelines to verify if no issues there ?

Hi Anil,
thanks again for your advices. I think you hit the nail on the head.

the error was generated by the presence (parallel?) of a deletion task and a backup task.

let me explain better: in my bash script before proceeding with the cluster backup I went to look for the old backups to delete them with the cleanup command.
then I cyclically executed the status command and used a regular expression to go and verify that in the command output there was nothing “running”. only then I proceed with the backup. in this step something went wrong.

I split the deletion phase from the backup phase and everything seems to be ok.

it would be nice to have better tools to query the activity of the agents without evaluating the output of the “status” command as I did