MongoDB backup errors - Percona Backup

Getting the error below on MongoDB backups with PBM (Percona Backup for MongoDB).

Any methods to mitigate this?

!Failed to run PITR backup. Agent logs:
rs0: PITR backup didn’t started
rs0: 2021-02-01T… [ERROR] pitr: streaming oplog: undefinded behaviour operation is running

In addition, I’m getting something like:

Error starting backup: no confirmation that backup has successfully started. Replsets status:

Hi @deepaklo
Could you please share the output of the following? (Example invocations are sketched after the list.)

  1. pbm status
  2. pbm logs -t 0
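In case it helps, a minimal sketch of how to run them from any host with the pbm CLI installed; the URI is illustrative and should be adjusted to your setup:

  # connection string used by the pbm CLI (illustrative placeholder)
  export PBM_MONGODB_URI="mongodb://pbmuser:<password>@localhost:27017/"
  pbm status
  pbm logs -t 0    # -t 0 requests all stored log entries, not just the default recent tail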

Does “Error starting backup: …” pop up when you’re trying to run pbm backup?

I’m having the exact same issue.

pbm status yields:
Cluster:

========

mongo-prod-set:

  • mongo-prod-set/mongodb2-prod:27017: pbm-agent NOT FOUND FAILED status:

  • mongo-prod-set/mongodb1-prod:27017: pbm-agent NOT FOUND FAILED status:

  • mongo-prod-set/mongodb3-prod:27017: pbm-agent NOT FOUND FAILED status:

PITR incremental backup:

========================

Status [ON]

Currently running:

==================

(none)

pbm logs -t 0 doesn’t retrieve anything.

Also of note, I’m still running v1.3, as there is a dependency conflict with the percona-release repo that’s preventing me from upgrading.
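For what it’s worth, once the repo conflict is sorted out, the usual upgrade path through percona-release on a yum-based box looks roughly like this (a sketch, assuming Percona’s standard repo and package names):

  # enable the PBM repository and pull the newer package
  sudo percona-release enable pbm release
  sudo yum clean all
  sudo yum update percona-backup-mongodb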

Disregard. I got it working again by following these steps (rough command equivalents are sketched after the list):

  1. Restarted the agent on all of the servers
  2. Disabled PITR
  3. Executed a pbm backup
  4. Re-enabled PITR
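For anyone landing here, a rough sketch of the equivalent commands, assuming systemd-managed agents and PITR toggled via pbm config:

  # 1. restart the agent on every replica set member
  sudo systemctl restart pbm-agent
  # 2. disable point-in-time recovery
  pbm config --set pitr.enabled=false
  # 3. take a fresh base snapshot
  pbm backup
  # 4. re-enable point-in-time recovery
  pbm config --set pitr.enabled=true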

Also, I discovered that pbm updates fine on my Amazon Linux 2 boxes, just not on the old Amazon Linux 2018.03 box that I use for executing the backups.

@TimSandberg
Yes, both pbm status and pbm logs were introduced in v1.4, so older-version agents are unable to communicate properly with the CLI for those commands.
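A quick way to confirm what each side is running (a sketch; both subcommands should be available on recent builds):

  pbm version         # CLI build, as in the “Version details” output further down
  pbm-agent version   # agent build, run on each node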

@Andrew_Pogrebnoi

Sorry for the delayed response.

I just got the error once again, and the backups have not been running since. Error details below:

Backup snapshots:
2021-03-20T06:00:01Z
PITR :
2021-03-20T06:49:20 - 2021-03-20T07:50:06

!Failed to run PITR backup. Agent logs:
rs0: PITR backup didn’t started
rs0: 2021-03-20T08:01:09.000+0000 [ERROR] pitr: streaming oplog: undefinded behaviour operation is running

The version we use is shown below, so in order to use the pbm status or pbm logs commands, I guess I need to upgrade to v1.4. All right, I shall plan and do the version upgrade.

Here are the logs from the pbm-agent across the nodes:

Apr 15 04:00:02 myservername02 pbm-agent[96142]: 2021-04-15T04:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T04:00:01Z, compression: s2] <ts: 1618459201>
Apr 15 06:00:02 myservername02 pbm-agent[96142]: 2021-04-15T06:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T06:00:01Z, compression: s2] <ts: 1618466401>
Apr 15 06:00:02 myservername02 pbm-agent[96142]: 2021-04-15T06:00:02.000+0000 [ERROR] backup/2021-04-15T06:00:01Z: ensure no tmp collections: drop tmp roles collection pbmRRoles: (NotMaster) not master
Apr 15 07:00:01 myservername02 pbm-agent[96142]: 2021-04-15T07:00:01.000+0000 [INFO] got command delete <ts: 1618470001>
Apr 15 07:00:01 myservername02 pbm-agent[96142]: 2021-04-15T07:00:01.000+0000 [INFO] delete/2021-04-15T00:00:00Z: deleting backups older than 2021-04-15 00:00:00 +0000 UTC
Apr 15 07:00:01 myservername02 pbm-agent[96142]: 2021/04/15 07:00:01 Info: deleting 2021-03-20T06:00:01Z: unable to delete the last backup while PITR is on
Apr 15 07:00:06 myservername02 pbm-agent[96142]: 2021-04-15T07:00:06.000+0000 [INFO] delete/2021-04-15T00:00:00Z: done
Apr 15 08:00:02 myservername02 pbm-agent[96142]: 2021-04-15T08:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T08:00:01Z, compression: s2] <ts: 1618473601>
Apr 15 08:00:02 myservername02 pbm-agent[96142]: 2021-04-15T08:00:02.000+0000 [ERROR] backup/2021-04-15T08:00:01Z: ensure no tmp collections: drop tmp roles collection pbmRRoles: (NotMaster) not master
Apr 15 10:00:01 myservername02 pbm-agent[96142]: 2021-04-15T10:00:01.000+0000 [INFO] got command backup [name: 2021-04-15T10:00:01Z, compression: s2] <ts: 1618480801>

Apr 15 00:00:02 myservername01 pbm-agent[22116]: 2021-04-15T00:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T00:00:02Z, compression: s2] <ts: 1618444802>
Apr 15 02:00:02 myservername01 pbm-agent[22116]: 2021-04-15T02:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T02:00:01Z, compression: s2] <ts: 1618452001>
Apr 15 02:00:02 myservername01 pbm-agent[22116]: 2021-04-15T02:00:02.000+0000 [ERROR] backup/2021-04-15T02:00:01Z: ensure no tmp collections: drop tmp roles collection pbmRRoles: (NotMaster) not master
Apr 15 04:00:01 myservername01 pbm-agent[22116]: 2021-04-15T04:00:01.000+0000 [INFO] got command backup [name: 2021-04-15T04:00:01Z, compression: s2] <ts: 1618459201>
Apr 15 06:00:02 myservername01 pbm-agent[22116]: 2021-04-15T06:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T06:00:01Z, compression: s2] <ts: 1618466401>
Apr 15 07:00:01 myservername01 pbm-agent[22116]: 2021-04-15T07:00:01.000+0000 [INFO] got command delete <ts: 1618470001>
Apr 15 07:00:01 myservername01 pbm-agent[22116]: 2021-04-15T07:00:01.000+0000 [INFO] delete: scheduled to another node
Apr 15 08:00:02 myservername01 pbm-agent[22116]: 2021-04-15T08:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T08:00:01Z, compression: s2] <ts: 1618473601>
Apr 15 10:00:01 myservername01 pbm-agent[22116]: 2021-04-15T10:00:01.000+0000 [INFO] got command backup [name: 2021-04-15T10:00:01Z, compression: s2] <ts: 1618480801>
Apr 15 10:00:01 myservername01 pbm-agent[22116]: 2021-04-15T10:00:01.000+0000 [ERROR] backup/2021-04-15T10:00:01Z: ensure no tmp collections: drop tmp roles collection pbmRRoles: (NotMaster) not master

Apr 14 22:00:02 myservername03 pbm-agent[15603]: 2021-04-14T22:00:02.000+0000 [INFO] got command backup [name: 2021-04-14T22:00:01Z, compression: s2] <ts: 1618437601>
Apr 15 00:00:02 myservername03 pbm-agent[15603]: 2021-04-15T00:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T00:00:02Z, compression: s2] <ts: 1618444802>
Apr 15 02:00:02 myservername03 pbm-agent[15603]: 2021-04-15T02:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T02:00:01Z, compression: s2] <ts: 1618452001>
Apr 15 04:00:02 myservername03 pbm-agent[15603]: 2021-04-15T04:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T04:00:01Z, compression: s2] <ts: 1618459201>
Apr 15 04:00:02 myservername03 pbm-agent[15603]: 2021-04-15T04:00:02.000+0000 [ERROR] backup/2021-04-15T04:00:01Z: ensure no tmp collections: drop tmp roles collection pbmRRoles: (NotMaster) not master
Apr 15 06:00:02 myservername03 pbm-agent[15603]: 2021-04-15T06:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T06:00:01Z, compression: s2] <ts: 1618466401>
Apr 15 07:00:01 myservername03 pbm-agent[15603]: 2021-04-15T07:00:01.000+0000 [INFO] got command delete <ts: 1618470001>
Apr 15 07:00:01 myservername03 pbm-agent[15603]: 2021-04-15T07:00:01.000+0000 [INFO] delete: scheduled to another node
Apr 15 08:00:02 myservername03 pbm-agent[15603]: 2021-04-15T08:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T08:00:01Z, compression: s2] <ts: 1618473601>
Apr 15 10:00:02 myservername03 pbm-agent[15603]: 2021-04-15T10:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T10:00:01Z, compression: s2] <ts: 1618480801>

Version details

Version: 1.3.4
Platform: linux/amd64
GitCommit: 2789dbc4973d52e0c4ee3963cbe70222f192b463
GitBranch: release-1.3.4
BuildTime: 2020-11-17_15:46_UTC
GoVersion: go1.14.2

Adding additional details which may help you review and assist with the issue.

I use a script that runs pbm backup as a cron job every 2 hours.
PITR is set to ON, but instead of shared storage I am using local storage on all 3 servers, storing the files on each under the same storage mount point name.
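For context, local (filesystem) storage in PBM is configured roughly like this; the path below matches the FS line in the pbm status output further down, everything else is a sketch:

  # pbm-config.yaml
  storage:
    type: filesystem
    filesystem:
      path: /mongo-backup/db_dumpfiles

  # apply it with:
  pbm config --file pbm-config.yaml

Note that with type: filesystem each node writes only to its own local path, so unless that path is genuinely shared, the pieces of a backup end up scattered across servers; that is essentially why PBM expects remote or shared storage.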

For cleanup I use pbm delete-backup -f --older-than, but this runs only on the server where the cron job executes.
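A rough sketch of what those two cron entries might look like (the schedule, paths, and retention window are assumptions; note that % must be escaped in crontab):

  # /etc/cron.d/pbm -- illustrative sketch; the URI is a placeholder
  PBM_MONGODB_URI=mongodb://pbmuser:<password>@localhost:27017/
  # full snapshot every 2 hours
  0 */2 * * * root /usr/bin/pbm backup
  # daily cleanup of snapshots older than a computed cutoff date
  0 7 * * * root /usr/bin/pbm delete-backup -f --older-than "$(date -d '-7 days' +\%Y-\%m-\%d)"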

I have another cron job on all 3 servers to check for and manually clean up the files that were deleted as part of delete-backup, just to reclaim disk space.

We are planning to change to NFS/shared storage in the near future.
Let me know if you need anything additional.

@TimSandberg - Many thanks for sharing the steps you took to resolve the issue. It looks like they’re the same steps I used to fix it last time, but the problem occurred again, so I’m looking to fix it for good rather than keep repeating the workaround.

pbm-agent NOT FOUND FAILED status

Further to this, I have upgraded the tool to 1.4.1, and pbm status now shows the above for one node.

Checking on it.

It is just telling me that the agent on the 3rd node is down. I’m checking on it, but the agent appears to be running when I check with systemctl status pbm-agent.
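In case it helps others, these are the checks I’d run on the node reported as NOT FOUND; the sysconfig path is the RHEL-style default and is an assumption for other layouts:

  # on myservername03:
  systemctl status pbm-agent       # service state
  journalctl -u pbm-agent -n 50    # recent agent log output
  # confirm the agent's connection URI points at the local mongod
  cat /etc/sysconfig/pbm-agent     # PBM_MONGODB_URI=... (Debian-based systems use /etc/default/pbm-agent)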

[root@myserver ~]# pbm status

Cluster:

rs0:

  • rs0/myservername01.int.tac.local:27017: pbm-agent v1.4.1 OK
  • rs0/myservername02:27017: pbm-agent v1.4.1 OK
  • rs0/myservername03:27017: pbm-agent NOT FOUND FAILED status:

PITR incremental backup:

Status [OFF]

Currently running:

Snapshot backup “2021-04-15T14:00:01Z”, started at 2021-04-15T14:00:18Z. Status: snapshot backup. [op id: 607846e179fc20e728668c04]

Backups:

FS /mongo-backup/db_dumpfiles/
Snapshots:
2021-04-15T14:00:01Z 0.00B [running: running / 2021-04-15T14:00:20]