MongoDB backup errors - Percona Backup for MongoDB (PBM)

Getting the error below on a MongoDB backup with PBM.

Any methods to mitigate it?

!Failed to run PITR backup. Agent logs:
rs0: PITR backup didn’t started
rs0: 2021-02-01T… [ERROR] pitr: streaming oplog: undefined behaviour operation is running

In addition, I am getting:

Error starting backup: no confirmation that backup has successfully started. Replsets status:


Hi @deepaklo
Can you please share the output of:

  1. pbm status
  2. pbm logs -t 0

Does “Error starting backup: …” pop up when you try to run pbm backup?


I’m having the exact same issue.

pbm status yields:
Cluster:

========

mongo-prod-set:

  • mongo-prod-set/mongodb2-prod:27017: pbm-agent NOT FOUND FAILED status:

  • mongo-prod-set/mongodb1-prod:27017: pbm-agent NOT FOUND FAILED status:

  • mongo-prod-set/mongodb3-prod:27017: pbm-agent NOT FOUND FAILED status:

PITR incremental backup:

========================

Status [ON]

Currently running:

==================

(none)

pbm logs -t 0 doesn’t return anything.

Also of note, I’m still running v1.3, as there is a dependency conflict with the percona-release repo that’s preventing me from upgrading.


Disregard. I got it working again by following these steps:

  1. Restarted the agent on all of the servers
  2. Disabled PITR
  3. Executed a pbm backup
  4. Re-enabled PITR
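The four steps above can be sketched as a shell function. The host names are taken from the pbm status output earlier in the thread; the ssh loop is an assumption (restart the agent on each member however you normally manage your hosts), and pbm must be in $PATH:

```shell
#!/usr/bin/env bash
# Sketch of the recovery sequence above -- adapt host names and the
# restart mechanism to your environment.

recover_pbm() {
  # 1. Restart the agent on all replica-set members (ssh loop is a placeholder)
  for host in mongodb1-prod mongodb2-prod mongodb3-prod; do
    ssh "$host" sudo systemctl restart pbm-agent
  done

  # 2. Disable PITR so the oplog slicer releases its lock
  pbm config --set pitr.enabled=false

  # 3. Take a fresh snapshot backup
  pbm backup

  # 4. Re-enable PITR once the snapshot is underway
  pbm config --set pitr.enabled=true
}
```

Running `recover_pbm` from the node where the pbm CLI is installed mirrors steps 1–4.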

Also, I discovered that pbm updates fine on my Amazon Linux 2 boxes, just not on the old Amazon Linux 2018.03 box that I use for executing the backups.


@TimSandberg
Yes, both pbm status and pbm logs were introduced in v1.4, so older-version agents are unable to communicate properly with the CLI for those commands.


@Andrew_Pogrebnoi

Sorry for the delayed response.

I just got the error once again, and backups are not running after that. Error details below:

Backup snapshots:
2021-03-20T06:00:01Z
PITR :
2021-03-20T06:49:20 - 2021-03-20T07:50:06

!Failed to run PITR backup. Agent logs:
rs0: PITR backup didn’t started
rs0: 2021-03-20T08:01:09.000+0000 [ERROR] pitr: streaming oplog: undefined behaviour operation is running

The version we use is below, so in order to use the pbm status or pbm logs commands, I guess I need to upgrade to 1.4. Alright, I shall plan and do the version upgrade.
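For reference, a sketch of the upgrade on a yum-based host. The repo and package names assume the standard Percona tooling (percona-release, percona-backup-mongodb); check the PBM install docs for your distribution before running this:

```shell
#!/usr/bin/env bash
# Hedged sketch of a 1.3 -> 1.4 upgrade, assuming yum and percona-release.

upgrade_pbm() {
  sudo percona-release enable pbm release    # enable the PBM repository
  sudo yum -y update percona-backup-mongodb  # pull the new agent + CLI
  sudo systemctl restart pbm-agent           # restart so the new agent registers
  pbm version                                # confirm the CLI version afterwards
}
```

Run the same upgrade on every node so agents and CLI stay on matching versions.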

Here are the logs from the pbm-agent across the nodes:

Apr 15 04:00:02 myservername02 pbm-agent[96142]: 2021-04-15T04:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T04:00:01Z, compression: s2] <ts: 1618459201>
Apr 15 06:00:02 myservername02 pbm-agent[96142]: 2021-04-15T06:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T06:00:01Z, compression: s2] <ts: 1618466401>
Apr 15 06:00:02 myservername02 pbm-agent[96142]: 2021-04-15T06:00:02.000+0000 [ERROR] backup/2021-04-15T06:00:01Z: ensure no tmp collections: drop tmp roles collection pbmRRoles: (NotMaster) not master
Apr 15 07:00:01 myservername02 pbm-agent[96142]: 2021-04-15T07:00:01.000+0000 [INFO] got command delete <ts: 1618470001>
Apr 15 07:00:01 myservername02 pbm-agent[96142]: 2021-04-15T07:00:01.000+0000 [INFO] delete/2021-04-15T00:00:00Z: deleting backups older than 2021-04-15 00:00:00 +0000 UTC
Apr 15 07:00:01 myservername02 pbm-agent[96142]: 2021/04/15 07:00:01 Info: deleting 2021-03-20T06:00:01Z: unable to delete the last backup while PITR is on
Apr 15 07:00:06 myservername02 pbm-agent[96142]: 2021-04-15T07:00:06.000+0000 [INFO] delete/2021-04-15T00:00:00Z: done
Apr 15 08:00:02 myservername02 pbm-agent[96142]: 2021-04-15T08:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T08:00:01Z, compression: s2] <ts: 1618473601>
Apr 15 08:00:02 myservername02 pbm-agent[96142]: 2021-04-15T08:00:02.000+0000 [ERROR] backup/2021-04-15T08:00:01Z: ensure no tmp collections: drop tmp roles collection pbmRRoles: (NotMaster) not master
Apr 15 10:00:01 myservername02 pbm-agent[96142]: 2021-04-15T10:00:01.000+0000 [INFO] got command backup [name: 2021-04-15T10:00:01Z, compression: s2] <ts: 1618480801>

Apr 15 00:00:02 myservername01 pbm-agent[22116]: 2021-04-15T00:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T00:00:02Z, compression: s2] <ts: 1618444802>
Apr 15 02:00:02 myservername01 pbm-agent[22116]: 2021-04-15T02:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T02:00:01Z, compression: s2] <ts: 1618452001>
Apr 15 02:00:02 myservername01 pbm-agent[22116]: 2021-04-15T02:00:02.000+0000 [ERROR] backup/2021-04-15T02:00:01Z: ensure no tmp collections: drop tmp roles collection pbmRRoles: (NotMaster) not master
Apr 15 04:00:01 myservername01 pbm-agent[22116]: 2021-04-15T04:00:01.000+0000 [INFO] got command backup [name: 2021-04-15T04:00:01Z, compression: s2] <ts: 1618459201>
Apr 15 06:00:02 myservername01 pbm-agent[22116]: 2021-04-15T06:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T06:00:01Z, compression: s2] <ts: 1618466401>
Apr 15 07:00:01 myservername01 pbm-agent[22116]: 2021-04-15T07:00:01.000+0000 [INFO] got command delete <ts: 1618470001>
Apr 15 07:00:01 myservername01 pbm-agent[22116]: 2021-04-15T07:00:01.000+0000 [INFO] delete: scheduled to another node
Apr 15 08:00:02 myservername01 pbm-agent[22116]: 2021-04-15T08:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T08:00:01Z, compression: s2] <ts: 1618473601>
Apr 15 10:00:01 myservername01 pbm-agent[22116]: 2021-04-15T10:00:01.000+0000 [INFO] got command backup [name: 2021-04-15T10:00:01Z, compression: s2] <ts: 1618480801>
Apr 15 10:00:01 myservername01 pbm-agent[22116]: 2021-04-15T10:00:01.000+0000 [ERROR] backup/2021-04-15T10:00:01Z: ensure no tmp collections: drop tmp roles collection pbmRRoles: (NotMaster) not master

Apr 14 22:00:02 myservername03 pbm-agent[15603]: 2021-04-14T22:00:02.000+0000 [INFO] got command backup [name: 2021-04-14T22:00:01Z, compression: s2] <ts: 1618437601>
Apr 15 00:00:02 myservername03 pbm-agent[15603]: 2021-04-15T00:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T00:00:02Z, compression: s2] <ts: 1618444802>
Apr 15 02:00:02 myservername03 pbm-agent[15603]: 2021-04-15T02:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T02:00:01Z, compression: s2] <ts: 1618452001>
Apr 15 04:00:02 myservername03 pbm-agent[15603]: 2021-04-15T04:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T04:00:01Z, compression: s2] <ts: 1618459201>
Apr 15 04:00:02 myservername03 pbm-agent[15603]: 2021-04-15T04:00:02.000+0000 [ERROR] backup/2021-04-15T04:00:01Z: ensure no tmp collections: drop tmp roles collection pbmRRoles: (NotMaster) not master
Apr 15 06:00:02 myservername03 pbm-agent[15603]: 2021-04-15T06:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T06:00:01Z, compression: s2] <ts: 1618466401>
Apr 15 07:00:01 myservername03 pbm-agent[15603]: 2021-04-15T07:00:01.000+0000 [INFO] got command delete <ts: 1618470001>
Apr 15 07:00:01 myservername03 pbm-agent[15603]: 2021-04-15T07:00:01.000+0000 [INFO] delete: scheduled to another node
Apr 15 08:00:02 myservername03 pbm-agent[15603]: 2021-04-15T08:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T08:00:01Z, compression: s2] <ts: 1618473601>
Apr 15 10:00:02 myservername03 pbm-agent[15603]: 2021-04-15T10:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T10:00:01Z, compression: s2] <ts: 1618480801>

Version details

Version: 1.3.4
Platform: linux/amd64
GitCommit: 2789dbc4973d52e0c4ee3963cbe70222f192b463
GitBranch: release-1.3.4
BuildTime: 2020-11-17_15:46_UTC
GoVersion: go1.14.2

Adding additional details that may help you review the issue.

I use a script that runs pbm backup as a cron job every 2 hours.
PITR is set to ON, but instead of shared storage I am using local storage on all 3 servers, storing the files under the same storage mountpoint name on each.

For cleanup I use pbm delete-backup -f --older-than, but this runs only on the server where the cron job runs.

I have another cron job on all 3 servers that checks for and manually removes the files deleted as part of delete-backup, just to reclaim disk space.
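The schedule described above could look roughly like this in a crontab. The times match the agent logs (backups on even hours, the delete command at 07:00); the retention window, binary path, and log files are placeholders, not taken from the thread:

```shell
# Snapshot backup every 2 hours (run on one node only)
0 */2 * * * /usr/bin/pbm backup >> /var/log/pbm-backup.log 2>&1

# Daily cleanup of backups older than a placeholder 7-day window.
# Note: '%' is special in crontab entries and must be escaped as '\%'.
0 7 * * * /usr/bin/pbm delete-backup -f --older-than "$(date -d '7 days ago' +\%Y-\%m-\%d)" >> /var/log/pbm-cleanup.log 2>&1
```

Keeping the delete job on a single node matches the “delete: scheduled to another node” lines in the logs above.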

We are planning to change to NFS/shared storage in the near future.
Let me know if you need anything additional.

@TimSandberg - Many thanks for sharing the steps you took to resolve the issue. They look like the same steps I used to fix it last time, but it occurred again, so I am looking for a one-shot fix rather than repeating the workaround.


pbm-agent NOT FOUND FAILED status

Further to this, I have upgraded the tool to 1.4.1, and pbm status now shows the above for one node.

Checking on it.


It is just telling me that the agent on the 3rd node is down. I am checking on it, but the agent seems to be running when I check with systemctl status pbm-agent.

[root@myserver ~]# pbm status

Cluster:

rs0:

  • rs0/myservername01.int.tac.local:27017: pbm-agent v1.4.1 OK
  • rs0/myservername02:27017: pbm-agent v1.4.1 OK
  • rs0/myservername03:27017: pbm-agent NOT FOUND FAILED status:

PITR incremental backup:

Status [OFF]

Currently running:

Snapshot backup “2021-04-15T14:00:01Z”, started at 2021-04-15T14:00:18Z. Status: snapshot backup. [op id: 607846e179fc20e728668c04]

Backups:

FS /mongo-backup/db_dumpfiles/
Snapshots:
2021-04-15T14:00:01Z 0.00B [running: running / 2021-04-15T14:00:20]
