@Andrew_Pogrebnoi
Sorry for the delayed response.
I just got the error again, and the backups have not run since. Error details are below:
Backup snapshots:
2021-03-20T06:00:01Z
PITR :
2021-03-20T06:49:20 - 2021-03-20T07:50:06
!Failed to run PITR backup. Agent logs:
rs0: PITR backup didn’t started
rs0: 2021-03-20T08:01:09.000+0000 [ERROR] pitr: streaming oplog: undefinded behaviour operation is running
The version we are running is listed below, so to use the commands pbm status or pbm logs I guess I need to upgrade to 1.4. I will plan the version upgrade.
Here are the pbm-agent logs from all three nodes:
Apr 15 04:00:02 myservername02 pbm-agent[96142]: 2021-04-15T04:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T04:00:01Z, compression: s2] <ts: 1618459201>
Apr 15 06:00:02 myservername02 pbm-agent[96142]: 2021-04-15T06:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T06:00:01Z, compression: s2] <ts: 1618466401>
Apr 15 06:00:02 myservername02 pbm-agent[96142]: 2021-04-15T06:00:02.000+0000 [ERROR] backup/2021-04-15T06:00:01Z: ensure no tmp collections: drop tmp roles collection pbmRRoles: (NotMaster) not master
Apr 15 07:00:01 myservername02 pbm-agent[96142]: 2021-04-15T07:00:01.000+0000 [INFO] got command delete <ts: 1618470001>
Apr 15 07:00:01 myservername02 pbm-agent[96142]: 2021-04-15T07:00:01.000+0000 [INFO] delete/2021-04-15T00:00:00Z: deleting backups older than 2021-04-15 00:00:00 +0000 UTC
Apr 15 07:00:01 myservername02 pbm-agent[96142]: 2021/04/15 07:00:01 Info: deleting 2021-03-20T06:00:01Z: unable to delete the last backup while PITR is on
Apr 15 07:00:06 myservername02 pbm-agent[96142]: 2021-04-15T07:00:06.000+0000 [INFO] delete/2021-04-15T00:00:00Z: done
Apr 15 08:00:02 myservername02 pbm-agent[96142]: 2021-04-15T08:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T08:00:01Z, compression: s2] <ts: 1618473601>
Apr 15 08:00:02 myservername02 pbm-agent[96142]: 2021-04-15T08:00:02.000+0000 [ERROR] backup/2021-04-15T08:00:01Z: ensure no tmp collections: drop tmp roles collection pbmRRoles: (NotMaster) not master
Apr 15 10:00:01 myservername02 pbm-agent[96142]: 2021-04-15T10:00:01.000+0000 [INFO] got command backup [name: 2021-04-15T10:00:01Z, compression: s2] <ts: 1618480801>
Apr 15 00:00:02 myservername01 pbm-agent[22116]: 2021-04-15T00:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T00:00:02Z, compression: s2] <ts: 1618444802>
Apr 15 02:00:02 myservername01 pbm-agent[22116]: 2021-04-15T02:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T02:00:01Z, compression: s2] <ts: 1618452001>
Apr 15 02:00:02 myservername01 pbm-agent[22116]: 2021-04-15T02:00:02.000+0000 [ERROR] backup/2021-04-15T02:00:01Z: ensure no tmp collections: drop tmp roles collection pbmRRoles: (NotMaster) not master
Apr 15 04:00:01 myservername01 pbm-agent[22116]: 2021-04-15T04:00:01.000+0000 [INFO] got command backup [name: 2021-04-15T04:00:01Z, compression: s2] <ts: 1618459201>
Apr 15 06:00:02 myservername01 pbm-agent[22116]: 2021-04-15T06:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T06:00:01Z, compression: s2] <ts: 1618466401>
Apr 15 07:00:01 myservername01 pbm-agent[22116]: 2021-04-15T07:00:01.000+0000 [INFO] got command delete <ts: 1618470001>
Apr 15 07:00:01 myservername01 pbm-agent[22116]: 2021-04-15T07:00:01.000+0000 [INFO] delete: scheduled to another node
Apr 15 08:00:02 myservername01 pbm-agent[22116]: 2021-04-15T08:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T08:00:01Z, compression: s2] <ts: 1618473601>
Apr 15 10:00:01 myservername01 pbm-agent[22116]: 2021-04-15T10:00:01.000+0000 [INFO] got command backup [name: 2021-04-15T10:00:01Z, compression: s2] <ts: 1618480801>
Apr 15 10:00:01 myservername01 pbm-agent[22116]: 2021-04-15T10:00:01.000+0000 [ERROR] backup/2021-04-15T10:00:01Z: ensure no tmp collections: drop tmp roles collection pbmRRoles: (NotMaster) not master
Apr 14 22:00:02 myservername03 pbm-agent[15603]: 2021-04-14T22:00:02.000+0000 [INFO] got command backup [name: 2021-04-14T22:00:01Z, compression: s2] <ts: 1618437601>
Apr 15 00:00:02 myservername03 pbm-agent[15603]: 2021-04-15T00:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T00:00:02Z, compression: s2] <ts: 1618444802>
Apr 15 02:00:02 myservername03 pbm-agent[15603]: 2021-04-15T02:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T02:00:01Z, compression: s2] <ts: 1618452001>
Apr 15 04:00:02 myservername03 pbm-agent[15603]: 2021-04-15T04:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T04:00:01Z, compression: s2] <ts: 1618459201>
Apr 15 04:00:02 myservername03 pbm-agent[15603]: 2021-04-15T04:00:02.000+0000 [ERROR] backup/2021-04-15T04:00:01Z: ensure no tmp collections: drop tmp roles collection pbmRRoles: (NotMaster) not master
Apr 15 06:00:02 myservername03 pbm-agent[15603]: 2021-04-15T06:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T06:00:01Z, compression: s2] <ts: 1618466401>
Apr 15 07:00:01 myservername03 pbm-agent[15603]: 2021-04-15T07:00:01.000+0000 [INFO] got command delete <ts: 1618470001>
Apr 15 07:00:01 myservername03 pbm-agent[15603]: 2021-04-15T07:00:01.000+0000 [INFO] delete: scheduled to another node
Apr 15 08:00:02 myservername03 pbm-agent[15603]: 2021-04-15T08:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T08:00:01Z, compression: s2] <ts: 1618473601>
Apr 15 10:00:02 myservername03 pbm-agent[15603]: 2021-04-15T10:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T10:00:01Z, compression: s2] <ts: 1618480801>
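In case it helps with reading the logs above, here is a minimal sketch that groups the pbm-agent lines by backup name and shows which node logged an error for each backup. The regular expressions are my own and only assume the line layout seen in this excerpt; the function name summarize is just for illustration. It only reports where an error was logged, not why.

```python
import re
from collections import defaultdict

# Layout assumed from the journal excerpt above:
# "<Mon> <day> <time> <host> pbm-agent[<pid>]: <timestamp> [<LEVEL>] <message>"
LINE_RE = re.compile(
    r"^\w{3} +\d+ [\d:]+ (?P<host>\S+) pbm-agent\[\d+\]: \S+ "
    r"\[(?P<level>[A-Z]+)\] (?P<msg>.*)$"
)
GOT_BACKUP_RE = re.compile(r"got command backup \[name: (?P<name>[^,]+),")
BACKUP_ERR_RE = re.compile(r"^backup/(?P<name>\S+?Z): (?P<err>.*)$")


def summarize(lines):
    """Return {backup_name: {"received": set_of_hosts, "failed": {host: error}}}."""
    summary = defaultdict(lambda: {"received": set(), "failed": {}})
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # lines in another layout are skipped
        host, level, msg = m["host"], m["level"], m["msg"]
        got = GOT_BACKUP_RE.search(msg)
        if level == "INFO" and got:
            summary[got["name"]]["received"].add(host)
            continue
        err = BACKUP_ERR_RE.match(msg)
        if level == "ERROR" and err:
            summary[err["name"]]["failed"][host] = err["err"]
    return dict(summary)


if __name__ == "__main__":
    sample = [
        "Apr 15 06:00:02 myservername02 pbm-agent[96142]: 2021-04-15T06:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T06:00:01Z, compression: s2] <ts: 1618466401>",
        "Apr 15 06:00:02 myservername02 pbm-agent[96142]: 2021-04-15T06:00:02.000+0000 [ERROR] backup/2021-04-15T06:00:01Z: ensure no tmp collections: drop tmp roles collection pbmRRoles: (NotMaster) not master",
        "Apr 15 06:00:02 myservername01 pbm-agent[22116]: 2021-04-15T06:00:02.000+0000 [INFO] got command backup [name: 2021-04-15T06:00:01Z, compression: s2] <ts: 1618466401>",
    ]
    for name, info in sorted(summarize(sample).items()):
        print(name, sorted(info["received"]), info["failed"])
```

Run over the full excerpt, this should make it easy to see that every backup command was received by all three agents, while each failed backup logs the NotMaster error on a single node.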
Version details
Version: 1.3.4
Platform: linux/amd64
GitCommit: 2789dbc4973d52e0c4ee3963cbe70222f192b463
GitBranch: release-1.3.4
BuildTime: 2020-11-17_15:46_UTC
GoVersion: go1.14.2
Here are some additional details that may help you review the issue.
I use a script for the backup, and it runs from cron every 2 hours (pbm backup).
PITR is set to ON, but instead of shared storage I am using local storage on all 3 servers, with the files stored under the same mount point path on each server.
For cleanup I use pbm delete-backup -f --older-than, but this only removes files on the server where the cron job runs.
I have another cron job on all 3 servers that checks for and manually removes the files belonging to backups already deleted by delete-backup, just to reclaim disk space.
We are planning to move to NFS/shared storage in the near future.
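To make the manual cleanup step concrete, this is roughly the kind of check that cron job performs, written as a minimal sketch and not the exact script. It assumes that every backup file in the storage directory starts with the backup name (for example 2021-04-15T06:00:01Z_rs0.dump.s2); that naming is an assumption to verify against your own directory. The function name find_orphans and the kept_backups list (for instance, names copied from the snapshot list) are hypothetical. It only reports candidate files and deletes nothing.

```python
import re

# Assumption: backup files are prefixed with the backup name, which has the
# same shape as the snapshot names shown above (e.g. 2021-03-20T06:00:01Z).
BACKUP_PREFIX_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")


def find_orphans(filenames, kept_backups):
    """Return files whose backup-name prefix is not in kept_backups.

    Files that do not start with a backup name are never reported,
    so anything else in the directory is left alone.
    """
    kept = set(kept_backups)
    orphans = []
    for name in filenames:
        m = BACKUP_PREFIX_RE.match(name)
        if m and m.group(0) not in kept:
            orphans.append(name)
    return sorted(orphans)


if __name__ == "__main__":
    # Hypothetical directory listing and list of backups still known to pbm.
    files = [
        "2021-03-20T06:00:01Z_rs0.dump.s2",
        "2021-04-14T22:00:01Z_rs0.dump.s2",
        "notes.txt",
    ]
    for path in find_orphans(files, ["2021-03-20T06:00:01Z"]):
        print("candidate for removal:", path)
```

Keeping this as a report-only step makes it safer to run on all 3 servers, since the actual removal can be reviewed before anything is deleted.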
Let me know if you need any additional information.
@TimSandberg - Many thanks for sharing the steps you took to resolve the issue. They look like the same steps I used last time to fix it, but the error occurred again, so I am looking for a permanent fix rather than repeating the workaround.