- Setup: 3-node, non-arbiter, non-sharded replset, running with pbm
- While a member node is down, and that node happens to be the one pbm was using to run its PITR, PITR “sticks” in the failure state and doesn’t try to switch to the other available secondary node.
- After bringing that node back online, pbm's PITR doesn't appear to recover until I poke it by reapplying the config. (I may not be 100% correct on this one: I know I saw it when fully reprovisioning the node, but I'm not 100% certain I saw it on a simple restart.)
My general expectation is that if I had a cluster node fail in the middle of the night, PITR would continue to function until I got around to bringing that node back online so long as the cluster was operational.
Is this not a valid expectation?
pbm list run from the current primary:

    2020-11-21T16:21:49 - 2020-11-21T16:52:17
    !Failed to run PITR backup. Agent logs:
    repl-c-guild: 2020-11-21T17:23:59.000+0000 [ERROR] pitr: node check: get NodeInfo data: run mongo command: server selection error: server selection timeout current topology: Type: Single Servers: Addr: localhost:27017, Type: Unknown, State: Connected, Average RTT: 0, Last error: connection() : dial tcp [::1]:27017: connect: connection refused

    root@c-guild-c02-db01-s01:~# date
    Sat Nov 21 17:24:17 UTC 2020
One thing that isn’t really clear from the docs, and that I'm not sure I have correct: the PBM_MONGO_URI env var on each node in the replset references only localhost:27017. Should it reference the full cluster URI instead?
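
To make the question concrete, here is a rough sketch of the two URI forms I'm comparing. The user name, password, host names (db01 to db03), and replica set name (rs0) are placeholders I made up for illustration, not my real values, and I'm not claiming either form is the right one for the agent.

```shell
# Placeholder credentials and names (assumptions, not real values).
PBM_USER="pbmuser"
PBM_PASS="secret"
RS_NAME="rs0"

# Form 1: what each node has today, pointing only at its local mongod.
LOCAL_URI="mongodb://${PBM_USER}:${PBM_PASS}@localhost:27017/?authSource=admin"

# Form 2: the alternative I'm asking about, listing every replset member.
CLUSTER_URI="mongodb://${PBM_USER}:${PBM_PASS}@db01:27017,db02:27017,db03:27017/?authSource=admin&replicaSet=${RS_NAME}"

# Current setting on every node.
export PBM_MONGO_URI="$LOCAL_URI"
```

With Form 1, the "connection refused" on localhost:27017 in the log above is what I'd expect from the agent on the downed node, since its only target is the local mongod.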