PBM doesn't tolerate a replset member being down

  Setup: 3-node, non-arbiter, non-sharded replset, running with pbm.

  Two symptoms:

    1. While a member node is down, and that node happens to be the one pbm was using to run its PITR, PITR “sticks” in the failure state and doesn't try to switch to the other available secondary node.

    2. After bringing that node back online, pbm's PITR doesn't appear to recover until I poke it by reapplying the config. (I may not be 100% correct on this one; I know I saw it when fully reprovisioning the node, but I'm not certain I was seeing it on a simple restart.)

My general expectation is that if I had a cluster node fail in the middle of the night, PITR would continue to function until I got around to bringing that node back online so long as the cluster was operational.

Is this not a valid expectation?

pbm list run from the current primary:

  2020-11-21T16:21:49 - 2020-11-21T16:52:17


!Failed to run PITR backup. Agent logs:
  repl-c-guild: 2020-11-21T17:23:59.000+0000 [ERROR] pitr: node check: get NodeInfo data: run mongo command: server selection error: server selection timeout
current topology: Type: Single
Servers:
Addr: localhost:27017, Type: Unknown, State: Connected, Average RTT: 0, Last error: connection() : dial tcp [::1]:27017: connect: connection refused


root@c-guild-c02-db01-s01:~# date
Sat Nov 21 17:24:17 UTC 2020

One thing that isn't really clear from the docs, if I have this right: the PBM_MONGO_URI env var on each node in the replset only references localhost:27017. Should those be referencing the full cluster URI instead?

On the recovery of the single node: it looks like once pbm sees an error talking to the local instance, it doesn't try again. Poking the same config back into it appears to wake it up so it retries.
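
For reference, "poking the config" here just means re-applying the existing config file, or toggling PITR off and on; roughly like this (the file path is only an example):

  # re-apply the existing config file (path is just an example)
  pbm config --file /etc/pbm/pbm_config.yaml

  # or toggle PITR off and back on
  pbm config --set pitr.enabled=false
  pbm config --set pitr.enabled=true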

Hi.

Your expectation was correct: pbm-agents on the other nodes should take over PITR if one fails. The next cycle of PITR work will recognize that the previous oplog slice was not completed, and the active pbm-agent will create a new oplog slice that starts at the failed slice's starting time (or, to put it another way, starting from the end of the last successful oplog slice). It will also delete any partially created oplog slice file that might have been left behind.
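
When the takeover works as intended, a rough way to confirm it from the outside (assuming pbm-agent runs under systemd; the unit name and time window are just examples) is:

  # on each replset node, see which pbm-agent has been doing the PITR slicing
  journalctl -u pbm-agent --since "1 hour ago" | grep -i pitr

  # from any node with the pbm CLI configured, confirm the PITR time ranges are continuous again
  pbm list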

But there was a bug. When the mongod node died, the pbm-agent observed the connection session from the mongo driver as still being alive (which is normal for MongoDB drivers; they keep trying to reconnect so as not to fail immediately on any network disturbance). So the pbm-agent stayed active when it should have given up.

A developer had already noticed this whilst working on PBM-345 and PBM-435 and patched it (without a separate ticket). The fix is coming in v1.4.

I appreciate the informative bug report, and I apologize that you were affected by this.

One thing that isn't really clear from the docs, if I have this right: the PBM_MONGO_URI env var on each node in the replset only references localhost:27017. Should those be referencing the full cluster URI instead?

No, localhost:port is sufficient and correct for the pbm-agent nodes. They automatically discover the topology after they’ve made the first local connection, then make the extra connections they need.
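
For example, the env file on each node can keep pointing at the local mongod only, something like the following (the file path and credentials are placeholders and depend on your distro packaging and user setup):

  # e.g. /etc/default/pbm-agent (Debian/Ubuntu packaging; the path may differ on your distro)
  PBM_MONGO_URI="mongodb://pbmuser:secretpwd@localhost:27017/?authSource=admin"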

Akira

It's retroactively becoming its own ticket: https://jira.percona.com/browse/PBM-597