PBM doesn't tolerate a replset member being down

  Setup: 3-node, non-arbiter, non-sharded replset, running with pbm.

  Two symptoms:

    1. While a member node is down, and that node happens to be the one pbm was using to run its PITR, PITR “sticks” in the failure state and doesn't try to switch to the other available secondary node.

    2. After bringing that node back online, pbm's PITR doesn't appear to recover until I poke it by reapplying the config. (I may not be 100% correct on this one; I know I saw it when fully reprovisioning the node, but I'm not certain I was seeing it on a simple restart.)

My general expectation is that if I had a cluster node fail in the middle of the night, PITR would continue to function until I got around to bringing that node back online so long as the cluster was operational.

Is this not a valid expectation?

pbm list run from the current primary:

  2020-11-21T16:21:49 - 2020-11-21T16:52:17


!Failed to run PITR backup. Agent logs:
  repl-c-guild: 2020-11-21T17:23:59.000+0000 [ERROR] pitr: node check: get NodeInfo data: run mongo command: server selection error: server selection timeout
current topology: Type: Single
Servers:
Addr: localhost:27017, Type: Unknown, State: Connected, Average RTT: 0, Last error: connection() : dial tcp [::1]:27017: connect: connection refused


root@c-guild-c02-db01-s01:~# date
Sat Nov 21 17:24:17 UTC 2020

One thing that isn't really clear from the docs, if I have this right: the PBM_MONGO_URI env var on each node in the replset only references localhost:27017. Should those be referencing the full cluster URI instead?

On the recovery of the single node: it looks like once pbm sees an error talking to the local instance, it doesn't try again. Poking the same config back into it appears to wake it up so it retries.
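
For reference, "poking the config" here just means re-applying the existing config file, or toggling PITR off and on; roughly like this (the file path is only an example):

  # re-apply the existing config file (path is just an example)
  pbm config --file /etc/pbm/pbm_config.yaml

  # or toggle PITR off and back on
  pbm config --set pitr.enabled=false
  pbm config --set pitr.enabled=true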

Hi.

Your expectation was correct: pbm-agents on the other nodes should take over PITR if one fails. The next cycle of PITR work will recognize that the previous oplog slice was not completed, and the active pbm-agent will create a new oplog slice that starts at the failed slice's starting time (or, to put it another way, starting from the end of the last successful oplog slice). It will also delete any partially created oplog slice file that might have been left behind.
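
When the takeover works as intended, a rough way to confirm it from the outside (assuming pbm-agent runs under systemd; the unit name and time window are just examples) is:

  # on each replset node, see which pbm-agent has been doing the PITR slicing
  journalctl -u pbm-agent --since "1 hour ago" | grep -i pitr

  # from any node with the pbm CLI configured, confirm the PITR time ranges are continuous again
  pbm list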

But there was a bug. When the mongod node died, the pbm-agent observed the connection session from the mongo driver as still being alive (which is normal for MongoDB drivers; they keep trying to reconnect so as not to fail immediately on any network disturbance). So the pbm-agent stayed active when it should have given up.

A developer had already noticed this whilst working on PBM-345 and PBM-435 and patched it (without a separate ticket). The fix is coming in v1.4.

I appreciate the informative bug report, and I apologize that you were affected by this.

One thing that isn't really clear from the docs, if I have this right: the PBM_MONGO_URI env var on each node in the replset only references localhost:27017. Should those be referencing the full cluster URI instead?

No, localhost:port is sufficient and correct for the pbm-agent nodes. They automatically discover the topology after they’ve made the first local connection, then make the extra connections they need.
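
For example, the env file on each node can keep pointing at the local mongod only, something like the following (the file path and credentials are placeholders and depend on your distro packaging and user setup):

  # e.g. /etc/default/pbm-agent (Debian/Ubuntu packaging; the path may differ on your distro)
  PBM_MONGO_URI="mongodb://pbmuser:secretpwd@localhost:27017/?authSource=admin"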

Akira

It's retroactively becoming its own ticket: https://jira.percona.com/browse/PBM-597