Restore on each MongoDB node succeeds but the overall status remains “running”

Hello,

We are testing the restore of a physical backup in a three-node cluster configured with one shard. Each node runs a configsvr instance, a db (shard) instance and a mongos instance. Each node also runs two PBM agents, one for the configsvr and one for the db instance. The restore never finishes: it keeps showing the status “running” even after all nodes have successfully completed their part of the restore.
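The restore is launched and monitored roughly like this (the backup name and the config file path are illustrative placeholders; only the restore name matches the one in the logs below, and the exact flags are our assumption of the usual PBM CLI usage):

```
# list backups and start the physical restore (backup name is a placeholder)
pbm status
pbm restore 2024-11-28T14:40:00Z

# during a physical restore mongod is down, so progress is checked with
# describe-restore pointed at the PBM config file (path is a placeholder)
pbm describe-restore 2024-11-28T14:42:50.674621101Z -c /etc/pbm-agent-storage.conf
```

It is this status output that stays in “running” even after every node reports success.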

Here are the logs of one of the PBM agents for a db instance. The logs of the other PBM agents (configsvr and db) are quite similar:

nov 28 17:01:48 server1 pbm-agent[1114218]: 2024-11-28T17:01:48.000+0100 I [restore/2024-11-28T14:42:50.674621101Z] preparing data
nov 28 17:01:54 server1 pbm-agent[1114218]: 2024-11-28T17:01:54.000+0100 D [restore/2024-11-28T14:42:50.674621101Z] oplogTruncateAfterPoint: {1732801816 383}
nov 28 17:01:56 server1 pbm-agent[1114218]: 2024-11-28T17:01:56.000+0100 I [restore/2024-11-28T14:42:50.674621101Z] recovering oplog as standalone
nov 28 17:02:02 server1 pbm-agent[1114218]: 2024-11-28T17:02:02.000+0100 I [restore/2024-11-28T14:42:50.674621101Z] clean-up and reset replicaset config
nov 28 17:02:08 server1 pbm-agent[1114218]: 2024-11-28T17:02:08.000+0100 I [restore/2024-11-28T14:42:50.674621101Z] restore on node succeed
nov 28 17:02:08 server1 pbm-agent[1114218]: 2024-11-28T17:02:08.000+0100 I [restore/2024-11-28T14:42:50.674621101Z] moving to state done
nov 28 17:02:08 server1 pbm-agent[1114218]: 2024-11-28T17:02:08.000+0100 I [restore/2024-11-28T14:42:50.674621101Z] waiting for done status in rs map[.pbm.restore/2024-11-28T14:42:50.674621101Z/rs.mongo-db0/node.server1.reta.local:27018:{} .pbm.restore/2024-11-28T14:42:50.674621101Z/rs.mongo-db0/node.server2.reta.local:27018:{} .pbm.restore/2024-11-28T14:42:50.674621101Z/rs.mongo-db0/node.server3.reta.local:27018:{}]
nov 28 17:02:13 server1 pbm-agent[1114218]: 2024-11-28T17:02:13.000+0100 I [restore/2024-11-28T14:42:50.674621101Z] waiting for shards map[.pbm.restore/2024-11-28T14:42:50.674621101Z/rs.mongo-configsvr/rs:{} .pbm.restore/2024-11-28T14:42:50.674621101Z/rs.mongo-db0/rs:{}]
nov 28 17:02:18 server1 pbm-agent[1114218]: 2024-11-28T17:02:18.000+0100 D [restore/2024-11-28T14:42:50.674621101Z] rm tmp conf
nov 28 17:02:18 server1 pbm-agent[1114218]: 2024-11-28T17:02:18.000+0100 E [restore/2024-11-28T14:42:50.674621101Z] restore: moving to state done: wait for shards: check heartbeat in .pbm.restore/2024-11-28T14:42:50.674621101Z/rs.mongo-configsvr/rs.hb: stuck, last beat ts: 1732804971
nov 28 17:02:18 server1 pbm-agent[1114218]: 2024-11-28T17:02:18.000+0100 D [restore/2024-11-28T14:42:50.674621101Z] hearbeats stopped
nov 28 17:02:18 server1 pbm-agent[1114218]: 2024-11-28T17:02:18.000+0100 I change stream was closed
nov 28 17:02:18 server1 pbm-agent[1114218]: 2024-11-28T17:02:18.000+0100 D [agentCheckup] deleting agent status
nov 28 17:02:18 server1 pbm-agent[1114218]: 2024-11-28T17:02:18.000+0100 E [pitr] init: get conf: get: server selection error: context canceled, current topology: { Type: ReplicaSetNoPrimary, Servers: [{ Addr: server1.reta.local:27019, Type: Unknown, Last error: dial tcp 127.0.1.1:27019: connect: connection refused }, { Addr: server2.reta.local:27019, Type: Unknown, Last error: dial tcp 172.16.61.52:27019: connect: connection refused }, { Addr: server3.reta.local:27019, Type: Unknown, Last error: dial tcp 172.16.61.53:27019: connect: connection refused }, ] }
nov 28 17:02:18 server1 pbm-agent[1114218]: 2024/11/28 17:02:18 Exit:
nov 28 17:02:18 server1 systemd[1]: pbm-agent-db.service: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: Enterprise open source support | Ubuntu

-- The unit pbm-agent-db.service has successfully entered the 'dead' state.

We can see that the restore on the node succeeds, but it then waits for the acknowledgement of the other nodes and fails on the configsvr heartbeat check. How can we find out what is failing here? After the restore, all instances (mongod and pbm-agent) end up stopped and we have to start them manually. Once all services are started again, everything works as expected and the data is accessible with no problems.
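In case it helps, a sketch of how the stuck heartbeat and the agent logs can be inspected. The storage mount point is a placeholder and assumes filesystem-type backup storage (with S3 the listing would be done with the S3 client instead), and the pbm logs flags are assumed from the usual usage:

```
# during a physical restore the coordination files, including the rs.hb
# heartbeat the error complains about, live on the backup storage itself
ls -l /backups/.pbm.restore/2024-11-28T14:42:50.674621101Z/rs.mongo-configsvr/
cat /backups/.pbm.restore/2024-11-28T14:42:50.674621101Z/rs.mongo-configsvr/rs.hb

# once mongod and the pbm-agents are running again, the restore logs from all
# agents can be gathered in one place
pbm logs --tail 200 --severity D --event restore/2024-11-28T14:42:50.674621101Z
```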