PBM restore fails

Hi everyone, I need some help with this.
I'm testing PBM for my databases on a sharded cluster, and I have a problem restoring backups; the error output is:
$ pbm restore "2023-05-16T12:33:42Z"
Starting restore 2023-05-16T14:52:51.495209807Z from '2023-05-16T12:33:42Z'...Error: no confirmation that restore has successfully started. Replsets status:
< no replset has started restore >

I read the log files of the config replica set:
Primary .137: [restore/2023-05-16T14:34:20.973425037Z] waiting for starting status in rs map[.pbm.restore/2023-05-16T14:34:20.973425037Z/rs.ReplicaConfig/node.192.168.1.137:27017:{} .pbm.restore/2023-05-16T14:34:20.973425037Z/rs.ReplicaConfig/node.192.168.1.138:27017:{} .pbm.restore/2023-05-16T14:34:20.973425037Z/rs.ReplicaConfig/node.192.168.1.139:27017:{}]
Secondary .138, .139: [restore/2023-05-15T14:59:54.852458378Z] waiting for cluster
but I cannot understand the problem. Could someone help me?

Hi Marta, welcome to the forum. Which version of PBM are you testing? First of all, make sure that pbm-agent is running on all the nodes. If that is the case, check the logs on every agent. Also, is the mongod process up and running everywhere? The checks below are a quick way to confirm.
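Assuming a standard package install (systemd unit named pbm-agent) and a MongoDB connection URI already configured for the pbm CLI, something along these lines should tell you:

$ pbm status
$ sudo systemctl status pbm-agent    # run on every node
$ pbm logs --tail=100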

Hi Ivan, thanks for the answer.
I'm using version 2.1.0. I can confirm that both mongod and pbm-agent are running on all the nodes.
Here are the logs of the shard replica sets (I'm only including one replica set; the only difference is in the IP addresses).

Primary .135:
got command restore [name: 2023-05-16T15:54:43.880892574Z, backup name: 2023-05-16T12:33:42Z] <ts: 1684252483>
got epoch {1684243958 37}
[restore/2023-05-16T15:54:43.880892574Z] backup: 2023-05-16T12:33:42Z
[restore/2023-05-16T15:54:43.880892574Z] restore started
[restore/2023-05-16T15:54:43.880892574Z] port: 27086
[restore/2023-05-16T15:54:43.880892574Z] mongod binary: mongod, version: v6.0.5-4
[restore/2023-05-16T15:54:43.880892574Z] moving to state starting
[restore/2023-05-16T15:54:43.880892574Z] waiting for starting status in rs map[.pbm.restore/2023-05-16T15:54:43.880892574Z/rs.ReplicaShard/node.192.168.1.134:27017:{} .pbm.restore/2023-05-16T15:54:43.880892574Z/rs.ReplicaShard/node.192.168.1.135:27017:{} .pbm.restore/2023-05-16T15:54:43.880892574Z/rs.ReplicaShard/node.192.168.1.136:27017:{}]
[restore/2023-05-16T15:54:43.880892574Z] rm tmp conf
[restore/2023-05-16T15:54:43.880892574Z] hearbeats stopped
[restore/2023-05-16T15:54:43.880892574Z] restore finished move to running state: wait for nodes in rs: check heartbeat in .pbm.restore/2023-05-16T15:54:43.880892574Z/rs.ReplicaShard/node.192.168.1.134:27017.hb: stuck, last beat ts: 1684252484
[restore/2023-05-16T15:54:43.880892574Z] move to running state: wait for nodes in rs: check heartbeat in .pbm.restore/2023-05-16T15:54:43.880892574Z/rs.ReplicaShard/node.192.168.1.134:27017.hb: stuck, last beat ts: 1684252484

Secondary .134, 136:
got command restore [name: 2023-05-16T12:46:56.956214359Z, backup name: 2023-05-16T08:22:15Z] <ts: 1684241216>
got epoch {1684241136 8}
[restore/2023-05-16T12:46:56.956214359Z] backup: 2023-05-16T08:22:15Z
[restore/2023-05-16T12:46:56.956214359Z] restore started
[restore/2023-05-16T12:46:56.956214359Z] port: 28030
[restore/2023-05-16T12:46:56.956214359Z] mongod binary: mongod, version: v6.0.5-4
[restore/2023-05-16T12:46:56.956214359Z] moving to state starting
[restore/2023-05-16T12:46:56.956214359Z] waiting for cluster

As per the log above, it seems PBM is failing to receive a heartbeat from 192.168.1.134:27017. Can you check that the node is healthy? Maybe look in mongod.log for any clues, for example as shown below.
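For instance, something like this (host taken from your logs; adjust credentials and the log path for your install) shows the replica set's view of that member and any recent error-level entries in its mongod log:

$ mongosh "mongodb://192.168.1.134:27017" --eval "rs.status().members.forEach(m => print(m.name, m.stateStr, m.health))"
$ grep -E '"s":"(E|F)"' /var/log/mongodb/mongod.log | tail -n 20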

192.168.1.134 is healthy. But the same log line appears on both the shard replica set and the config replica set, pointing at nodes 192.168.1.131, .134, and .138.
Here are the three mongod log entries closest in time to the PBM error:

192.168.1.131:
{"t":{"$date":"2023-05-16T15:58:30.717+00:00"},"s":"I", "c":"NETWORK", "id":22943, "ctx":"listener","msg":"Connection accepted","attr":{"remote":"192.168.1.131:48996","uuid":"f5d6fcdd-620c-41ac-967c-65a332ab54fd","connectionId":198,"connectionCount":34}}
{"t":{"$date":"2023-05-16T15:58:30.717+00:00"},"s":"I", "c":"NETWORK", "id":51800, "ctx":"conn198","msg":"client metadata","attr":{"remote":"192.168.1.131:48996","client":"conn198","doc":{"driver":{"name":"NetworkInterfaceTL-TaskExecutorPool-0","version":"6.0.5-4"},"os":{"type":"Linux","name":"Ubuntu","architecture":"x86_64","version":"22.04"}}}}
{"t":{"$date":"2023-05-16T15:58:30.748+00:00"},"s":"I", "c":"NETWORK", "id":22944, "ctx":"conn192","msg":"Connection ended","attr":{"remote":"192.168.1.131:58022","uuid":"1c661fc1-7016-466d-9a78-8f2c9f2727c0","connectionId":192,"connectionCount":33}}

192.168.1.134:
{"t":{"$date":"2023-05-16T15:58:24.627+00:00"},"s":"I", "c":"SH_REFR", "id":4619901, "ctx":"CatalogCache-42","msg":"Refreshed cached collection","attr":{"namespace":"config.system.sessions","lookupSinceVersion":"1|1024||64620025b37a33f963149d0f||Timestamp(1684144165, 9)","newVersion":"{ chunkVersion: { t: Timestamp(1684144165, 9), e: ObjectId('64620025b37a33f963149d0f'), v: Timestamp(1, 1024) }, forcedRefreshSequenceNum: 169, epochDisambiguatingSequenceNum: 127 }","timeInStore":"{ chunkVersion: \"None\", forcedRefreshSequenceNum: 168, epochDisambiguatingSequenceNum: 126 }","durationMillis":7}}

192.168.1.138:
{"t":{"$date":"2023-05-16T15:56:43.886+00:00"},"s":"I", "c":"NETWORK", "id":22944, "ctx":"conn972","msg":"Connection ended","attr":{"remote":"192.168.1.142:45586","uuid":"f298e40e-8c17-4340-aa2d-1bca66a1c7e3","connectionId":972,"connectionCount":42}}
{"t":{"$date":"2023-05-16T15:59:25.269+00:00"},"s":"I", "c":"SH_REFR", "id":4619901, "ctx":"CatalogCache-590","msg":"Refreshed cached collection","attr":{"namespace":"config.system.sessions","lookupSinceVersion":"1|1024||64620025b37a33f963149d0f||Timestamp(1684144165, 9)","newVersion":"{ chunkVersion: { t: Timestamp(1684144165, 9), e: ObjectId('64620025b37a33f963149d0f'), v: Timestamp(1, 1024) }, forcedRefreshSequenceNum: 3441, epochDisambiguatingSequenceNum: 3437 }","timeInStore":"{ chunkVersion: \"None\", forcedRefreshSequenceNum: 3440, epochDisambiguatingSequenceNum: 3436 }","durationMillis":1}}

Is it possible that I did something wrong in the installation process?

I have also hit this error. After checking the code, I found that it only waits 15 seconds for the restore to start, and unfortunately that timeout is not configurable. After changing it to 300 seconds and rebuilding, the restore was successful (the sketch below shows the kind of change).
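To illustrate the pattern only: the constant name, function, and package below are hypothetical, not the actual PBM identifiers; this is a minimal sketch of a wait-for-started loop with a hard-coded timeout like the one described above.

package restore

import (
	"errors"
	"time"
)

// Hypothetical stand-in for the hard-coded wait in the restore code.
// Bumping the 15-second window to 300 seconds and rebuilding was enough here.
const waitStartTimeout = 300 * time.Second // was 15 * time.Second

// waitForStarted polls the given status check once per second until it
// reports true or the timeout above expires.
func waitForStarted(started func() (bool, error)) error {
	tick := time.NewTicker(time.Second)
	defer tick.Stop()

	deadline := time.After(waitStartTimeout)
	for {
		select {
		case <-tick.C:
			ok, err := started()
			if err != nil {
				return err
			}
			if ok {
				return nil
			}
		case <-deadline:
			return errors.New("no confirmation that restore has successfully started")
		}
	}
}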


After the change:
Starting restore to the point in time '2023-10-19T00:00:00'.................................Started.

The error before the update:
Starting restore to the point in time '2023-10-19T00:00:00'...............Error: no confirmation that restore has successfully started. Replsets status: