PBM restore fails

Hi everyone, I need some help with this.
I'm testing PBM for my databases on a sharded cluster, and I have a problem restoring backups; the error output is:
$ pbm restore "2023-05-16T12:33:42Z"
Starting restore 2023-05-16T14:52:51.495209807Z from '2023-05-16T12:33:42Z'...Error: no confirmation that restore has successfully started. Replsets status:
< no replset has started restore >

I read the log files of the config replica set:
Primary .137: [restore/2023-05-16T14:34:20.973425037Z] waiting for starting status in rs map[.pbm.restore/2023-05-16T14:34:20.973425037Z/rs.ReplicaConfig/node.192.168.1.137:27017:{} .pbm.restore/2023-05-16T14:34:20.973425037Z/rs.ReplicaConfig/node.192.168.1.138:27017:{} .pbm.restore/2023-05-16T14:34:20.973425037Z/rs.ReplicaConfig/node.192.168.1.139:27017:{}]
Secondary .138, .139: [restore/2023-05-15T14:59:54.852458378Z] waiting for cluster
but I cannot understand the problem. Could someone help me?

Hi Marta, welcome to the forum. Which version of PBM are you testing? First of all, make sure that pbm-agent is running on all the nodes. If that is the case, check the logs on every agent. Also, is the mongod process up and running everywhere? The checks below are a quick way to confirm.
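Assuming a standard package install (systemd unit named pbm-agent) and a MongoDB connection URI already configured for the pbm CLI, something along these lines should tell you:

$ pbm status
$ sudo systemctl status pbm-agent    # run on every node
$ pbm logs --tail=100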

Hi Ivan, thanks for the answer.
I'm using version 2.1.0. I can confirm that both mongod and pbm-agent are running on all the nodes.
Here are the logs of the shard replica sets (I'm only including one replica set; the only difference is in the IP addresses).

Primary .135:
got command restore [name: 2023-05-16T15:54:43.880892574Z, backup name: 2023-05-16T12:33:42Z] <ts: 1684252483>
got epoch {1684243958 37}
[restore/2023-05-16T15:54:43.880892574Z] backup: 2023-05-16T12:33:42Z
[restore/2023-05-16T15:54:43.880892574Z] restore started
[restore/2023-05-16T15:54:43.880892574Z] port: 27086
[restore/2023-05-16T15:54:43.880892574Z] mongod binary: mongod, version: v6.0.5-4
[restore/2023-05-16T15:54:43.880892574Z] moving to state starting
[restore/2023-05-16T15:54:43.880892574Z] waiting for starting status in rs map[.pbm.restore/2023-05-16T15:54:43.880892574Z/rs.ReplicaShard/node.192.168.1.134:27017:{} .pbm.restore/2023-05-16T15:54:43.880892574Z/rs.ReplicaShard/node.192.168.1.135:27017:{} .pbm.restore/2023-05-16T15:54:43.880892574Z/rs.ReplicaShard/node.192.168.1.136:27017:{}]
[restore/2023-05-16T15:54:43.880892574Z] rm tmp conf
[restore/2023-05-16T15:54:43.880892574Z] hearbeats stopped
[restore/2023-05-16T15:54:43.880892574Z] restore finished move to running state: wait for nodes in rs: check heartbeat in .pbm.restore/2023-05-16T15:54:43.880892574Z/rs.ReplicaShard/node.192.168.1.134:27017.hb: stuck, last beat ts: 1684252484
[restore/2023-05-16T15:54:43.880892574Z] move to running state: wait for nodes in rs: check heartbeat in .pbm.restore/2023-05-16T15:54:43.880892574Z/rs.ReplicaShard/node.192.168.1.134:27017.hb: stuck, last beat ts: 1684252484

Secondary .134, 136:
got command restore [name: 2023-05-16T12:46:56.956214359Z, backup name: 2023-05-16T08:22:15Z] <ts: 1684241216>
got epoch {1684241136 8}
[restore/2023-05-16T12:46:56.956214359Z] backup: 2023-05-16T08:22:15Z
[restore/2023-05-16T12:46:56.956214359Z] restore started
[restore/2023-05-16T12:46:56.956214359Z] port: 28030
[restore/2023-05-16T12:46:56.956214359Z] mongod binary: mongod, version: v6.0.5-4
[restore/2023-05-16T12:46:56.956214359Z] moving to state starting
[restore/2023-05-16T12:46:56.956214359Z] waiting for cluster

As per the log above, it seems PBM is failing to receive a heartbeat from 192.168.1.134:27017. Can you check that the node is healthy? Maybe look in mongod.log for any clues, for example as shown below.
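For instance, something like this (host taken from your logs; adjust credentials and the log path for your install) shows the replica set's view of that member and any recent error-level entries in its mongod log:

$ mongosh "mongodb://192.168.1.134:27017" --eval "rs.status().members.forEach(m => print(m.name, m.stateStr, m.health))"
$ grep -E '"s":"(E|F)"' /var/log/mongodb/mongod.log | tail -n 20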

192.168.1.134 is healthy. But the same log line appears on both the shard replica set and the config replica set, pointing at nodes 192.168.1.131, .134, and .138.
Here are the three mongod log entries closest in time to the PBM error:

192.168.1.131:
{"t":{"$date":"2023-05-16T15:58:30.717+00:00"},"s":"I", "c":"NETWORK", "id":22943, "ctx":"listener","msg":"Connection accepted","attr":{"remote":"192.168.1.131:48996","uuid":"f5d6fcdd-620c-41ac-967c-65a332ab54fd","connectionId":198,"connectionCount":34}}
{"t":{"$date":"2023-05-16T15:58:30.717+00:00"},"s":"I", "c":"NETWORK", "id":51800, "ctx":"conn198","msg":"client metadata","attr":{"remote":"192.168.1.131:48996","client":"conn198","doc":{"driver":{"name":"NetworkInterfaceTL-TaskExecutorPool-0","version":"6.0.5-4"},"os":{"type":"Linux","name":"Ubuntu","architecture":"x86_64","version":"22.04"}}}}
{"t":{"$date":"2023-05-16T15:58:30.748+00:00"},"s":"I", "c":"NETWORK", "id":22944, "ctx":"conn192","msg":"Connection ended","attr":{"remote":"192.168.1.131:58022","uuid":"1c661fc1-7016-466d-9a78-8f2c9f2727c0","connectionId":192,"connectionCount":33}}

192.168.1.134:
{"t":{"$date":"2023-05-16T15:58:24.627+00:00"},"s":"I", "c":"SH_REFR", "id":4619901, "ctx":"CatalogCache-42","msg":"Refreshed cached collection","attr":{"namespace":"config.system.sessions","lookupSinceVersion":"1|1024||64620025b37a33f963149d0f||Timestamp(1684144165, 9)","newVersion":"{ chunkVersion: { t: Timestamp(1684144165, 9), e: ObjectId('64620025b37a33f963149d0f'), v: Timestamp(1, 1024) }, forcedRefreshSequenceNum: 169, epochDisambiguatingSequenceNum: 127 }","timeInStore":"{ chunkVersion: \"None\", forcedRefreshSequenceNum: 168, epochDisambiguatingSequenceNum: 126 }","durationMillis":7}}

192.168.1.138:
{"t":{"$date":"2023-05-16T15:56:43.886+00:00"},"s":"I", "c":"NETWORK", "id":22944, "ctx":"conn972","msg":"Connection ended","attr":{"remote":"192.168.1.142:45586","uuid":"f298e40e-8c17-4340-aa2d-1bca66a1c7e3","connectionId":972,"connectionCount":42}}
{"t":{"$date":"2023-05-16T15:59:25.269+00:00"},"s":"I", "c":"SH_REFR", "id":4619901, "ctx":"CatalogCache-590","msg":"Refreshed cached collection","attr":{"namespace":"config.system.sessions","lookupSinceVersion":"1|1024||64620025b37a33f963149d0f||Timestamp(1684144165, 9)","newVersion":"{ chunkVersion: { t: Timestamp(1684144165, 9), e: ObjectId('64620025b37a33f963149d0f'), v: Timestamp(1, 1024) }, forcedRefreshSequenceNum: 3441, epochDisambiguatingSequenceNum: 3437 }","timeInStore":"{ chunkVersion: \"None\", forcedRefreshSequenceNum: 3440, epochDisambiguatingSequenceNum: 3436 }","durationMillis":1}}

Is it possible that I did something wrong in the installation process?

I have also hit this error. After checking the code, I found that it only waits 15 seconds for the restore to start, and unfortunately that timeout is not configurable. After changing it to 300 seconds and rebuilding, the restore was successful (the sketch below shows the kind of change).
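To illustrate the pattern only: the constant name, function, and package below are hypothetical, not the actual PBM identifiers; this is a minimal sketch of a wait-for-started loop with a hard-coded timeout like the one described above.

package restore

import (
	"errors"
	"time"
)

// Hypothetical stand-in for the hard-coded wait in the restore code.
// Bumping the 15-second window to 300 seconds and rebuilding was enough here.
const waitStartTimeout = 300 * time.Second // was 15 * time.Second

// waitForStarted polls the given status check once per second until it
// reports true or the timeout above expires.
func waitForStarted(started func() (bool, error)) error {
	tick := time.NewTicker(time.Second)
	defer tick.Stop()

	deadline := time.After(waitStartTimeout)
	for {
		select {
		case <-tick.C:
			ok, err := started()
			if err != nil {
				return err
			}
			if ok {
				return nil
			}
		case <-deadline:
			return errors.New("no confirmation that restore has successfully started")
		}
	}
}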


After the change:
Starting restore to the point in time '2023-10-19T00:00:00'.................................Started.

The error before the update:
Starting restore to the point in time '2023-10-19T00:00:00'...............Error: no confirmation that restore has successfully started. Replsets status: