PBM restore starting but pbm status show agent connect to mongod and config server failed

Hi All,

After pbm backup is completed, I execute pbm restore. Since I can’t mount a NFS system, I copy both backup files of config server and mongod to all of three nodes. The following is my enviroment"
server 1: mongod port 27018, config server 27019, mongos 27020
server 2: mongod port 27018, config server 27019, mongos 27020
server 2: mongod port 27018, config server 27019, mongos 27020
For each server, there are two pbm-agent instances to access mongod and config server. The backup folder is “/backup”. I copied backupset of config server and mongod to “/backup/” folder of each server.

I type “pbm restore backupname”. The pbm status shows the following message:

Cluster:
========
configsvr:
  - configsvr/192.168.1.40:27019: pbm-agent  FAILED status:
      > ERROR with ERROR: lost agent, last heartbeat: 1663295876
  - configsvr/192.168.1:27019: pbm-agent  FAILED status:
      > ERROR with ERROR: lost agent, last heartbeat: 1663295876
  - configsvr/192.168.10.40:27019: pbm-agent  FAILED status:
      > ERROR with ERROR: lost agent, last heartbeat: 1663295872
stb1:
  - stb1/192.168.1.40:27018: pbm-agent  FAILED status:
      > ERROR with ERROR: lost agent, last heartbeat: 1663295872
  - stb1/192.168.1.41:27018: pbm-agent  FAILED status:
      > ERROR with ERROR: lost agent, last heartbeat: 1663295871
  - stb1/192.168.10.40:27018: pbm-agent  FAILED status:
      > ERROR with ERROR: lost agent, last heartbeat: 1663295873


PITR incremental backup:
========================
Status [OFF]

Currently running:
==================
(none)

Backups:
========
FS  /backup
  Snapshots:
    2022-09-16T01:47:12Z 14.69GB <physical> [complete: 2022-09-16T01:47:07Z]
    2022-09-16T01:18:26Z 18.73GB <physical> [complete: 2022-09-16T01:18:25Z]
    2022-09-14T07:32:39Z 0.00B <physical> [ERROR: upload file `/usr/local/mongodb-configsvr/collection-56-8997624888314356377.wt`: upload file: read data: open file for reading: open /usr/local/mongodb-configsvr/collection-56-8997624888314356377.wt: permission denied.] [2022-09-14T07:32:44Z]

I check the pbm logs:

2022-09-16T02:35:49Z I [stb1/192.168.10.40:27018] [restore/2022-09-16T01:47:12Z] restore started
2022-09-16T02:35:49Z I [configsvr/192.168.10.40:27019] [restore/2022-09-16T01:47:12Z] restore started
2022-09-16T02:35:50Z I [stb1/192.168.10.40:27018] [restore/2022-09-16T01:47:12Z] moving to state running
2022-09-16T02:35:50Z I [configsvr/192.168.10.40:27019] [restore/2022-09-16T01:47:12Z] moving to state running
2022-09-16T02:36:37Z I [stb1/192.168.1.40:27018] starting PITR routine
2022-09-16T02:36:37Z I [stb1/192.168.1.40:27018] pbm-agent:
Version:   1.8.1
Platform:  linux/amd64
GitCommit: e3f8f2c535b3a8b69a54a86fc9271d3e6de5ff80
GitBranch: release-1.8.1
BuildTime: 2022-07-04_08:15_UTC
GoVersion: go1.16.9
2022-09-16T02:36:37Z I [stb1/192.168.1.40:27018] node: stb1/192.168.1.40:27018
2022-09-16T02:36:37Z I [stb1/192.168.1.40:27018] listening for the commands
2022-09-16T02:36:41Z I [configsvr/192.168.1.40:27019] pbm-agent:
Version:   1.8.1
Platform:  linux/amd64
GitCommit: e3f8f2c535b3a8b69a54a86fc9271d3e6de5ff80
GitBranch: release-1.8.1
BuildTime: 2022-07-04_08:15_UTC
GoVersion: go1.16.9
2022-09-16T02:36:41Z I [configsvr/192.168.1.40:27019] starting PITR routine
2022-09-16T02:36:41Z I [configsvr/192.168.1.40:27019] node: configsvr/192.168.1.40:27019
2022-09-16T02:36:41Z I [configsvr/192.168.1.40:27019] listening for the commands
2022-09-16T02:37:56Z I [stb1/192.168.1.40:27018] got command restore [name: 2022-09-16T02:37:55.934711134Z, backup name: 2022-09-16T01:47:12Z] <ts: 1663295875>
2022-09-16T02:37:56Z I [stb1/192.168.1.40:27018] got epoch {1663292827 180}
2022-09-16T02:37:56Z I [configsvr/192.168.1.40:27019] got command restore [name: 2022-09-16T02:37:55.934711134Z, backup name: 2022-09-16T01:47:12Z] <ts: 1663295875>
2022-09-16T02:37:56Z I [configsvr/192.168.1.40:27019] got epoch {1663292827 180}
2022-09-16T02:37:57Z I [stb1/192.168.1.40:27018] [restore/2022-09-16T01:47:12Z] restore started
2022-09-16T02:37:57Z I [configsvr/192.168.1.40:27019] [restore/2022-09-16T01:47:12Z] restore started
2022-09-16T02:37:58Z I [stb1/192.168.1.40:27018] [restore/2022-09-16T01:47:12Z] moving to state running
2022-09-16T02:37:58Z I [configsvr/192.168.1.40:27019] [restore/2022-09-16T01:47:12Z] moving to state running

Is the pbm status incorrect? How to check the pbm restore process status in my case?

Thanks,
Dillon

Aha, after I created a NFS mount point to share the backup between three servers. The restore operation worked.

I guess something such as signal information is written under backup folder. And yes, I see a “.pbm.restore” folder. There are some files show the progress status. So, I just copied backupset to all three servers broken the operation.

1 Like