Restore process hangs

I’ve started a sample restore, and it simply hung.

Details:

According to pbm status, the restore started today at 09:51, local time.

[feng.sian@DSAGIMOGDBSP01 ~]$ pbm status
...
Currently running:
==================
Snapshot restore "2021-02-25T18:43:59Z", started at 2021-02-26T12:51:27Z. Status: snapshot restore. [op id: 6038eecfe36b8a6ceed1e538]

Backups:
========
FS  /data/backup/DEV
  Snapshots:
    2021-02-26T12:50:05Z 35.42KB [complete: 2021-02-26T12:50:25]
    2021-02-25T19:28:44Z 22.48KB [complete: 2021-02-25T19:29:04]
    2021-02-25T18:43:59Z 18.19KB [complete: 2021-02-25T18:44:20]
    2021-02-25T12:34:44Z 14.48KB [complete: 2021-02-25T12:35:05]
    2021-02-25T11:04:20Z 11.98KB [complete: 2021-02-25T11:04:40]
    2021-02-24T23:23:03Z 8.92KB [complete: 2021-02-24T23:23:23]
  PITR chunks:
    2021-02-25T19:29:04 - 2021-02-26T01:01:04 320.95KB
    2021-02-25T18:44:20 - 2021-02-25T19:29:03 51.45KB
    2021-02-25T12:35:05 - 2021-02-25T18:44:19 403.16KB
    2021-02-25T11:04:40 - 2021-02-25T12:35:03 47.64KB

It is already 19:33, and pbm status still reports the restore as currently running.

Checking the pbm logs:

...
2021-02-26T12:50:05Z I [rs-dev-00/DSAGIMOGDBSP01.axabr.ad:27001] got epoch {1614301247 1}
2021-02-26T12:50:22Z I [rs-dev-00/DSAGIMOGDBSP01.axabr.ad:27001] [backup/2021-02-26T12:50:05Z] backup started
2021-02-26T12:50:25Z I [rs-dev-00/DSAGIMOGDBSP01.axabr.ad:27001] [backup/2021-02-26T12:50:05Z] mongodump finished, waiting for the oplog
2021-02-26T12:50:30Z I [rs-dev-00/DSAGIMOGDBSP01.axabr.ad:27001] [backup/2021-02-26T12:50:05Z] backup finished
2021-02-26T12:51:27Z I [rs-dev-00/DSAGIMOGDBSP01.axabr.ad:27001] got command restore [name: 2021-02-26T12:51:27.166184056Z, backup name: 2021-02-25T18:43:59Z] <ts: 1614343887>
2021-02-26T12:51:27Z I [rs-dev-00/DSAGIMOGDBSP01.axabr.ad:27001] got epoch {1614343830 1}
2021-02-26T12:51:27Z I [rs-dev-00/DSAGIMOGDBSP01.axabr.ad:27001] [restore/2021-02-25T18:43:59Z] restore started
[feng.sian@DSAGIMOGDBSP01 ~]$

Nothing was printed after that last line.

Checking the journalctl -u pbm-agent:

...
Feb 26 09:51:27 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:27.000-0300 I [restore/2021-02-25T18:43:59Z] restore started
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.585-0300        connected to node type: replset
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.595-0300        archive prelude admin.pbmLog
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.595-0300        archive prelude admin.pbmOpLog
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.595-0300        archive prelude admin.pbmBackups
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.595-0300        archive prelude config.system.indexBuilds
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.596-0300        archive prelude wmISMonitoring.isServerStats
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.596-0300        archive prelude admin.pbmConfig
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.596-0300        archive prelude admin.pbmCmd
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.596-0300        archive prelude admin.pbmRestores
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.596-0300        archive prelude admin.pbmPITRChunks
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.596-0300        archive prelude admin.pbmLock
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.596-0300        archive prelude admin.pbmAgents
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.596-0300        archive prelude admin.pbmLockOp
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.596-0300        archive prelude admin.system.users
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.596-0300        archive prelude admin.system.roles
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.596-0300        archive prelude admin.system.version
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.596-0300        archive format version "0.1"
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.596-0300        archive server version "4.4.1-3"
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.596-0300        archive tool version "0.0.1"
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.596-0300        setting number of parallel collections to number of parallel collections in archive (2)
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        preparing collections to restore from
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        skipping restoring admin.pbmLog, it is excluded
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        skipping restoring admin.pbmLog metadata, it is excluded
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        found collection admin.pbmOpLog bson to restore to admin.pbmOpLog
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        found collection metadata from admin.pbmOpLog to restore to admin.pbmOpLog
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        skipping restoring admin.pbmBackups, it is excluded
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        skipping restoring admin.pbmBackups metadata, it is excluded
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        skipping restoring admin.pbmConfig, it is excluded
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        skipping restoring admin.pbmConfig metadata, it is excluded
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        skipping restoring admin.pbmCmd, it is excluded
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        skipping restoring admin.pbmCmd metadata, it is excluded
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        skipping restoring admin.pbmRestores, it is excluded
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        skipping restoring admin.pbmRestores metadata, it is excluded
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        skipping restoring admin.pbmPITRChunks, it is excluded
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        skipping restoring admin.pbmPITRChunks metadata, it is excluded
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        skipping restoring admin.pbmLock, it is excluded
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        skipping restoring admin.pbmLock metadata, it is excluded
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        found collection admin.pbmAgents bson to restore to admin.pbmAgents
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        found collection metadata from admin.pbmAgents to restore to admin.pbmAgents
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        found collection admin.pbmLockOp bson to restore to admin.pbmLockOp
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        found collection metadata from admin.pbmLockOp to restore to admin.pbmLockOp
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        found collection admin.system.users bson to restore to admin.pbmRUsers
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        found collection metadata from admin.system.users to restore to admin.pbmRUsers
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        found collection admin.system.roles bson to restore to admin.pbmRRoles
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        found collection metadata from admin.system.roles to restore to admin.pbmRRoles
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        skipping restoring admin.system.version, it is excluded
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        skipping restoring admin.system.version metadata, it is excluded
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        found collection config.system.indexBuilds bson to restore to config.system.indexBuilds
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        found collection metadata from config.system.indexBuilds to restore to config.system.indexBuilds
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        found collection wmISMonitoring.isServerStats bson to restore to wmISMonitoring.isServerStats
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.599-0300        found collection metadata from wmISMonitoring.isServerStats to restore to wmISMonitoring.isServerStats
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.608-0300        first non special collection admin.system.users found. The demultiplexer will handle it and the remainder
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.608-0300        restoring up to 2 collections in parallel
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.612-0300        collection admin.pbmRUsers doesn't exist, skipping drop command
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.612-0300        reading metadata for admin.pbmRUsers from archive on stdin
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.612-0300        creating collection admin.pbmRUsers using options from metadata
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.629-0300        restoring admin.pbmRUsers from archive on stdin
Feb 26 09:51:30 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:30.631-0300        using 10 insertion workers
Feb 26 09:51:31 DSAGIMOGDBSP01 pbm-agent[107868]: 2021-02-26T09:51:31.083-0300        restoring indexes for collection admin.pbmRUsers from metadata
[feng.sian@DSAGIMOGDBSP01 ~]$

So… what is going on?

Hi Feng.

My first guess is that the filesystem storage location /data/backup/DEV isn't a shared network drive. During a restore, the pbm-agent on the primary reads the files and performs the inserts. If that node has a /data/backup/DEV directory but it is empty, it can't do the restore.

But I agree the logs don't give you enough information to see where things stand.

But first, could you confirm whether the /data/backup/DEV directory on the primary node was empty or not?

Hi @Akira_Kurogane, thanks for your response.

Your guess is correct: /data/backup/DEV is indeed not a shared network drive. Rather, it is a local folder; I was planning to test the whole backup and restore process before going through the "trouble" of requesting an actual network share.
Does this (not being a shared network folder) make a difference, i.e., would it prevent pbm from finishing a restore?

Regarding the hypothesis that the /data/backup/DEV folder is empty: it is not.

[feng.sian@DSAGIMOGDBSP01 ~]$ ll /data/backup/DEV/
total 168
-rw-rw-r--. 1 pbm pbm  1350 Feb 24 20:23 2021-02-24T23:23:03Z.pbm.json
-rw-rw-r--. 1 pbm pbm  9136 Feb 24 20:23 2021-02-24T23:23:03Z_rs-dev-00.dump.s2
-rw-rw-r--. 1 pbm pbm   821 Feb 24 20:23 2021-02-24T23:23:03Z_rs-dev-00.oplog.s2
-rw-rw-r--. 1 pbm pbm  1350 Feb 25 08:04 2021-02-25T11:04:20Z.pbm.json
-rw-rw-r--. 1 pbm pbm 12267 Feb 25 08:04 2021-02-25T11:04:20Z_rs-dev-00.dump.s2
-rw-rw-r--. 1 pbm pbm   818 Feb 25 08:04 2021-02-25T11:04:20Z_rs-dev-00.oplog.s2
-rw-rw-r--. 1 pbm pbm  1387 Feb 25 09:35 2021-02-25T12:34:44Z.pbm.json
-rw-rw-r--. 1 pbm pbm 14827 Feb 25 09:35 2021-02-25T12:34:44Z_rs-dev-00.dump.s2
-rw-rw-r--. 1 pbm pbm   859 Feb 25 09:35 2021-02-25T12:34:44Z_rs-dev-00.oplog.s2
-rw-rw-r--. 1 pbm pbm  1387 Feb 25 15:44 2021-02-25T18:43:59Z.pbm.json
-rw-rw-r--. 1 pbm pbm 18625 Feb 25 15:44 2021-02-25T18:43:59Z_rs-dev-00.dump.s2
-rw-rw-r--. 1 pbm pbm   858 Feb 25 15:44 2021-02-25T18:43:59Z_rs-dev-00.oplog.s2
-rw-rw-r--. 1 pbm pbm  1387 Feb 25 16:29 2021-02-25T19:28:44Z.pbm.json
-rw-rw-r--. 1 pbm pbm 23020 Feb 25 16:29 2021-02-25T19:28:44Z_rs-dev-00.dump.s2
-rw-rw-r--. 1 pbm pbm  1806 Feb 25 16:29 2021-02-25T19:28:44Z_rs-dev-00.oplog.s2
-rw-rw-r--. 1 pbm pbm  1387 Feb 26 09:50 2021-02-26T12:50:05Z.pbm.json
-rw-rw-r--. 1 pbm pbm 36268 Feb 26 09:50 2021-02-26T12:50:05Z_rs-dev-00.dump.s2
-rw-rw-r--. 1 pbm pbm   863 Feb 26 09:50 2021-02-26T12:50:05Z_rs-dev-00.oplog.s2
drwxr-xr-x. 3 pbm pbm    23 Feb 25 09:35 pbmPitr
[feng.sian@DSAGIMOGDBSP01 ~]$

Looking at another post (Help getting past download step on restore - #3 by trickshotcb, in matthewb's reply), where he mentions:

…download the backup… to your local machine first, then use pbm to restore that backup from local disk…

I understand that it makes no difference to pbm whether it restores from a locally stored file or from a shared network folder.

That said, what can I do to figure out what is happening? Is there any query I could run on MongoDB to check whether the restore process has finished?

One additional piece of info: this MongoDB deployment [where I am testing pbm] is a replica set with only one node.
Does this make any difference at all?

Yes, in the general case. A backup might be taken by the node on server A, but if server B is the primary at the time of the restore, then the pbm-agent on server B will have only an empty directory to read from.

But … now I see this update:

One additional piece of info: this MongoDB deployment [where I am testing pbm] is a replica set with only one node.
Does this make any difference at all?

Yes. There is only one node; it must be the primary, so that backup directory must have been filled (and your "ls" output confirms it).

So we're back to the start: the activity to re-insert collection data has been paused far too long for a database that takes only a minute to back up. I don't have an answer yet; you've already done what I would do, which is 1) look at "pbm status" and then 2) look at "pbm logs" (or run journalctl -u pbm-agent directly on all servers).

It seems it is time to look at the state of the operations and locks in the pbm control collections. These should already be reflected in the output of "pbm status", but something isn't right.

The PBM control collections are the pbm* ones in the "admin" db of the non-sharded replica set (as in your current case), or of the config server replica set if it is a sharded cluster.
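For reference, those control collections can be inspected directly from the mongo shell. A rough sketch (collection names are taken from the pbm-agent log above; the port matches the node in the log, and credentials/host details are assumptions you'd adjust for your deployment):

```shell
# Look for lock documents still held by the stuck operation:
mongo --port 27001 --eval 'db.getSiblingDB("admin").pbmLock.find().pretty()'

# Inspect the recorded state of the most recent restore
# (the "status" field shows where it stopped):
mongo --port 27001 --eval 'db.getSiblingDB("admin").pbmRestores.find().sort({start: -1}).limit(1).pretty()'
```

A stale document in pbmLock, or a pbmRestores document stuck in a non-terminal status, would point at where the operation stalled.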

I'll run this same test on another environment that has a 3-member replica set, and I'll share the result.

@feng.sian What version of pbm are you on? I feel like I had more problems with 1.4.1, so I reverted to 1.4.0 and it worked better.
Also, check the pbmLock collection. I was running into issues where the locks weren’t being cleared out from previous commands. Version 1.4.1 introduced additional lock handling, which may have inadvertently caused problems. I haven’t researched enough to say for sure there is a problem with 1.4.1, but it’s worth investigating.
And, in my tests, restoring to a single-node replica set always worked better than to a multi-node (in my case, 3-node) replica set.

@trickshotcb thanks for the heads-up!
I am using pbm 1.4.1. I'll give version 1.4.0 a try and share the results.

One other thing came up: issuing an index creation also hangs (in this situation of a one-node replica set).

This doesn't happen if mongod is running as a standalone or as a replica set with more than one member. This makes me wonder whether the pbm restore is also limited by this constraint.
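One way to see whether an index build (or the restore's inserts) is actually stuck, rather than just slow, is db.currentOp(). A sketch (port taken from the logs above; adjust for your deployment):

```shell
# List in-progress index builds on this node; an entry that never
# disappears, with no progress, indicates a hung build:
mongo --port 27001 --eval 'db.currentOp({"command.createIndexes": {$exists: true}})'

# Or list all active operations to spot the restore's insert workers:
mongo --port 27001 --eval 'db.currentOp({active: true})'
```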

Currently I'm waiting for the shared folder to finally be set up between the servers so I can give the pbm backup & restore another try.

As soon as I have any news, I'll post back here.
