Can't restore a Physical Backup using PBM 2.0.5

Hi,

In a POC environment, I’m trying to restore a physical backup taken on a different replica set cluster with the same replica set name, but I get some errors.
For context, I have a 3-node replica set cluster, let’s call it rsABC, where I set up and configured PBM. There I was able to run physical backups successfully and stream them to a remote storage account. Then, in a different 3-node replica set cluster with the same replica set name, I set up and configured PBM and pointed the storage to the same remote storage account. After running pbm config --force-resync, all the backups were available to restore. I don’t have mongos or arbiter nodes; the configuration is 1 primary and 2 secondaries (1 of them hidden). Then I started a physical restore based on a complete physical backup.
During the restore, several request error messages appeared in the pbm-agent journal, related to PUT requests of .pbm.restore files, with the response error ServiceCode=InvalidBlockList. This, per se, didn’t seem to abort the restore process. However, at the end of the process, after the preparing data phase, the whole dbpath is cleaned up and the restored data is removed. There is also a message stating mongod process: exit status 100. There are no other errors in the journal.
When checking describe-restore, I see odd error messages such as the following, for one of the nodes:

      terminating / 2023-04-13T15:57:00.453+00:00, connect err: ping: server selection
      error: server selection timeout, current topology: { Type: Single, Servers:
      [{ Addr: localhost:27527, Type: Unknown, Last error: dial tcp 127.0.0.1:27527:
      connect: connection refused }, ] }

I’m using port 27017 for MongoDB, which is accessible, so I don’t understand why PBM uses a different port on each node, such as 27527.

You can find additional logging and details, both regarding describe-restore and pbm-agent journal, in:
pbm_physical_restore_failure.txt (50.5 KB)

Each node runs in a VM with:

  • PSMDB 4.4.9-10
  • CentOS 7.9
  • PBM 2.0.5

Do you have any clue of what I may be doing wrong or if this may be a bug?
Should I create a Jira ticket?

Thanks in advance.
Kind regards,
João Soares

Hi,

During a physical restore, pbm-agent stops mongod, cleans up the data directory, and copies data from the storage onto every node. After the data is copied, PBM restarts the database several times to finish the restore process. Different ‘random’ ports are used for these mongod starts so that they don’t conflict with another mongod process that, for some reason, may be running on the same host. The logs for these mongod starts are saved to the ‘pbm.restore.log’ file inside the dbpath directory.

Since in your case mongod couldn’t start after the ‘preparing data’ phase, could you please check pbm.restore.log, which should be located in /data-path/mongodb/rsABC-1/, for more details?
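
As a rough aid for that check, the Python sketch below pulls error and fatal entries out of a mongod log. It assumes that pbm.restore.log uses the structured JSON format that mongod 4.4 writes (one JSON object per line, with `s` holding the severity and `msg` the message). The function name `find_severe_entries` and the sample lines are hypothetical and are not taken from the attached logs.

```python
import json


def find_severe_entries(lines):
    """Return (timestamp, severity, message) tuples for error/fatal entries.

    Assumption: each line is a mongod 4.4 structured log entry, i.e. one
    JSON object with "s" as severity ("F" fatal, "E" error, "W" warning,
    "I" informational). Lines that are not JSON objects are skipped.
    """
    severe = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a structured log line
        if not isinstance(entry, dict):
            continue
        if entry.get("s") in ("E", "F"):
            t = entry.get("t")
            timestamp = t.get("$date", "") if isinstance(t, dict) else ""
            severe.append((timestamp, entry["s"], entry.get("msg", "")))
    return severe


# Hypothetical sample lines, made up for illustration only.
SAMPLE = [
    '{"t":{"$date":"2023-04-13T15:56:58.000+00:00"},"s":"I","msg":"illustrative startup message"}',
    '{"t":{"$date":"2023-04-13T15:56:59.000+00:00"},"s":"E","msg":"illustrative error message"}',
    'plain text line that is not JSON',
    '{"t":{"$date":"2023-04-13T15:57:00.000+00:00"},"s":"F","msg":"illustrative fatal message"}',
]

for timestamp, severity, message in find_severe_entries(SAMPLE):
    print(timestamp, severity, message)
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `find_severe_entries(open(path))`, where `path` points at the pbm.restore.log file inside the dbpath directory.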

Hi,

I forgot to send the pbm.restore.log in the first message, sorry.
In order to have fresh logs I started a new restore and you can find all the details in the file below. The logs are from the primary node.
pbm_physical_restore_failure_2.txt (122.1 KB)
I only have one mongod process running in each VM and it is the one I’m using with PBM.

PBM stops the mongod process at the beginning of the restore, and it should remain stopped until the restore is finished. However, it seems that mongod was started in the middle of the process, even before the data was fully copied. Do you have any kind of automation that might restart mongod, or maybe a “restart” clause in the systemd unit file?
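
One way to check for such a restart clause is sketched below in Python. It scans the text of a systemd unit (including any drop-ins) for `Restart=` directives in the `[Service]` section. The helper names `restart_policies` and `may_auto_restart` and the sample drop-in are hypothetical; the sample is not the configuration from this thread.

```python
def restart_policies(unit_text):
    """Return the values of all Restart= directives found under [Service].

    Comment lines (starting with '#' or ';') and other keys such as
    RestartSec= are ignored.
    """
    policies = []
    section = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        key, sep, value = line.partition("=")
        if sep and section == "Service" and key.strip() == "Restart":
            policies.append(value.strip())
    return policies


def may_auto_restart(unit_text):
    """Return True if the last Restart= value is anything other than 'no'.

    Assumption: the text lists the main unit first and drop-ins after it,
    so the last assignment is the one that takes effect. With no Restart=
    at all, systemd's default of Restart=no applies.
    """
    policies = restart_policies(unit_text)
    return bool(policies) and policies[-1] != "no"


# Hypothetical drop-in, for illustration only.
SAMPLE_DROP_IN = """\
# example drop-in file
[Service]
Restart=on-failure
RestartSec=5
"""

print(restart_policies(SAMPLE_DROP_IN))
print(may_auto_restart(SAMPLE_DROP_IN))
```

The unit text could be obtained with `systemctl cat mongod`, which prints the unit file followed by its drop-ins, assuming the service is named mongod on the host.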

Hi,

You are absolutely right. I have a drop-in that restarts the mongod service when it’s down (unless it was manually stopped). That must be it. I will give it a try and update this thread afterwards.

Thanks a lot for your help.

Hi,

It worked perfectly.
Thank you again.

Kind regards,
João Soares

Glad that it worked, happy to help!

I am also getting an error during restore. Copying the files from remote storage to the MongoDB pod was successful, but when mongod started, it failed with an error as shown below. I use a MongoDB cluster of the same version for backup and restore. Can you help, @Sandra_Romanchenko?
I’m using:

Hi @dung-tien-nguyen ,

Have you asked in the MongoDB Operator category?

Before asking here, you have to make sure you did everything as documented for the operator and that it is not an issue with the operator itself. We are not competent enough in every aspect of the operator; in that category, the operator developers can answer you more quickly and simply.
Also, if it is an issue of PBM, the team will inform us, and we will better understand the problem.

One more request: could you please create a new topic instead of “reusing” this one? Otherwise, other people will be confused about what the original issue was and how it was resolved.

Best regards

Thanks for your response, I will create a new topic for this problem.
Have a nice day!