Hi,
In a POC environment, I’m trying to restore a physical backup taken on a different replica set cluster that has the same replica set name, but I’m getting some errors.
To give some context: I have a 3-node replica set cluster, let’s call it rsABC, where I set up and configured PBM. There I was able to run physical backups successfully and stream them to a remote storage account. Then, in a different 3-node replica set cluster with the same replica set name, I set up and configured PBM and pointed its storage to the same remote storage account. After running pbm --force-resync, all backups became available to restore. I don’t have mongos or arbiter nodes; the configuration is 1 primary and 2 secondaries (1 of them hidden). Then I started a physical restore based on a complete physical backup.
During the restore, several request error messages appeared in the pbm-agent journal. They relate to PUT requests of .pbm.restore files and carry the response error ServiceCode=InvalidBlockList. This, per se, didn’t seem to abort the restore process. However, at the end of the process, after the preparing data phase, the whole dbpath is cleaned up and the restored data is removed. There is also a message stating mongod process: exit status 100. There are no other errors in the journal.
When checking describe-restore, there are odd error messages. For example, for one of the nodes:
terminating / 2023-04-13T15:57:00.453+00:00, connect err: ping: server selection
error: server selection timeout, current topology: { Type: Single, Servers:
[{ Addr: localhost:27527, Type: Unknown, Last error: dial tcp 127.0.0.1:27527:
connect: connection refused }, ] }
I’m using port 27017 for MongoDB, which is accessible, so I don’t understand why the restore is using a different port, such as 27527, for each node.
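To make the port mismatch easier to spot across all nodes, here is a minimal Python sketch that pulls the Addr entries out of the describe-restore error text and compares them with the port I expect. The helper name extract_addrs and the constant names are my own and not part of PBM. The error text is the message quoted above, joined onto one line.

```python
import re

# Error text copied from the describe-restore output quoted above,
# joined onto a single line.
ERROR_TEXT = (
    "terminating / 2023-04-13T15:57:00.453+00:00, connect err: ping: "
    "server selection error: server selection timeout, current topology: "
    "{ Type: Single, Servers: [{ Addr: localhost:27527, Type: Unknown, "
    "Last error: dial tcp 127.0.0.1:27527: connect: connection refused }, ] }"
)

# Port my mongod instances normally listen on (from my own setup).
CONFIGURED_PORT = 27017


def extract_addrs(text):
    """Return (host, port) pairs for every 'Addr: host:port' entry in text."""
    pairs = re.findall(r"Addr:\s*([^\s:,]+):(\d+)", text)
    return [(host, int(port)) for host, port in pairs]


for host, port in extract_addrs(ERROR_TEXT):
    relation = "matches" if port == CONFIGURED_PORT else "differs from"
    print(f"{host}:{port} {relation} the configured port {CONFIGURED_PORT}")
```

Running the same check against the error text of the other nodes would show whether each of them reports a different non-27017 port.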
You can find additional logging and details, both regarding describe-restore and pbm-agent journal, in:
pbm_physical_restore_failure.txt (50.5 KB)
Each node runs in a VM with:
- PSMDB 4.4.9-10
- CentOS 7.9
- PBM 2.0.5
Do you have any clue of what I may be doing wrong or if this may be a bug?
Should I create a Jira ticket?
Thanks in advance.
Kind regards,
João Soares