We have tried the initial sync procedure and shared the output in the Mongod-failure-after-restore topic. We are also using PBM for backup and restore operations.
We are using cloud infrastructure running on Ubuntu with three servers: one primary and two secondaries. We are running Percona Backup for MongoDB (PBM) v2.10, with Amazon S3 as the backup storage destination.
However, when we attempt a restore, the mongod service fails even though the restore is reported as successful; after that, all three mongod services fail. I am trying to restore the data on a secondary server only.
Command used for backup:
pbm backup --type=physical
Command used for restore:
pbm restore --time="2025-09-24T13:30:00Z"
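For reference, a minimal sketch of the PBM storage configuration behind these commands, assuming S3 with placeholder values for the region, bucket, prefix, and credentials (applied with pbm config --file):

# placeholders only; substitute real values before applying
cat > /tmp/pbm_config.yaml <<'EOF'
storage:
  type: s3
  s3:
    region: us-east-1
    bucket: my-pbm-bucket
    prefix: pbm/backups
    credentials:
      access-key-id: <AWS_ACCESS_KEY_ID>
      secret-access-key: <AWS_SECRET_ACCESS_KEY>
EOF
pbm config --file /tmp/pbm_config.yaml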
Thanks for the explanation and the extra context - this definitely helps. Why do you want to use PBM instead of initial sync to "rebuild/recover" one of the three nodes? As @Ivan_Groenewold already wrote in the other topic, PBM is designed to recover all nodes rather than a specific one.
We are taking incremental backups to S3. When we initiate a backup from the secondary server, it completes successfully. However, when we try to restore the data to the secondary server, all mongod processes on both the primary and the secondaries fail. Additionally, we are unable to perform this backup and restore activity using the Initial Sync process.
Through the Initial Sync process, we cannot take or restore database backups via S3, as it is a MongoDB replication-level rebuild mechanism, not a PBM-based backup/restore method.
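For reference, incremental physical backups in PBM are chained off a base backup; a minimal sketch of the commands, assuming PBM 2.x flags:

pbm backup --type=incremental --base   # first run: creates the base incremental backup
pbm backup --type=incremental          # later runs: store only the changes since the previous backup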
Right, you cannot take a "backup" with initial sync... but why would you?
Can we take one step back and clarify the use case you are trying to implement? Why do you want to restore just one out of the three nodes? Is something broken, or is this a "recovery" procedure for any node? Something else?
Right now, we are performing a POC with the goal of taking a full data backup using PBM. After that, we plan to enable incremental PITR to take hourly backups. During the POC, we successfully pushed the data to S3 through PBM. However, when we tried to restore the same data after enabling PITR, the restore completed successfully, but all mongod processes on both the primary and secondary servers failed. We have also configured a replica set between the primary and secondary servers to ensure data synchronization.
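For completeness, a rough sketch of the PITR-related steps in such a POC, assuming the standard PBM CLI and an example timestamp:

pbm config --set pitr.enabled=true        # start oplog slicing for point-in-time recovery
pbm status                                # shows completed backups and valid PITR time ranges
pbm config --set pitr.enabled=false       # oplog slicing must be stopped before running a restore
pbm restore --time="2025-10-30T07:00:00Z" # restore to a point covered by a backup plus oplog slices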
The procedure I followed for the restore was: I removed the MongoDB data directory, after which the mongod service failed to start. I started it again and then executed the restore command pbm restore 2025-10-30T07:01:38Z. A few minutes later, the mongod service went inactive. I've also attached the last few log lines below (a command-level sketch of these steps follows the log excerpt).
2025-10-30T05:51:17.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] download stat: buf 536870912, arena 268435456, span 33554432, spanNum 8, cc 2, [{2 0} {2 0}]
2025-10-30T05:51:17.000+0000 I [restore/2025-10-30T05:38:05.927063022Z] preparing data
2025-10-30T05:51:35.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] oplogTruncateAfterPoint: {1761737952 1}
2025-10-30T05:51:37.000+0000 I [restore/2025-10-30T05:38:05.927063022Z] recovering oplog as standalone
2025-10-30T05:51:54.000+0000 I [restore/2025-10-30T05:38:05.927063022Z] clean-up and reset replicaset config
2025-10-30T05:52:06.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] uploading ".pbm.restore/2025-10-30T05:38:05.927063022Z/rs.poc/node.10.200.10.98:27018.hb" [size hint: 10 (10.00B); part size: 10485760 (10.00MB)]
2025-10-30T05:52:06.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] uploading ".pbm.restore/2025-10-30T05:38:05.927063022Z/rs.poc/rs.hb" [size hint: 10 (10.00B); part size: 10485760 (10.00MB)]
2025-10-30T05:52:06.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] uploading ".pbm.restore/2025-10-30T05:38:05.927063022Z/cluster.hb" [size hint: 10 (10.00B); part size: 10485760 (10.00MB)]
2025-10-30T05:52:12.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] dropping 'admin.pbmAgents'
2025-10-30T05:52:12.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] dropping 'admin.pbmBackups'
2025-10-30T05:52:12.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] dropping 'admin.pbmRestores'
2025-10-30T05:52:12.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] dropping 'admin.pbmCmd'
2025-10-30T05:52:12.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] dropping 'admin.pbmPITRChunks'
2025-10-30T05:52:12.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] dropping 'admin.pbmPITR'
2025-10-30T05:52:12.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] dropping 'admin.pbmOpLog'
2025-10-30T05:52:12.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] dropping 'admin.pbmLockOp'
2025-10-30T05:52:12.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] dropping 'admin.pbmLock'
2025-10-30T05:52:12.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] dropping 'admin.pbmLock'
2025-10-30T05:52:12.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] dropping 'admin.pbmLog'
2025-10-30T05:52:14.000+0000 I [restore/2025-10-30T05:38:05.927063022Z] restore on node succeed
2025-10-30T05:52:14.000+0000 I [restore/2025-10-30T05:38:05.927063022Z] moving to state done
2025-10-30T05:52:14.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] uploading ".pbm.restore/2025-10-30T05:38:05.927063022Z/rs.poc/node.10.200.10.98:27018.done" [size hint: 10 (10.00B); part size: 10485760 (10.00MB)]
2025-10-30T05:52:14.000+0000 I [restore/2025-10-30T05:38:05.927063022Z] waiting for done status in rs map[.pbm.restore/2025-10-30T05:38:05.927063022Z/rs.poc/node.10.200.10.104:27018:{} .pbm.restore/2025-10-30T05:38:05.927063022Z/rs.poc/node.10.200.10.94:27018:{} .pbm.restore/2025-10-30T05:38:05.927063022Z/rs.poc/node.10.200.10.98:27018:{}]
2025-10-30T05:52:19.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] uploading ".pbm.restore/2025-10-30T05:38:05.927063022Z/rs.poc/rs.done" [size hint: 10 (10.00B); part size: 10485760 (10.00MB)]
2025-10-30T05:52:19.000+0000 I [restore/2025-10-30T05:38:05.927063022Z] waiting for shards map[.pbm.restore/2025-10-30T05:38:05.927063022Z/rs.poc/rs:{}]
2025-10-30T05:52:24.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] uploading ".pbm.restore/2025-10-30T05:38:05.927063022Z/cluster.done" [size hint: 10 (10.00B); part size: 10485760 (10.00MB)]
2025-10-30T05:52:24.000+0000 I [restore/2025-10-30T05:38:05.927063022Z] waiting for cluster
2025-10-30T05:52:29.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] converged to state done
2025-10-30T05:52:29.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] uploading ".pbm.restore/2025-10-30T05:38:05.927063022Z/rs.poc/stat.10.200.10.98:27018" [size hint: 73 (73.00B); part size: 10485760 (10.00MB)]
2025-10-30T05:52:29.000+0000 I [restore/2025-10-30T05:38:05.927063022Z] writing restore meta
2025-10-30T05:52:29.000+0000 W [restore/2025-10-30T05:38:05.927063022Z] meta .pbm.restore/2025-10-30T05:38:05.927063022Z.json already exists, trying write done status with ''
2025-10-30T05:52:29.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] rm tmp conf
2025-10-30T05:52:29.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] wait for cluster status
2025-10-30T05:52:34.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] no cleanup strategy to apply
2025-10-30T05:52:34.000+0000 I [restore/2025-10-30T05:38:05.927063022Z] recovery successfully finished
2025-10-30T05:52:34.000+0000 I change stream was closed
2025-10-30T05:52:34.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] hearbeats stopped
2025-10-30T05:52:34.000+0000 I Exit:
2025-10-30T05:52:34.000+0000 D [agentCheckup] deleting agent status
2025-10-30T05:53:05.000+0000 E Exit: connect to PBM: create mongo connection: ping: server selection error: server selection timeout, current topology: { Type: Unknown, Servers: [{ Addr: 127.0.0.1:27018, Type: Unknown, Last error: dial tcp 127.0.0.1:27018: connect: connection refused }, ] }
2025-10-30T05:53:35.000+0000 E Exit: connect to PBM: create mongo connection: ping: server selection error: server selection timeout, current topology: { Type: Unknown, Servers: [{ Addr: 127.0.0.1:27018, Type: Unknown, Last error: dial tcp 127.0.0.1:27018: connect: connection refused }, ] }
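To make the steps above concrete, here is a rough command-level reconstruction of the procedure described, with the dbPath, systemd unit names, and the post-restore steps treated as assumptions to verify against the PBM documentation for physical restores:

# on the secondary being restored (dbPath and unit names assumed)
sudo systemctl stop mongod
sudo rm -rf /var/lib/mongodb/*           # remove the MongoDB data directory contents
sudo systemctl start mongod pbm-agent    # mongod and the PBM agent must be running again

# run the restore (from any node with the pbm CLI configured)
pbm restore "2025-10-30T07:01:38Z"

# per the PBM docs, after a physical restore, on every node:
sudo systemctl restart mongod
sudo systemctl restart pbm-agent
pbm config --force-resync                # resync backup metadata from the remote storage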
This indicates that the PBM agent's connection to PSMDB was unexpectedly interrupted. Can you please provide the mongod logs from the server so we can see what happened on the server side at that time?
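If it helps, something along these lines should capture the relevant window on each node, assuming systemd-managed services and the default mongod log path (adjust names and paths to your deployment):

sudo journalctl -u mongod --since "2025-10-30 05:50:00" --until "2025-10-30 06:00:00"
sudo journalctl -u pbm-agent --since "2025-10-30 05:50:00" --until "2025-10-30 06:00:00"
sudo tail -n 500 /var/log/mongodb/mongod.log   # path assumed from systemLog.path in mongod.conf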