We have tried the initial sync procedure and shared the output in the Mongod-failure-after-restore topic. We are also using PBM for backup and restore operations.
We are using cloud infrastructure running on Ubuntu with three servers: one primary and two secondaries. We are running Percona Backup for MongoDB (PBM) v2.10, with Amazon S3 as the backup storage destination.
However, when we attempt a restore, the mongod service fails even though the restore is reported as successful; after that, all three mongod services fail. I am trying to restore the data on a secondary server only.
Command used for backup:
pbm backup --type=physical
Command used for restore:
pbm restore --time="2025-09-24T13:30:00Z"
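For reference, a minimal sketch of the PBM storage configuration behind these commands, assuming S3 with placeholder values for the region, bucket, prefix, and credentials (applied with pbm config --file):

# placeholders only; substitute real values before applying
cat > /tmp/pbm_config.yaml <<'EOF'
storage:
  type: s3
  s3:
    region: us-east-1
    bucket: my-pbm-bucket
    prefix: pbm/backups
    credentials:
      access-key-id: <AWS_ACCESS_KEY_ID>
      secret-access-key: <AWS_SECRET_ACCESS_KEY>
EOF
pbm config --file /tmp/pbm_config.yaml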
Thanks for the explanation and the extra context - this definitely helps. Why do you want to use PBM instead of initial sync to "rebuild/recover" one of the three nodes? As @Ivan_Groenewold already wrote in the other topic, PBM is designed to recover all nodes rather than a specific one.
We are taking incremental backups to S3. When we initiate a backup from the secondary server, it completes successfully. However, when we try to restore the data to the secondary server, all mongod processes on both the primary and the secondaries fail. Additionally, we are unable to perform this backup and restore activity using the Initial Sync process.
Through the Initial Sync process, we cannot take or restore database backups via S3, as it is a MongoDB replication-level rebuild mechanism, not a PBM-based backup/restore method.
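For reference, incremental physical backups in PBM are chained off a base backup; a minimal sketch of the commands, assuming PBM 2.x flags:

pbm backup --type=incremental --base   # first run: creates the base incremental backup
pbm backup --type=incremental          # later runs: store only the changes since the previous backup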
Right, you cannot take a "backup" with initial sync... but why would you?
Can we take one step back and clarify the use case you are trying to implement? Why do you want to restore just one out of the three nodes? Is something broken, or is this a "recovery" procedure for any node? Something else?
Right now, we are performing a POC with the goal of taking a full data backup using PBM. After that, we plan to enable incremental PITR to take hourly backups. During the POC, we successfully pushed the data to S3 through PBM. However, when we tried to restore the same data after enabling PITR, the restore completed successfully, but all mongod processes on both the primary and secondary servers failed. We have also configured a replica set between the primary and secondary servers to ensure data synchronization.
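For completeness, a rough sketch of the PITR-related steps in such a POC, assuming the standard PBM CLI and an example timestamp:

pbm config --set pitr.enabled=true        # start oplog slicing for point-in-time recovery
pbm status                                # shows completed backups and valid PITR time ranges
pbm config --set pitr.enabled=false       # oplog slicing must be stopped before running a restore
pbm restore --time="2025-10-30T07:00:00Z" # restore to a point covered by a backup plus oplog slices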
The procedure I followed for the restore was: I removed the MongoDB data directory, after which the mongod service failed to start. I started it again and then executed the restore command pbm restore 2025-10-30T07:01:38Z. A few minutes later, the mongod service went inactive. I've also attached the last few log lines below (a command-level sketch of these steps follows the log excerpt).
2025-10-30T05:51:17.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] download stat: buf 536870912, arena 268435456, span 33554432, spanNum 8, cc 2, [{2 0} {2 0}]
2025-10-30T05:51:17.000+0000 I [restore/2025-10-30T05:38:05.927063022Z] preparing data
2025-10-30T05:51:35.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] oplogTruncateAfterPoint: {1761737952 1}
2025-10-30T05:51:37.000+0000 I [restore/2025-10-30T05:38:05.927063022Z] recovering oplog as standalone
2025-10-30T05:51:54.000+0000 I [restore/2025-10-30T05:38:05.927063022Z] clean-up and reset replicaset config
2025-10-30T05:52:06.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] uploading ".pbm.restore/2025-10-30T05:38:05.927063022Z/rs.poc/node.10.200.10.98:27018.hb" [size hint: 10 (10.00B); part size: 10485760 (10.00MB)]
2025-10-30T05:52:06.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] uploading ".pbm.restore/2025-10-30T05:38:05.927063022Z/rs.poc/rs.hb" [size hint: 10 (10.00B); part size: 10485760 (10.00MB)]
2025-10-30T05:52:06.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] uploading ".pbm.restore/2025-10-30T05:38:05.927063022Z/cluster.hb" [size hint: 10 (10.00B); part size: 10485760 (10.00MB)]
2025-10-30T05:52:12.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] dropping 'admin.pbmAgents'
2025-10-30T05:52:12.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] dropping 'admin.pbmBackups'
2025-10-30T05:52:12.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] dropping 'admin.pbmRestores'
2025-10-30T05:52:12.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] dropping 'admin.pbmCmd'
2025-10-30T05:52:12.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] dropping 'admin.pbmPITRChunks'
2025-10-30T05:52:12.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] dropping 'admin.pbmPITR'
2025-10-30T05:52:12.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] dropping 'admin.pbmOpLog'
2025-10-30T05:52:12.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] dropping 'admin.pbmLockOp'
2025-10-30T05:52:12.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] dropping 'admin.pbmLock'
2025-10-30T05:52:12.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] dropping 'admin.pbmLock'
2025-10-30T05:52:12.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] dropping 'admin.pbmLog'
2025-10-30T05:52:14.000+0000 I [restore/2025-10-30T05:38:05.927063022Z] restore on node succeed
2025-10-30T05:52:14.000+0000 I [restore/2025-10-30T05:38:05.927063022Z] moving to state done
2025-10-30T05:52:14.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] uploading ".pbm.restore/2025-10-30T05:38:05.927063022Z/rs.poc/node.10.200.10.98:27018.done" [size hint: 10 (10.00B); part size: 10485760 (10.00MB)]
2025-10-30T05:52:14.000+0000 I [restore/2025-10-30T05:38:05.927063022Z] waiting for done status in rs map[.pbm.restore/2025-10-30T05:38:05.927063022Z/rs.poc/node.10.200.10.104:27018:{} .pbm.restore/2025-10-30T05:38:05.927063022Z/rs.poc/node.10.200.10.94:27018:{} .pbm.restore/2025-10-30T05:38:05.927063022Z/rs.poc/node.10.200.10.98:27018:{}]
2025-10-30T05:52:19.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] uploading ".pbm.restore/2025-10-30T05:38:05.927063022Z/rs.poc/rs.done" [size hint: 10 (10.00B); part size: 10485760 (10.00MB)]
2025-10-30T05:52:19.000+0000 I [restore/2025-10-30T05:38:05.927063022Z] waiting for shards map[.pbm.restore/2025-10-30T05:38:05.927063022Z/rs.poc/rs:{}]
2025-10-30T05:52:24.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] uploading ".pbm.restore/2025-10-30T05:38:05.927063022Z/cluster.done" [size hint: 10 (10.00B); part size: 10485760 (10.00MB)]
2025-10-30T05:52:24.000+0000 I [restore/2025-10-30T05:38:05.927063022Z] waiting for cluster
2025-10-30T05:52:29.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] converged to state done
2025-10-30T05:52:29.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] uploading ".pbm.restore/2025-10-30T05:38:05.927063022Z/rs.poc/stat.10.200.10.98:27018" [size hint: 73 (73.00B); part size: 10485760 (10.00MB)]
2025-10-30T05:52:29.000+0000 I [restore/2025-10-30T05:38:05.927063022Z] writing restore meta
2025-10-30T05:52:29.000+0000 W [restore/2025-10-30T05:38:05.927063022Z] meta .pbm.restore/2025-10-30T05:38:05.927063022Z.json already exists, trying write done status with ''
2025-10-30T05:52:29.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] rm tmp conf
2025-10-30T05:52:29.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] wait for cluster status
2025-10-30T05:52:34.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] no cleanup strategy to apply
2025-10-30T05:52:34.000+0000 I [restore/2025-10-30T05:38:05.927063022Z] recovery successfully finished
2025-10-30T05:52:34.000+0000 I change stream was closed
2025-10-30T05:52:34.000+0000 D [restore/2025-10-30T05:38:05.927063022Z] hearbeats stopped
2025-10-30T05:52:34.000+0000 I Exit:
2025-10-30T05:52:34.000+0000 D [agentCheckup] deleting agent status
2025-10-30T05:53:05.000+0000 E Exit: connect to PBM: create mongo connection: ping: server selection error: server selection timeout, current topology: { Type: Unknown, Servers: [{ Addr: 127.0.0.1:27018, Type: Unknown, Last error: dial tcp 127.0.0.1:27018: connect: connection refused }, ] }
2025-10-30T05:53:35.000+0000 E Exit: connect to PBM: create mongo connection: ping: server selection error: server selection timeout, current topology: { Type: Unknown, Servers: [{ Addr: 127.0.0.1:27018, Type: Unknown, Last error: dial tcp 127.0.0.1:27018: connect: connection refused }, ] }
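To make the steps above concrete, here is a rough command-level reconstruction of the procedure described, with the dbPath, systemd unit names, and the post-restore steps treated as assumptions to verify against the PBM documentation for physical restores:

# on the secondary being restored (dbPath and unit names assumed)
sudo systemctl stop mongod
sudo rm -rf /var/lib/mongodb/*           # remove the MongoDB data directory contents
sudo systemctl start mongod pbm-agent    # mongod and the PBM agent must be running again

# run the restore (from any node with the pbm CLI configured)
pbm restore "2025-10-30T07:01:38Z"

# per the PBM docs, after a physical restore, on every node:
sudo systemctl restart mongod
sudo systemctl restart pbm-agent
pbm config --force-resync                # resync backup metadata from the remote storage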
This indicates that the PBM agent's connection to PSMDB was unexpectedly interrupted. Can you please provide the mongod logs from the server so we can see what happened on the server side at that time?
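If it helps, something along these lines should capture the relevant window on each node, assuming systemd-managed services and the default mongod log path (adjust names and paths to your deployment):

sudo journalctl -u mongod --since "2025-10-30 05:50:00" --until "2025-10-30 06:00:00"
sudo journalctl -u pbm-agent --since "2025-10-30 05:50:00" --until "2025-10-30 06:00:00"
sudo tail -n 500 /var/log/mongodb/mongod.log   # path assumed from systemLog.path in mongod.conf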