Mongo Point-In-Time restore using EBS Snapshots

Hello Team,

Environment Details:
Mongo Server: v6.0.4
Operating System: Amazon Linux 2
Mode: Replica set with 3 nodes (1 primary and 2 secondaries)

Goal:
To do a point-in-time restore of the Mongo server using the T-1 EBS volume snapshot

Background:
We create an EBS volume snapshot of the /data mount point of the primary Mongo node every day at 23:50.

Steps we have done:

  1. Create new EC2 instances with mongo configuration
  2. Create EBS Volume for /data mount point from T-1 EBS Snapshot
  3. Mount the new volume to /data
  4. Launch the mongo service

Now we would like to restore data from the existing running cluster onto the new EC2 instance, up to 1 hour ago.

We were able to do this using mongorestore, but it took around 4 hours to replay 4 GB of oplog, which is not acceptable for us in production environments.

So we are looking at the pbm tool as an option, but we are not sure where to start after launching the new mongo node from the T-1 snapshot.

Any help would be highly appreciated.

Thanks

Hi! Can you share some more details?
We’ve just released PBM 2.2.0, which:

  1. enhances PITR with physical backups (the PITR itself is still a logical oplog replay)
  2. introduces Snapshot CLI support in PBM.

Hello @Jan_Wieremjewicz

I am not sure which further details are required, so I am sharing as many as possible. Thank you for your assistance.

A few more details on the environment:

  • T-1 means the snapshot of the previous day.
  • The currently running Mongo cluster holds the oplog for the last 57 hours.
  • Data in the Mongo cluster is around 400 GB.
  • We stop the application as soon as we detect that something has gone wrong with the data.

What we have tested as of now:

  • Launch a new EC2 instance with the T-1 EBS volume snapshot of the /data volume.
  • Get the last oplog timestamp from the newly launched instance.
  • Take an oplog backup from the existing Mongo cluster using the command below:
mongodump -u admin --authenticationDatabase=admin -h test-mongo.something.com -d local -c oplog.rs --query '{"ts": {"$gt": {"$timestamp": {"t": 1689689931, "i": 7}}}}' -o oplogDumpDir/
  • Replay the oplog on the newly launched EC2 instance up until one hour ago, using --oplogLimit with mongorestore:
mongorestore  --authenticationDatabase=admin -u root -p <password> -h 127.0.0.1  --oplogReplay --numInsertionWorkersPerCollection=50 --numParallelCollections=50 --oplogLimit 1689837570:1 oplogDumpDir/
  • After recovery is completed, re-initialise the replica set on the new node, take a snapshot of its /data EBS volume, attach volumes created from that snapshot to the existing secondary nodes, and then add them to the new node, making it the primary.
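
For reference, the timestamp handling in the steps above can be sketched roughly as follows. This is only an illustration: the helper names oplog_limit and seconds_before are made up for this post, and the mongosh one-liner in the comment assumes it is run against the restored node.

```shell
#!/bin/sh
# Last oplog entry on the restored node (needs a running mongod, shown as a comment only):
#   mongosh --quiet --eval 'db.getSiblingDB("local").oplog.rs.find().sort({$natural: -1}).limit(1).next().ts'

# Hypothetical helper: format epoch seconds as the <seconds>:<ordinal> value
# that mongorestore --oplogLimit accepts (the ordinal defaults to 1).
oplog_limit() {
  echo "$1:${2:-1}"
}

# Hypothetical helper: epoch seconds for a point N seconds before a reference epoch.
seconds_before() {
  echo $(( $1 - $2 ))
}

# Build the limit for "one hour ago" relative to the current time.
now=$(date -u +%s)
echo "--oplogLimit $(oplog_limit "$(seconds_before "$now" 3600)")"
```

The printed value is what we pass to mongorestore. Note that --oplogLimit stops before entries at or after the given timestamp.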

The above approach works fine, but it takes a lot of time to perform the point-in-time restore of the data from the T-1 EBS volume snapshot up to the specified time.

What we are looking for now is a better approach that achieves the point-in-time restore faster, to minimise application downtime as much as possible.

We were unable to find any guide on how we can use PBM CLI to replay the oplog backup on the EBS snapshot.

Hi Ritesh,

Are you running Percona Server for MongoDB (PSMDB) or the MongoDB Enterprise/Community version?

If you are running PSMDB, had you already configured PBM to take backups along with PITR before the issue happened?

Regards

Hello @Santosh_Varma

We are running the MongoDB Community version. No, we have not yet set up the PBM tool to take backups.

Currently we only take EBS snapshots of /data mount point every night.

Hi @ritesh12

What I would recommend is to give incremental physical backups a try. You can do a full backup once every 24h and then take increments, let’s say, every hour. That way you’d have backups with 1h precision and no need for an oplog replay. Just keep in mind that a physical restore is going to be slower than restoring from a snapshot, but definitely much faster than an oplog replay. If you need more precision, you can either take increments more frequently (e.g. every 30min) or add PITR. It is still going to be an oplog replay, but of much less data.
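
A rough sketch of such a schedule, assuming PBM is already configured with remote storage and pbm-agent is running on each node. The helper name pbm_backup_cmd and the cron script path are made up for illustration, and the script only prints the command instead of running it:

```shell
#!/bin/sh
# Hypothetical helper: choose the pbm command for a given hour of the day (00-23).
# Hour 00 starts a new incremental base (the full backup);
# every other hour adds an increment on top of that base.
pbm_backup_cmd() {
  if [ "$1" = "00" ]; then
    echo "pbm backup --type incremental --base"
  else
    echo "pbm backup --type incremental"
  fi
}

# Example hourly cron entry (the script path is an assumption):
#   0 * * * * /usr/local/bin/pbm-hourly.sh
# Print the command for the current hour; pipe the output to sh to actually run it.
pbm_backup_cmd "$(date +%H)"
```

Restoring is then a matter of running pbm restore with the name of the increment closest to the target time.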

I’d also recommend trying PBM snapshot-based physical backups. The main difference from “manual” snapshots for non-sharded clusters is that you don’t have to lock the database for the snapshot. I reckon you do something like db.fsyncLock() before the snapshot to ensure data won’t change while it’s being copied, but that blocks all writes. PBM, on the contrary, can ensure that the data won’t change during the copy while the database keeps running as usual and accepting new writes.
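
As a rough outline, the snapshot-based flow looks something like the sketch below. The helper name snapshot_backup_steps, the volume ID and the backup name are placeholders for illustration; in practice the backup name is whatever pbm backup -t external reports, and pbm describe-backup shows which node to snapshot. The function only prints the steps as a checklist.

```shell
#!/bin/sh
# Hypothetical helper: print the steps of a PBM snapshot-based ("external") backup.
#   $1 = EBS volume ID behind /data on the node PBM selected (placeholder)
#   $2 = backup name reported by "pbm backup -t external" (placeholder)
snapshot_backup_steps() {
  volume_id="$1"
  backup_name="$2"
  # 1. Ask PBM to open a backup and keep the data files consistent for copying.
  echo "pbm backup -t external"
  # 2. Check which node and files the backup covers.
  echo "pbm describe-backup $backup_name"
  # 3. Take the EBS snapshot while the backup is open.
  echo "aws ec2 create-snapshot --volume-id $volume_id --description pbm-$backup_name"
  # 4. Tell PBM the copy is done so it can close the backup.
  echo "pbm backup-finish $backup_name"
}

# Placeholder values, not real identifiers.
snapshot_backup_steps "vol-EXAMPLE" "BACKUP_NAME"
```

The restore side follows the same pattern: pbm restore --external, put the data back from the snapshot, then pbm restore-finish.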

Just keep in mind that for any physical/incremental/snapshot-based backups you need PSMDB, as Community MongoDB doesn’t have the backup cursors which PBM uses for these types of backups.
