PITR restore performance

A follow-up from "Restore on a local docker deployment - #7 by Adraen". I’m opening a new thread because this is a different discussion (hope that’s okay).

I have a large-ish DB (~20 GB compressed) that makes heavy use of time-series collections. I’ve enabled daily logical snapshots and PITR.

However, I’m running into absolutely terrible restore performance for PITR: my computer is mostly idle while the oplog restore crawls. I’m doing everything locally to remove as many bottlenecks as possible (no S3, no network, no encryption).

I got Codex to give me some metrics on the restore:

  • The snapshot recovery took 9 minutes and 9 seconds (that’s nice and fast).
  • The oplog replay is dead slow, at about 1.2x realtime: replaying 18 seconds of oplog takes 15 seconds, so the current estimated restore time is about 17 hours for 20 hours of PITR slices.
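The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch of the calculation using the figures quoted in this post; the function names are made up for illustration.

```python
def replay_speed(oplog_seconds: float, wall_seconds: float) -> float:
    """Seconds of oplog replayed per second of wall-clock time."""
    return oplog_seconds / wall_seconds


def estimated_restore_hours(pitr_hours: float, speed: float) -> float:
    """Wall-clock hours needed to replay `pitr_hours` of oplog at `speed`."""
    return pitr_hours / speed


# Figures from the post: 18 s of oplog replayed in 15 s, 20 h of PITR slices.
speed = replay_speed(18, 15)
hours = estimated_restore_hours(20, speed)
print(f"replay speed: {speed:.1f}x realtime")
print(f"estimated oplog replay time: {hours:.1f} h")
```

At 1.2x realtime this works out to roughly 16.7 hours, which is where the "about 17 hours" figure comes from.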

My host machine is really not saturated:

  • Mongo CPU: ~7-9%
  • PBM CPU: ~3-4%
  • RAM available: ~23 GiB
  • Swap: full but not actively swapping
  • I/O wait: ~6%

Codex’s conclusion is:

MongoDB logs show the main delay is write latency, not compute. During restore, Mongo logged slow time-series bucket writes using majority write concern, with waitForWriteConcernDurationMillis commonly around 180-408ms. The workload is applying many small applyOps operations against time-series bucket collections such as *.system.buckets.data.

The restore is therefore latency-bound and replay-order-bound. PBM/Mongo are not able to use all CPU cores or disk bandwidth during oplog replay, so aggregate system metrics look quiet even though wall-clock progress is slow.
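To make "latency-bound" concrete: if operations are applied strictly one after another, throughput cannot exceed one operation per acknowledgement latency. The sketch below applies that bound to the 180-408 ms range quoted above. It assumes, pessimistically, that every operation pays that latency; the logs only show the slow ones, so the real average is likely lower.

```python
def max_sequential_ops_per_second(latency_ms: float) -> float:
    """Upper bound on throughput when each op must be acknowledged
    before the next one starts (no parallelism, no pipelining)."""
    return 1000.0 / latency_ms


# Latency range taken from the waitForWriteConcernDurationMillis values above.
for latency_ms in (180, 408):
    bound = max_sequential_ops_per_second(latency_ms)
    print(f"{latency_ms} ms per op -> at most {bound:.1f} ops/s")
```

A handful of operations per second leaves CPU and disk almost idle, which matches the quiet system metrics.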

Additional contributing factors:

  • Time-series bucket updates cause write amplification.
  • WiredTiger is checkpointing regularly, which is normal but adds background write work.
  • The oplog cap maintainer is trimming a ~9 GB oplog, though logs do not show this as a major delay.
  • MongoDB Compass was running full collection scans/counts during the restore; this has now been closed.

PBM tuning options are limited for this phase. --num-parallel-collections and --num-insertion-workers-per-collection can help snapshot import, but the snapshot phase already completed quickly. PBM does not expose an obvious oplog replay parallelism knob in version 2.12.0.


From this it looks like the issue might simply be the latency of each op. What I find odd is that I’ve used mongorestore --oplogReplay before and I believe it was much faster than this (my oplog is long enough that I might A/B test PBM and mongorestore over the weekend).

Am I missing something by any chance? I doubt everybody is seeing such poor restore performance, as it makes PITR pretty much unusable.

So I did some tests this morning with mongorestore --oplogReplay, and it’s about 2.5x faster.

It’s better, but it’s still terribly slow and impractical (and again my PC is barely doing anything), so I guess for the time being no PITR for me :frowning:

If anyone has any suggestions on what I might be doing wrong, or what I could be doing differently, that would be great.

Hi and thank you for analyzing and sharing all the data.

The first thing we need to emphasize is that PITR restore is the slowest restore strategy that PBM supports. All others are significantly faster. But it allows restoring the data from the PITR backup to an arbitrary point in time with one-second precision, and that is its main benefit.

The low utilization of your system follows from how PITR restore works: it runs the applyOps command (https://www.mongodb.com/docs/manual/reference/command/applyOps/) for each oplog entry found in the PITR backup. PBM executes that command sequentially, which is the only possible way, so during the PITR restore phase all entries are applied one by one. That is the main reason for the performance you see. We created an investigation ticket (https://perconadev.atlassian.net/browse/PBM-1782) under which we might improve performance relative to the Mongo tools, but this will not significantly speed up the process. You should see very similar resource utilization if you insert documents one at a time in a loop.
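As a rough illustration of why a one-by-one loop leaves the machine idle, here is a small simulation. It does not talk to MongoDB or PBM: `time.sleep` stands in for waiting on each operation's acknowledgement, and the operation count and latency are made-up values chosen only to keep the run short.

```python
import time


def replay_sequentially(num_ops: int, ack_latency_s: float) -> tuple[float, float]:
    """Apply ops one at a time, waiting for each to be acknowledged.

    Returns (wall_seconds, cpu_seconds) spent in the loop.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    for _ in range(num_ops):
        # Hypothetical stand-in for "send one op and wait for the
        # acknowledgement"; the process is idle while it waits.
        time.sleep(ack_latency_s)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


wall, cpu = replay_sequentially(num_ops=20, ack_latency_s=0.01)
print(f"wall-clock: {wall:.3f} s, CPU: {cpu:.3f} s")
print(f"CPU utilization: {100 * cpu / wall:.1f}%")
```

Almost all of the wall-clock time is spent waiting rather than computing, so no single core gets anywhere near 100% even though progress is slow.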

Thanks, Boris, for the details. I understand the limitation, but I’m surprised that none of the resources is a bottleneck (I was expecting a single core at 100%).

I will review my backup strategy to see what would work better. I’ve gone back to daily snapshots for the moment, but it’s not ideal.

Thanks for your time