PITR issue - oplog has insufficient range

Hello.

We are using the following MongoDB setup:

  1. replica set consisting of three separate servers

  2. each server is running MongoDB version 7.0.5-3

  3. each server is running PBM agent version 2.4.0

  4. the environment is built from scratch; during the process a logical base backup is made and PITR is turned on

  5. PITR is using these settings:
    pitr:
      enabled: true
      oplogSpanMin: 10
      compression: gzip
      compressionLevel: -1
      oplogonly: false

  6. oplog size is approximately 1.5 GB

  7. logical base backup is made once every day

During standard operation, PITR saves a chunk every 10 minutes, as configured. The "pbm status" command shows something like this:
"Backups:",
"========",
"S3 local s3://http:xxxxxxxxxxxxxxxxxxxxxxxxx",
"  Snapshots:",
"    2024-03-25T07:33:03Z 292.33KB <incremental, base> [restore_to_time: 2024-03-25T07:33:11Z]",
"  PITR chunks [3.07GB]:",
"    2024-03-25T07:33:12Z - 2024-03-25T10:33:24Z"

But we have come across the following situation:

  1. during a 10-minute PITR window, a large write to the DB fills the whole oplog and causes it to rotate
  2. when PITR is about to create a new chunk, it detects that the whole oplog has rotated and fails with the following error:

2024-03-25T10:43:17Z E [xxxx] [pitr] streaming oplog: oplog has insufficient range, some records since the last saved ts {1711362804 17} are missing. Run pbm backup to create a valid starting point for the PITR
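
The condition behind this error can be illustrated with a small Python sketch. This is a simplified model and not PBM's actual implementation: the function names are hypothetical, and only the seconds part of the timestamp is compared. The value 1711362804 is the seconds part of the "last saved ts" in the log line above; the oldest-oplog value is an assumed number chosen for illustration.

```python
from datetime import datetime, timezone


def ts_to_utc(seconds: int) -> str:
    """Render the seconds part of a MongoDB timestamp as an ISO-style UTC string."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def oplog_has_gap(last_saved_seconds: int, oldest_oplog_seconds: int) -> bool:
    """Return True if the oldest entry still in the oplog is newer than the
    last entry already saved, i.e. some entries in between were rotated out."""
    return oldest_oplog_seconds > last_saved_seconds


last_saved = 1711362804            # seconds part of {1711362804 17} from the error
oldest_in_oplog = last_saved + 300  # assumption: oplog now starts 5 minutes later

print(ts_to_utc(last_saved))                        # 2024-03-25T10:33:24Z
print(oplog_has_gap(last_saved, oldest_in_oplog))   # True
```

The decoded time matches the end of the last PITR chunk reported by "pbm status", which is the point from which the oplog entries are missing.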

Okay, makes sense - PITR has lost track of some records within the 10-minute span.
How do we make it work again? We have created a new base backup; shouldn’t PITR automatically start a new chunk chain? See the example below:

"Backups:",
"========",
"S3 local s3://http:xxxxxxxxxxxxxxxxxxxxxxxxx",
"  Snapshots:",
"    2024-03-25T11:46:13Z 3.88GB <incremental, base> [restore_to_time: 2024-03-25T11:46:20Z]", - we made a new backup here
"    2024-03-25T07:33:03Z 292.33KB <incremental, base> [restore_to_time: 2024-03-25T07:33:11Z]",
"  PITR chunks [3.07GB]:",
"    (shouldn’t a new chunk chain start here, using the 2024-03-25T11:46:13Z backup?)",
"    2024-03-25T07:33:12Z - 2024-03-25T10:33:24Z"

PITR keeps failing with the same error, even though a new base backup is available. Is this expected behaviour? We’d expect a new chunk chain to start.

Thank you for your comments.

Hi,

You’re right, there is an issue on the PBM side, which will be addressed in PBM-1344 in the upcoming PBM release.

Also, for PBM to be able to save the oplog between backups, I’d suggest you either increase the oplog size on the server side or decrease the oplogSpanMin interval on the PBM side.
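
As a rough way to reason about that trade-off, here is a minimal Python sketch of the sizing arithmetic. It is a back-of-the-envelope model only: the function names, the burst write rate of 0.25 GB per minute, and the safety factor of 2 are assumptions for illustration, not measured values or PBM recommendations.

```python
def oplog_window_minutes(oplog_size_gb: float, write_rate_gb_per_min: float) -> float:
    """Minutes of history the oplog can hold at a sustained write rate."""
    if write_rate_gb_per_min <= 0:
        raise ValueError("write rate must be positive")
    return oplog_size_gb / write_rate_gb_per_min


def min_oplog_size_gb(write_rate_gb_per_min: float, span_min: float,
                      safety_factor: float = 2.0) -> float:
    """Oplog size needed so that one PITR span fits, with some headroom."""
    return write_rate_gb_per_min * span_min * safety_factor


oplog_size_gb = 1.5   # from the setup described above
span_min = 10         # oplogSpanMin from the PITR settings above
burst_rate = 0.25     # assumption: GB of oplog written per minute during a burst

window = oplog_window_minutes(oplog_size_gb, burst_rate)
print(window)                                   # 6.0
print(window >= span_min)                       # False: the oplog rotates before the span ends
print(min_oplog_size_gb(burst_rate, span_min))  # 5.0
```

Under these assumed numbers, either the oplog would need to grow to about 5 GB or oplogSpanMin would need to drop below 6 minutes. The oplog can be resized with MongoDB's replSetResizeOplog admin command (size given in megabytes), and the span can be changed with pbm config --set pitr.oplogSpanMin=<minutes>.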