PBM oplog-replay skips the entire endpoint second; --end timestamp behaves inconsistently (PBM 2.12.0) | failing reconciliation tests

Hello there,
I’m running PITR restore tests on one of our clusters and I’ve hit behaviour in pbm oplog-replay that looks wrong, or at least very unexpected. I’ll try to describe it as clearly as I can.

The goal here is simple:
validate that a snapshot-based recovery + PITR oplog replay ends in the exact same logical state as a “reference DB” restored from a newer snapshot.
Both DBs should deterministically converge to the same point in time.
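The check itself boils down to comparing per-collection metadata between the two instances. Roughly, it works like the following sketch (Python with pymongo; the hosts are placeholders and the real test also compares storage stats, not just counts):

```python
# Sketch of the reconciliation idea: after snapshot restore + oplog replay,
# per-collection document counts on TARGET must match REFDB exactly.
from pymongo import MongoClient

SYSTEM_DBS = {"admin", "config", "local"}

def collection_counts(uri):
    client = MongoClient(uri)
    counts = {}
    for db_name in client.list_database_names():
        if db_name in SYSTEM_DBS:
            continue  # only non-system namespaces are reconciled
        db = client[db_name]
        for coll_name in db.list_collection_names():
            counts[f"{db_name}.{coll_name}"] = db[coll_name].count_documents({})
    return counts

target = collection_counts("mongodb://target-host:27017")  # snapshot X-1 + PITR replay
refdb = collection_counts("mongodb://refdb-host:27017")    # snapshot X

mismatches = {ns: (target.get(ns), refdb.get(ns))
              for ns in target.keys() | refdb.keys()
              if target.get(ns) != refdb.get(ns)}
assert not mismatches, f"reconciliation failed: {mismatches}"
```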

What I do is:

  1. Restore snapshot X-1 into a standalone single-node RS (TARGET DB).
    On first startup after restore, mongod emits the usual WiredTiger recovery lines; I read the log and extract the recovery endpoint timestamp (epoch seconds). Let's call this TARGET_START_EPOCH (there is a log-parsing sketch after this list).

  2. Restore snapshot X into a separate reference DB (REFDB).
    Same process: grab its endpoint epoch from the mongod log. This becomes my intended PITR endpoint; let's call it REFDB_END_EPOCH.

  3. Then I run the PITR oplog replay:

pbm oplog-replay \
  --start <TARGET_START_EPOCH> \
  --end <REFDB_END_EPOCH>

  4. I expect PBM to apply everything, including the ops that happen at exactly REFDB_END_EPOCH, so that TARGET + PITR matches REFDB exactly.
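(Side note on steps 1 and 2: the endpoint extraction is roughly the sketch below. It assumes MongoDB's structured JSON log and the "WiredTiger recoveryTimestamp" message; the exact message name and attribute layout may differ between server versions.)

```python
# Sketch: pull the WT recovery timestamp (epoch seconds) out of a mongod
# structured JSON log after starting the restored snapshot.
# The message name / attr layout below matches my logs and may vary by version.
import json

def recovery_epoch(log_path):
    with open(log_path) as log:
        for line in log:
            try:
                entry = json.loads(line)
            except ValueError:
                continue  # skip any non-JSON lines
            if entry.get("msg") == "WiredTiger recoveryTimestamp":
                ts = entry["attr"]["recoveryTimestamp"]["$timestamp"]
                return ts["t"]  # epoch seconds; ts["i"] is the increment
    raise RuntimeError("no recovery timestamp found in log")

TARGET_START_EPOCH = recovery_epoch("/var/log/mongodb/mongod.log")
```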

Unfortunately, this is not what happens.
Observed issue

After replay, my reconciliation tool shows visible mismatches in multiple busy collections (counts + storage stats differ). So I dug into the PITR chunks directly.

  • One particular chunk, covering e.g. 08:06:10–08:07:10, contains a lot of ops in the boundary second.

  • In my test case, at timestamp 1764576382 (just before my intended end), there are roughly 8.9k ops in total, about 1.5k of them in the affected namespaces (updates/inserts).

  • None of these were applied by PBM, even though the --end I passed should have included them.

PBM reported that the replay “finished on” {ts: 1764576382}, but my requested end time corresponds to timestamp 1764576383, so it stops one second short.

Given how oplog timestamps behave (same second, increasing increment), this makes it look like PBM is using something like:

ts <= (endSec, 0)

and since real oplog ops inside that second usually have increment >= 1, they all get treated as “after end” and skipped.

So effectively:

--end = 08:06:23
→ PBM replays ops only up to 08:06:22.

This is exactly the off-by-one second gap causing mismatches in our PITR consistency test.
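To make the suspected comparison concrete, here is the boundary behaviour modelled with plain (seconds, increment) tuples, which order the same way BSON timestamps do. This is only an illustration of the ordering, not PBM code; the numbers are from the test case above:

```python
# If PBM builds the end bound as (endSec, 0), every real oplog entry inside
# the end second (increment >= 1) sorts after the bound and gets skipped.
end_bound = (1764576382, 0)  # (endSec, 0) for the second whose ops went missing
ops_in_end_second = [(1764576382, i) for i in (1, 2, 3, 1500, 8900)]

applied = [op for op in ops_in_end_second if op <= end_bound]
print(applied)  # [] -> the whole endpoint second is skipped
```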

This behaviour is very easy to reproduce on oplog-heavy workloads where many ops land in the same second.

Environment

  • PBM version: 2.12.0

  • PSMDB backend (single-node RS for restore-testing)

  • Oplog chunking: 1-minute slices, avg size 41–56 MB per chunk

  • Collections are high-traffic (thousands of ops per second)

  • PITR is enabled and working otherwise correctly

Expected behaviour

If I pass:

--end 2025-12-01T08:06:23

I expect PBM to replay:

  • All ops up to and including the ones that belong to this wall-clock second.

This is the common-sense interpretation for PITR and also the only way recon tests can be reliable.
Actual behaviour

PBM appears to construct the internal end timestamp as:

(T = endSec, I = 0)

and since oplog entries for that second have (T = endSec, I >= 1), PBM sees them as greater than endTS and stops without applying them.

So the actual slice looks like:

startTS <= ts <= (endSec,0)

which in real terms ends up being:

startTS <= ts < endSec

This is definitely not what users expect from PITR.
Workaround (what I’m doing now)

If I pass:

--end = REFDB_END_EPOCH + 1 second

then PBM sets endTS = (endSec+1, 0)
→ all ops with T <= endSec get applied
→ ops with (T = endSec+1, I > 0) are still rejected
→ no data corruption or overshoot occurs.
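In my harness the workaround is just a +1 shift on the epoch before calling PBM. A sketch of the hypothetical wrapper (it passes raw epoch values, the same form as the commands above):

```python
# Sketch of the +1s workaround: shift the requested end by one second so the
# whole REFDB_END_EPOCH second falls inside PBM's exclusive upper bound.
import subprocess

def replay_inclusive(start_epoch, end_epoch):
    subprocess.run(
        [
            "pbm", "oplog-replay",
            "--start", str(start_epoch),
            "--end", str(end_epoch + 1),  # +1s "insurance" to cover the end second
        ],
        check=True,  # fail the test run if pbm exits non-zero
    )

# usage: replay_inclusive(TARGET_START_EPOCH, REFDB_END_EPOCH)
```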

This gives me correct matching results for the recon check.

But this is not intuitive and not documented anywhere, so other users can absolutely fall into the same trap.

What I kindly ask from Percona devs

  • Please confirm the intended semantics of --end.

  • If the behaviour is meant to be exclusive, then documentation should state clearly:

    “PBM interprets --end as a strict upper bound; use end+1s if you want inclusive PITR behavior.”

  • If the behaviour is not intentional, then adjusting the internal end timestamp (e.g., constructing (endSec, MaxIncrement) instead of (endSec, 0)) would immediately close the gap (see the small sketch after this list).

  • Alternatively: expose a flag like --end-inclusive, or document best practices.
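As a quick illustration of the MaxIncrement suggestion, using the same tuple model as above (again just the ordering argument, not PBM code):

```python
# With an end bound of (endSec, max uint32), every real op inside the end
# second sorts at or below the bound, so that second becomes inclusive.
MAX_INC = 0xFFFFFFFF
inclusive_bound = (1764576382, MAX_INC)
ops_in_end_second = [(1764576382, i) for i in (1, 2, 3, 1500, 8900)]

applied = [op for op in ops_in_end_second if op <= inclusive_bound]
print(len(applied))  # 5 -> the whole endpoint second is applied
```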

This issue is particularly visible on high-write workloads where many operations share the exact same second.

I’m attaching anonymised snippets of:

  • PITR chunk header for the affected minute

  • selection of representative oplog entries inside the boundary second

  • PBM logs with “finished on” timestamp

  • recon mismatch metadata summarizing the drift between TARGET and REFDB

Let me know if you want the full oplog slice or replay logs.

Just for more clarification…

We do external, snapshot-based PITR recovery from a consistent snapshot captured while the DB had an open backup cursor (in the “backupReady” state). We use enterprise snapshot features of the storage/hypervisor layer, which capture a guaranteed-consistent snapshot at a point in time.

Afterwards, when the DBs are started (X-1 as the TARGET DB and X as the reference DB for reconciliation), the endpoint timestamps resulting from WiredTiger recovery are captured and used in the later PITR oplog-replay phase.

So we use this combination of external snapshot-based backup plus PITR oplog-only 1-minute slices, which are huge in size.

To speed up the oplog-replay process, I set storage.journal.commitIntervalMs to its maximum value of 500; it had a practical impact on applyOps processing time, reducing it to a more acceptable recovery time objective.

However, as the average chunk size in this particular case (an oplog-heavy business instance) is ~45–50 MB, it still takes a long time to process.

In the final reconciliation test, where I compare side-by-side metadata stats for all non-system MongoDB collections (non-system namespaces only), I saw logical mismatches in document counts and data sizes, which the reconciliation test treats as an unacceptable, incomplete state.
By further analysing and decompressing the xxx.snappy 1-minute oplog chunk, we found the exact missing docs in the last epoch second (the $END timestamp); these were the ops that got skipped.
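The per-second breakdown over the decompressed chunk was done along these lines (a sketch; it assumes the chunk has already been decompressed to raw oplog BSON, since the exact compression and framing depend on the configured PBM compression):

```python
# Sketch: count oplog entries per second in a decompressed PITR chunk
# to see how many ops fall exactly into the boundary ($END) second.
from collections import Counter
import bson  # the bson module shipped with pymongo

END_EPOCH = 1764576382  # the boundary second under investigation

per_second = Counter()
with open("oplog_chunk.decompressed.bson", "rb") as chunk:
    for entry in bson.decode_file_iter(chunk):
        per_second[entry["ts"].time] += 1  # "ts" is a BSON Timestamp

print(per_second[END_EPOCH], "ops in the boundary second")
```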

If needed, I can provide more details for this case.
I hope someone notices this thread, because this issue matters when real production systems depend on PBM utilities and expect the end times to be reliable.
If needed, I can also provide exact evidence for reproduction.

Sidenote: right now we cope with it by always setting the END time + 1 second as insurance, so that operations are applied through the required endpoint second, up to the last operation matching the given point in time. I'm not sure whether this is actually best practice, but we need guaranteed operation-level consistency.

Thank you very much in advance.

Hello,
Thank you for sharing your use case; it's a very interesting one, to say the least.

I can confirm that your explanation is 100% correct. Concretely, when the pbm oplog-replay command is given, e.g.:
--end 2025-12-01T08:06:22
that means exactly:
that means applying ops with ts <= (1764576382,0), or, more readably, ts <= (2025-12-01T08:06:22,0).

We should explain that more precisely within the documentation, but that’s expected behavior.

A hint that might be useful: there is an undocumented possibility of passing MongoDB's timestamp value directly, as an (epoch, ordinal) pair.
So for the above example, it’ll be possible to use:
--end "1764576382,0" or --end "1764576382,100", so it's possible to specify exactly up to which operation the oplog should be applied.


Hello Boris,

Thank you very much for your response. That's great news for us; I will adapt my code and implement it. My bad, I didn't check your code base in more detail and missed this one :slight_smile:

So now we can roll back, even when someone executes a destructive command on the DB, to the exact operation preceding the bad one.