Hello there,
I’m running PITR restore tests on one of our clusters and I’ve hit behaviour in pbm oplog-replay that looks wrong, or at least very unexpected. I’ll try to describe it as clearly as I can.
The goal here is simple:
validate that a snapshot-based recovery + PITR oplog replay ends in the exact same logical state as a “reference DB” restored from a newer snapshot.
Both DBs should deterministically converge at the same point-in-time.
What I do is:
-
Restore snapshot X-1 into a standalone single-node RS (TARGET DB).
On first startup after restore, Mongo emits the usual WiredTiger recovery lines — I read the log and extract the RS endpoint timestamp (epoch). Let’s call thisTARGET_START_EPOCH. -
Restore snapshot X into a separate reference DB (REFDB).
Same process — grab its endpoint epoch from the Mongo log. This becomes my intended PITR endpoint, let’s call itREFDB_END_EPOCH. -
Then I do PITR as expected:
pbm oplog-replay
–start <TARGET_START_EPOCH>
–end <REFDB_END_EPOCH>
- I expect PBM to apply everything including the ops that happen at exactly
REFDB_END_EPOCH, so that TARGET+PITR matches REFDB exactly.
Unfortunately, this is not what happens.
Observed issue
After replay, my reconciliation tool shows visible mismatches in multiple busy collections (counts + storage stats differ). So I dug into the PITR chunks directly.
-
One particular chunk, covering e.g.
08:06:10–08:07:10, contains a lot of ops in the boundary second. -
In my test case, at timestamp
1764576382(just before my intended end), there are 8,9k ops total, roughly 1.5k in affected namespaces (updates/inserts). -
None of these were applied by PBM, even though the
--endI passed should have included them.
PBM reported that replay “finished on” {ts : 1764576382}, but my requested end time corresponds to timestamp 1764576383. So it stops one second short.
Given how oplog entries behave (same second, increment increasing), this makes it look like PBM is using something like:
ts <= (endSec, 0)
and since real oplog ops inside that second usually have increment >= 1, they all get treated as “after end” and skipped.
So effectively:
--end = 08:06:23
→ PBM replays ops only up to 08:06:22.
This is exactly the off-by-one second gap causing mismatches in our PITR consistency test.
This behaviour is very easy to reproduce on oplog-heavy workloads where many ops land in the same second.
Environment
-
PBM version: 2.12.0
-
PSMDB backend (single-node RS for restore-testing)
-
Oplog chunking: 1-minute slices, avg size 41–56 MB per chunk
-
Collections are high-traffic (thousands of ops per second)
-
PITR is enabled and working otherwise correctly
Expected behaviour
If I pass:
--end 2025-12-01T08:06:23
I expect PBM to replay:
- All ops up to and including the ones that belong to this wall-clock second.
This is the common-sense interpretation for PITR and also the only way recon tests can be reliable.
Actual behaviour
PBM appears to construct the internal end timestamp as:
(T = endSec, I = 0)
and since oplog entries for that second have (T = endSec, I >= 1), PBM sees them as greater than endTS and stops without applying them.
So the actual slice looks like:
startTS <= ts <= (endSec,0)
which in real terms ends up being:
startTS <= ts < endSec
This is definitely not what users expect from PITR.
Workaround (what I’m doing now)
If I pass:
--end = REFDB_END_EPOCH + 1 second
then PBM sets endTS = (endSec+1, 0)
→ all ops with T <= endSec get applied
→ ops with (T = endSec+1, I > 0) are still rejected
→ no data corruption or overshoot occurs.
This gives me correct matching results for the recon check.
But this is not intuitive and not documented anywhere, so other users can absolutely fall into the same trap.
What I kindly ask from Percona devs
-
Please confirm the intended semantics of
--end. -
If the behaviour is meant to be exclusive, then documentation should state clearly:
“PBM interprets
--endas a strict upper bound; use end+1s if you want inclusive PITR behavior.” -
If the behaviour is not intentional, then adjusting the internal timestamp (e.g., constructing
(endSec, MaxIncrement)instead of(endSec, 0)) would immediately fix the gap. -
Alternatively: expose a flag like
--end-inclusive, or document best practices.
This issue is particularly visible on high-write workloads where many operations share the exact same second.
I’m attaching anonymised snippets of:
-
PITR chunk header for the affected minute
-
selection of representative oplog entries inside the boundary second
-
PBM logs with “finished on” timestamp
-
recon mismatch metadata summarizing the drift between TARGET and REFDB
Let me know if you want the full oplog slice or replay logs.