Initially my restore script was working fine, but after a few weeks it stopped working and now fails with the error below. I was not able to find any existing reports or solutions.
I’m using filesystem storage on an sshfs-mounted storage box for this, so I thought it might be a storage bottleneck, but that does not seem to be the case.
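For context, the storage is wired up roughly like this (the hostname, user, mount options, and config file path below are placeholders; the storage path matches the config shown further down):
# sshfs mount of the storage box (user/host are placeholders)
sshfs u12345@storagebox.example.com:/backups /tmp/backup-mount -o reconnect,ServerAliveInterval=15
# point PBM at the mounted path using filesystem storage
cat <<'EOF' > /etc/pbm-storage.yaml
storage:
  type: filesystem
  filesystem:
    path: /tmp/backup-mount/mongo-backup
EOF
pbm config --file /etc/pbm-storage.yaml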
Starting restore 2025-05-13T18:28:08.266159982Z to point-in-time 2025-05-13T14:42:31 from '2025-04-24T12:01:21Z'
2025-05-13T18:28:08.000+0000 D [pitr] error while executing pitr, pitr procedure will be restarted
mongo-1 | 2025-05-13T18:28:09.000+0000 I got command restore [name: 2025-05-13T18:28:08.266159982Z, snapshot: 2025-04-24T12:01:21Z point-in-time: <1747147351,0>] <ts: 1747160888>, opid: 68238f3809fe636ddb8c3063
mongo-1 | 2025-05-13T18:28:09.000+0000 I got epoch {1747160554 6105}
mongo-1 | 2025-05-13T18:28:09.000+0000 I [restore/2025-05-13T18:28:08.266159982Z] to time: 2025-05-13T14:42:31Z
mongo-1 | 2025-05-13T18:28:09.000+0000 I [restore/2025-05-13T18:28:08.266159982Z] oplog slicer disabled
mongo-1 | 2025-05-13T18:28:09.000+0000 I [restore/2025-05-13T18:28:08.266159982Z] backup: 2025-04-24T12:01:21Z
mongo-1 | 2025-05-13T18:28:09.000+0000 I [restore/2025-05-13T18:28:08.266159982Z] recovery started
mongo-1 | 2025-05-13T18:28:09.000+0000 D [restore/2025-05-13T18:28:08.266159982Z] waiting for 'starting' status
mongo-1 | .........2025-05-13T18:28:17.000+0000 D [pitr] stop pitr hb
mongo-1 | 2025-05-13T18:28:17.000+0000 D [pitr] stop pitr agent activity monitor
mongo-1 | 2025-05-13T18:28:17.000+0000 D [pitr] stop pitr topo monitor
mongo-1 | 2025-05-13T18:28:17.000+0000 D [pitr] stop pitr config monitor
mongo-1 | 2025-05-13T18:28:17.000+0000 D [pitr] stop pitr error monitor
mongo-1 | Error: get metadata: get: context deadline exceeded
It seems some of your pbm-agents might be stuck. Can you try restarting them? Which version of PBM are you using?
Hey, thanks for following up on this. Yes, both mongo and the agent have been restarted, and I have already tried fully wiping the environment and starting from scratch.
After some debugging, I figured out that:
- it’s not an issue with mongodb connection
- it doesn’t seem to be an issue with the storage location
- there are no locks in mongo (checked as sketched below)
- I’m using 3.9.1, the same version I have been using before; this is also reflected in .pbm.init, so that didn’t change
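A minimal sketch of the lock check, assuming PBM still keeps its locks in the admin.pbmLock / admin.pbmLockOp collections:
# both queries came back empty, i.e. no leftover PBM locks
mongosh --quiet --eval 'db.getSiblingDB("admin").pbmLock.find().toArray()'
mongosh --quiet --eval 'db.getSiblingDB("admin").pbmLockOp.find().toArray()'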
Then I tried running the restore without PITR, like this:
pbm restore $latest_snapshot -w
and that worked. So I assume something is corrupted on the PITR side, but how do I debug this? The logs give no helpful output, and the error doesn’t immediately make sense in relation to PITR.
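For reference, the PITR restore that fails is invoked by the script roughly like this (the timestamp here is just an example of the format):
pbm restore --base-snapshot "$latest_snapshot" --time "2025-05-13T14:42:31" -w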
EDIT: also, there is only one mongo instance in the cluster:
rs0 [direct: primary] test> ...
PBM Config found: [
{
"_id": "68248a1574f6cb3403da0a87",
"storage": {
"type": "filesystem",
"filesystem": {
"path": "/tmp/backup-mount/mongo-backup"
}
},
"pitr": {
"enabled": true,
"oplogSpanMin": 10
},
"epoch": {
"$timestamp": "7504275762761957377"
}
}
]
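For what it’s worth, the same settings can also be read via the CLI instead of mongosh:
pbm config --list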
You can look at the pbm-agent log to see more detail about what exactly is going on with PITR. If you suspect a bug, please open a report at jira.percona.com.
Yes, we could submit a bug report.
However, I’m not entirely sure what I would put into the bug report. The only output I have is what is shown above; there are no more detailed logs. The error itself seems to come from a Go HTTP request, but from testing it appears to be related to PITR, so I don’t understand what caused it. That’s why I’m asking here, to see if anyone has encountered this before. We have only had PBM in production for a month or so and are already having issues.
What complicates matters is that, as far as I’m aware, there is no utility available to test the integrity of the backup slices, or any other way to debug this, so I can’t confirm what caused it.
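The closest I can get is eyeballing the slice files on the storage and comparing them with what PBM reports; a rough sketch, assuming PITR chunks live under pbmPitr/<replset> in the storage path:
# list the oplog slices PBM has written for rs0
ls -lhR /tmp/backup-mount/mongo-backup/pbmPitr/rs0/
# compare with the snapshots and PITR ranges PBM itself reports
pbm list
pbm status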
Start by looking at the pbm-agent logs on each node:
journalctl -u pbm-agent -f
There is more detailed output there than you get from running pbm logs alone.
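You can also bump the verbosity of pbm logs itself; if I remember the flags correctly, something like:
# last 200 entries, including debug-severity messages
pbm logs -t 200 -s D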