I have incremental backups scheduled at 5-minute intervals and a nightly full backup. For the past several months this has worked great. The pgBackRest lock file has always conflicted, causing a failed job here and there, but the operator would run the backup again and the issue would resolve itself.
However, the addition of the latestRestorableTime update doesn’t play well with this, as the conflicted backups produce a perconapgbackup object that can’t be verified or reconciled.
I have no idea what happens in the background but this eventually leads to no backup jobs being created.
The operator doesn’t log anything to indicate there is a fault, and backup jobs eventually stop being created. Prior to that, the backup jobs just log
“HINT: is another pgBackRest process running?”, and the operator just loops over “Triggering PGBackup reconcile” and “Latest commit timestamp” with “latestRestorableTime”: “”.
There are no actual errors, though.
It follows a pattern of exponential failures until backups stop altogether.
Steps to Reproduce:
Schedule full and log backups to S3 to run at the same time, and leave them running for several days or until backup job pods stop being created (perconapgbackup objects will still be created, but without a status or S3 path).
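For anyone trying to reproduce this, a quick way to see when two schedules start at the same minute is to expand them by hand. The following is only a rough sketch: the two cron strings are hypothetical examples (the actual schedules are not given above), and the helper only understands `*`, `*/n`, and a single integer in the minute and hour fields.

```python
def expand_field(field, lo, hi):
    """Expand one cron field ('*', '*/n', or a single integer) into a set of values."""
    if field == "*":
        return set(range(lo, hi + 1))
    if field.startswith("*/"):
        step = int(field[2:])
        return set(range(lo, hi + 1, step))
    return {int(field)}


def firing_minutes(cron):
    """Return the minutes of the day (0..1439) at which a schedule fires.

    Only the minute and hour fields are read; day, month, and weekday are ignored.
    """
    minute_field, hour_field = cron.split()[:2]
    minutes = expand_field(minute_field, 0, 59)
    hours = expand_field(hour_field, 0, 23)
    return {h * 60 + m for h in hours for m in minutes}


def collisions(cron_a, cron_b):
    """Minutes of the day at which both schedules start a job."""
    return sorted(firing_minutes(cron_a) & firing_minutes(cron_b))


incremental = "*/5 * * * *"  # hypothetical: every 5 minutes
full = "0 1 * * *"           # hypothetical: nightly at 01:00

print(collisions(incremental, full))         # [60]  -> both start at 01:00
print(collisions(incremental, "2 1 * * *"))  # []    -> full moved to 01:02
```

Note that moving the full backup off the 5-minute grid only avoids a simultaneous start. A full backup that is still running when the next incremental fires would contend for the lock anyway.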
Version:
2.4.1
Additional Information:
I’m pretty sure several of the other issues regarding backups on this version are just symptoms of this. I experienced a large Azure and S3 bill as a result of the operator constantly trying to reconcile nonexistent backups.
Wow. That’s a bit crazy. For the sake of fixing your issue, have you tried changing this to something like 30m? If that fixes the issue, then the original problem was that you are taking incrementals too frequently and causing an internal race condition.
Honestly I don’t think it is. 5 minutes is standard for most databases. Not to mention the Zalando operator with WAL-G never had an issue keeping up.
I have actually. It did not. It just delayed the time it took to get there, as it’s an exponential problem. The issue is the incremental and full backups running at the same time and competing for the pgBackRest lock file. When one fails because it can’t get the lock, the operator restarts the job until it succeeds. However, the backups that failed still generate perconapgbackup objects, and the operator gets stuck in a loop trying to reconcile them because those objects have no status on them.
This wasn’t an issue until latestRestorableTime was added.
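To make the lock contention concrete, here is a minimal sketch of two jobs competing for one exclusive, non-blocking file lock. It is an illustration only, not pgBackRest's or the operator's actual code: the lock path and the `try_lock`/`release` helpers are made up for the example, and it relies on POSIX `flock`, so it will not run on Windows.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Try to take an exclusive, non-blocking lock on path.

    Returns the open file descriptor on success, or None if the lock is held.
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Someone else holds the lock: this is the "is another process running?" case.
        os.close(fd)
        return None
    return fd


def release(fd):
    """Release the lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)


lock_path = os.path.join(tempfile.mkdtemp(), "backup.lock")  # hypothetical path

full_backup = try_lock(lock_path)   # the "full" job gets the lock
incremental = try_lock(lock_path)   # the "incremental" job is refused
print(full_backup is not None, incremental is None)  # True True

release(full_backup)
retry = try_lock(lock_path)         # a retry after the release succeeds
print(retry is not None)            # True
release(retry)
```

The refused attempt is harmless on its own, since a retry succeeds once the lock is free. The problem described above is what gets left behind by each refused attempt.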
As someone who has worked on literally thousands of MySQL servers, across all industries (Medical, PCI, Fintech, gaming…), I can legitimately say that no, 5-minute interval backups are absolutely not the standard. That would cause a crazy amount of overhead.
The most frequent interval backups I’ve seen were every 8hrs. The standard is more like daily fulls with a 5m rsync of the binary logs (or standing up a binlog server to receive immediate transaction copies). Multi-TB databases typically do a full once a week, then daily intervals, plus the binlog sync. MySQL binlogs are similar to PGSQL’s WAL archiving. I’ll ask one of my PGSQL consultants to chime in on this best practice.
Regarding your current issue, I’ll ping our operators’ tech lead and see if he has any insights.