Full and incremental backups conflicting and stopping all backups

seacom · September 16, 2024, 12:45am

Description:

I have incremental backups scheduled at 5-minute intervals and a nightly full backup. For the past several months this has worked great. The pgBackRest lock file has always caused conflicts, leading to a failed job here and there, but the operator would run the backup again and the issue would resolve itself.

However, the addition of the latestRestorableTime update doesn’t play well with this, as the conflicting backups produce a perconapgbackup object that can’t be verified or reconciled.

I have no idea what happens in the background but this eventually leads to no backup jobs being created.

The operator doesn’t log anything to indicate there is a fault, and backup jobs eventually stop being created. Prior to that, the backup jobs just log “HINT: is another pgBackRest process running?”, and the operator just loops over “Triggering PGBackup reconcile” and “Latest commit timestamp” with “latestRestorableTime”: “”.

There are no actual errors, though.

It follows a pattern of exponential failures until backups stop altogether.

Steps to Reproduce:

Schedule full and log backups to S3 to run at the same time, for several days or until backup job pods stop being created (perconapgbackup objects will still be created, but without a status or S3 path).

Version:

2.4.1

Additional Information:

I’m pretty sure several of the other issues regarding backups on this version are just symptoms of this. I experienced a large Azure and S3 bill as a result of the operator constantly trying to reconcile nonexistent backups.

matthewb · September 16, 2024, 2:08pm

Wow. That’s a bit crazy. For the sake of fixing your issue, have you tried changing the incremental interval to something like 30m? If that fixes the issue, then the original problem was that you are taking incrementals too frequently and causing an internal race condition.

seacom · September 16, 2024, 10:01pm

Honestly I don’t think it is. 5 minutes is standard for most databases. Not to mention the Zalando operator with WAL-G never had an issue keeping up.

I have actually. It did not. It just delayed the time it took to get there, as it’s an exponential problem. The issue is the incremental and full backups running at the same time and competing for the pgBackRest lock file. When one fails because it can’t get the lock, the operator restarts the job until it does. However, the backups that failed generate perconapgbackup objects, and the operator gets stuck in a loop trying to reconcile them because they have no status on them.

This wasn’t an issue until latestrestorabletime was added.
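For a rough sense of how often the two schedules compete for the lock, here is a minimal Python sketch. It is not operator code: the function name, the midnight start time, and the 20-minute full backup duration are all assumptions made for illustration, and it ignores the operator's retries entirely. It only counts how many 5-minute incremental start times land while the nightly full would still be holding the lock.

```python
from datetime import datetime, timedelta

def colliding_incrementals(full_start, full_duration, interval, day_start):
    """Return incremental start times that fall inside the full backup window.

    Hypothetical helper: assumes one full backup per day and incrementals
    fired every `interval` starting at `day_start`.
    """
    collisions = []
    full_end = full_start + full_duration
    day_end = day_start + timedelta(days=1)
    t = day_start
    while t < day_end:
        # An incremental starting while the full is running cannot get the lock.
        if full_start <= t < full_end:
            collisions.append(t)
        t += interval
    return collisions

day = datetime(2024, 9, 16)  # arbitrary example date
hits = colliding_incrementals(
    full_start=day,                       # assumed: nightly full at 00:00
    full_duration=timedelta(minutes=20),  # assumed: full takes 20 minutes
    interval=timedelta(minutes=5),        # incrementals every 5 minutes
    day_start=day,
)
print(len(hits))  # prints 4 under these assumptions
```

Under these assumptions, four incremental runs per night (00:00, 00:05, 00:10, 00:15) overlap the full backup, and the one at 00:00 races the full for the lock directly. A longer full backup or a shorter interval raises that count.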

matthewb · September 17, 2024, 4:35am

As someone who has worked on literally thousands of MySQL servers, across all industries (Medical, PCI, Fintech, gaming…), I can legitimately say, no, 5-minute interval backups are absolutely not the standard. That would cause a crazy amount of overhead.

The most frequent interval backups I’ve seen were every 8 hours. Standard is more like daily fulls with a 5-minute rsync of the binary logs (or standing up a binlog server to receive immediate transaction copies). Multi-TB deployments typically do a full once a week, then daily interval backups, plus the binlog sync. MySQL binlogs are similar to PGSQL’s WAL archiving. I’ll ask one of my PGSQL consultants to chime in on this best practice.

Regarding your current issue, I’ll ping our operators’ tech lead and see if he has any insights.

Slava_Sarzhan · September 17, 2024, 6:57am

Hi @seacom, we will try to reproduce this issue. At the same time, starting from v2.5.0 it will be possible to disable “latestRestorableTime” tracking via an option in deploy/cr.yaml (percona/percona-postgresql-operator on GitHub). I hope that we will have a release next week.

seacom · September 17, 2024, 11:28am

Awesome thanks.

Perhaps I am confusing continuous archiving with incremental backups.
