Scheduled backups aren't deleted, PITR is not continuous


I turned on backups alongside PITR. First I created a full (on-demand) backup (2023-07-03T05:56:49Z) as a base backup for PITR, then I set up scheduled backups with cron (2023-07-03T07:55:58Z is the first scheduled backup). My backup configuration looks like this:


  backup:
    enabled: true
    tasks:
      - name: {{name}}
        storageName: main
        enabled: true
        keep: 2
        schedule: "*/5 * * * *"
        type: logical
    pitr:
      enabled: true
      oplogSpanMin: 3
    storages:
      main:
        s3:
          bucket: {{bucket}}
          credentialsSecret: {{credentialsSecret}}
          endpointUrl: {{endpointUrl}}
          prefix: {{prefix}}
          region: {{region}}
        type: s3
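If I read oplogSpanMin correctly, the agent saves an oplog slice at least every 3 minutes, which is more frequent than the 5-minute backup schedule, so PITR coverage between backups should be seamless. A small sketch of the implied timeline (the arithmetic is mine, not from the operator docs):

```python
from datetime import datetime, timedelta

# Illustration of the timing implied by the config above (my arithmetic,
# not actual operator behavior): backups fire every 5 minutes, and with
# oplogSpanMin: 3 an oplog slice is saved at least every 3 minutes,
# so PITR coverage between consecutive backups should have no holes.
base = datetime(2023, 7, 3, 7, 55, 58)  # first scheduled backup in my test

backups = [base + timedelta(minutes=5 * i) for i in range(3)]
slices = [base + timedelta(minutes=3 * i) for i in range(5)]

print("backups:", [t.strftime("%H:%M:%S") for t in backups])
print("slices: ", [t.strftime("%H:%M:%S") for t in slices])
```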

After some time, I observed the following actual result:

Backup snapshots:
  2023-07-03T05:56:12Z <logical> [restore_to_time: 2023-07-03T05:56:49Z]
  2023-07-03T07:55:21Z <logical> [restore_to_time: 2023-07-03T07:55:58Z]
  2023-07-03T08:25:21Z <logical> [restore_to_time: 2023-07-03T08:25:31Z]
  2023-07-03T09:00:41Z <logical> [restore_to_time: 2023-07-03T09:00:49Z]
  2023-07-03T09:15:41Z <logical> [restore_to_time: 2023-07-03T09:15:51Z]
  2023-07-03T09:25:47Z <logical> [restore_to_time: 2023-07-03T09:26:32Z]
  2023-07-03T09:30:56Z <logical> [restore_to_time: 2023-07-03T09:31:07Z]
  2023-07-03T09:40:55Z <logical> [restore_to_time: 2023-07-03T09:41:04Z]

As you can see, keep: 2 doesn't work; if it did, there would be only three backups (expected result):

Backup snapshots:
  2023-07-03T05:56:12Z <logical> [restore_to_time: 2023-07-03T05:56:49Z]
  2023-07-03T09:30:56Z <logical> [restore_to_time: 2023-07-03T09:31:07Z]
  2023-07-03T09:40:55Z <logical> [restore_to_time: 2023-07-03T09:41:04Z]
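My understanding of the keep semantics, as a sketch (the function name and data layout are mine, not the operator's actual code): on-demand backups are exempt from pruning, and only the newest keep scheduled snapshots of a task survive each pass.

```python
# Hypothetical sketch of the retention I expected from `keep: 2`:
# on-demand backups are never pruned; for a scheduled task only the
# newest `keep` snapshots survive. Illustrative only, not operator code.
def prune(snapshots, keep):
    scheduled = sorted(
        (s for s in snapshots if s["scheduled"]),
        key=lambda s: s["start"],
        reverse=True,
    )
    to_delete = scheduled[keep:]  # everything older than the newest `keep`
    return [s for s in snapshots if s not in to_delete]

snapshots = [
    {"start": "2023-07-03T05:56:12Z", "scheduled": False},  # on-demand base
    {"start": "2023-07-03T07:55:21Z", "scheduled": True},
    {"start": "2023-07-03T08:25:21Z", "scheduled": True},
    {"start": "2023-07-03T09:30:56Z", "scheduled": True},
    {"start": "2023-07-03T09:40:55Z", "scheduled": True},
]

kept = prune(snapshots, keep=2)
print([s["start"] for s in kept])
# the on-demand base plus the two newest scheduled backups remain
```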

In the operator logs I see the following errors:

2023-07-03T09:50:40.305Z ERROR failed to run finalizer {"controller": "perconaservermongodbbackup-controller", "object": {"name":"cron-cloud-documentdb-20230703092500-xppp2","namespace":"documentdb40"}, "namespace": "documentdb40", "name": "cron-cloud-documentdb-20230703092500-xppp2", "reconcileID": "0c3ea5d8-0d7a-4930-a6b9-2e12b99da437", "finalizer": "delete-backup", "error": "unable to delete: backup is a base for '2023-07-03T09:26:32 - 2023-07-03T09:41:04'", "errorVerbose": "unable to delete: backup is a base for '2023-07-03T09:26:32 - 2023-07-03T09:41:04'\n(*PBM).probeDelete\n\t/go/pkg/mod/\n(*PBM).DeleteBackup\n\t/go/pkg/mod/\n(*ReconcilePerconaServerMongoDBBackup).checkFinalizers\n\t/go/src/\n(*ReconcilePerconaServerMongoDBBackup).Reconcile\n\t/go/src/\n(*Controller).Reconcile\n\t/go/pkg/mod/\n(*Controller).reconcileHandler\n\t/go/pkg/mod/\n(*Controller).processNextWorkItem\n\t/go/pkg/mod/\n(*Controller).Start.func2.2\n\t/go/pkg/mod/\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594"}
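The error reads as if PBM refuses to delete any snapshot that anchors a PITR timeline. A minimal sketch of that guard as I understand it (the probe_delete name mirrors the probeDelete in the trace, but this implementation is my guess, not PBM's code):

```python
# Sketch of the delete guard suggested by the error message: a backup whose
# restore_to_time is the base of a PITR timeline cannot be deleted.
# This is my reading of the log, not PBM's actual implementation.
def probe_delete(backup_restore_to, timelines):
    for start, end in timelines:
        if backup_restore_to == start:
            raise RuntimeError(
                f"unable to delete: backup is a base for '{start} - {end}'"
            )

timelines = [("2023-07-03T09:26:32", "2023-07-03T09:41:04")]

probe_delete("2023-07-03T08:25:31", timelines)  # ok: not a base
try:
    probe_delete("2023-07-03T09:26:32", timelines)
except RuntimeError as e:
    print(e)  # same shape as the finalizer error above
```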

Moreover, the PITR ranges have gaps:

 PITR <on>:
  2023-07-03T05:56:50Z - 2023-07-03T07:50:28Z
  2023-07-03T07:55:59Z - 2023-07-03T08:10:37Z
  2023-07-03T08:25:32Z - 2023-07-03T08:35:31Z
  2023-07-03T09:00:50Z - 2023-07-03T09:05:57Z
  2023-07-03T09:15:52Z - 2023-07-03T09:20:48Z
  2023-07-03T09:26:33Z - 2023-07-03T09:41:04Z
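The gaps line up suspiciously with the backups: every range starts exactly one second after some backup's restore_to_time, as if oplog streaming restarts from each new snapshot instead of continuing. A quick script over the output above (times copied verbatim):

```python
from datetime import datetime, timedelta

fmt = "%Y-%m-%dT%H:%M:%SZ"
# PITR ranges as reported above
ranges = [
    ("2023-07-03T05:56:50Z", "2023-07-03T07:50:28Z"),
    ("2023-07-03T07:55:59Z", "2023-07-03T08:10:37Z"),
    ("2023-07-03T08:25:32Z", "2023-07-03T08:35:31Z"),
    ("2023-07-03T09:00:50Z", "2023-07-03T09:05:57Z"),
    ("2023-07-03T09:15:52Z", "2023-07-03T09:20:48Z"),
    ("2023-07-03T09:26:33Z", "2023-07-03T09:41:04Z"),
]
# restore_to_time of the backups each range appears to start from
restore_to = [
    "2023-07-03T05:56:49Z", "2023-07-03T07:55:58Z", "2023-07-03T08:25:31Z",
    "2023-07-03T09:00:49Z", "2023-07-03T09:15:51Z", "2023-07-03T09:26:32Z",
]

gaps = [(a_end, b_start) for (_, a_end), (b_start, _) in zip(ranges, ranges[1:])]
print(f"{len(gaps)} gaps")  # 5 gaps between 6 ranges

# every range begins exactly one second after a backup's restore_to_time
for (start, _), rt in zip(ranges, restore_to):
    delta = datetime.strptime(start, fmt) - datetime.strptime(rt, fmt)
    assert delta == timedelta(seconds=1)
```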

I expected a single continuous range:

 PITR <on>:
  2023-07-03T05:56:50Z - 2023-07-03T09:41:04Z

In a backup-agent pod I see these logs:

2023-07-03T07:53:38.000+0000 E [pitr] streaming oplog: epoch mismatch. Got sleep in {1688370631 9}, woke up in {1688370639 1}. Too old for that stuff.
2023-07-03T07:55:54.000+0000 E [pitr] streaming oplog: get backup start TS: run out of tries


P.S. One question aside: what is the proper way to choose a specific backup as the base for PITR when I want to perform a restore?


Hello @SlavaUtesinov ,

thanks for sharing it. Well written and quite easy to understand.

I see that your schedule takes backups every 5 minutes. I have a feeling that backups might not even complete before the new one starts, and this causes the problem. Do you have some data in the cluster? If so, how much (GBs)?

I will try to reproduce it this or next week.

Hi, @Sergey_Pronin!

My cluster is absolutely empty. Moreover, if I turn on only scheduled backups with the same rate (cron) but without the PITR feature, scheduled backups work as expected and the keep parameter is strictly honored.

P.S. I was monitoring the cluster for twenty-four hours to catch this behavior.

Ah, makes sense. Got it now.

We have a feature request to support PITR retention: [K8SPSMDB-801] Add support for PITR retention - Percona JIRA

We will add it in the 1.15 release.
But I'm still a bit confused by the chunk gaps.