Scheduled backups aren't deleted, pitr is not continuous

Hello!

I turned on backups’ stuff alongside pitr, than I created a full (on-demand) backup (2023-07-03T05:56:49Z) as a base backup for pitr, than I set scheduled backups with cron (2023-07-03T07:55:58Z - is the first scheduled backup).

Configuration:

backup:
  enabled: true
  tasks:
    - name: {{name}}
      storageName: main        
      enabled: true
      keep: 2        
      schedule: */5 * * * *      
      type: logical
  pitr:
    enabled: true
    oplogSpanMin: 3
  storages:
    main:
      s3:
        bucket: {{bucket}}
        credentialsSecret: {{credentialsSecret}}
        endpointUrl: {{endpointUrl}}
        prefix: {{prefix}}
        region: {{region}}
      type: s3

After some time I have the following actual result:

Backup snapshots:
  2023-07-03T05:56:12Z <logical> [restore_to_time: 2023-07-03T05:56:49Z]
  2023-07-03T07:55:21Z <logical> [restore_to_time: 2023-07-03T07:55:58Z]
  2023-07-03T08:25:21Z <logical> [restore_to_time: 2023-07-03T08:25:31Z]
  2023-07-03T09:00:41Z <logical> [restore_to_time: 2023-07-03T09:00:49Z]
  2023-07-03T09:15:41Z <logical> [restore_to_time: 2023-07-03T09:15:51Z]
  2023-07-03T09:25:47Z <logical> [restore_to_time: 2023-07-03T09:26:32Z]
  2023-07-03T09:30:56Z <logical> [restore_to_time: 2023-07-03T09:31:07Z]
  2023-07-03T09:40:55Z <logical> [restore_to_time: 2023-07-03T09:41:04Z]

As you can see keep: 2 doesn’t work, because at this case it would have been only three backups (expected result):

Backup snapshots:
  2023-07-03T05:56:12Z <logical> [restore_to_time: 2023-07-03T05:56:49Z]
  2023-07-03T09:30:56Z <logical> [restore_to_time: 2023-07-03T09:36:07Z]
  2023-07-03T09:40:55Z <logical> [restore_to_time: 2023-07-03T09:41:04Z]

At operator logs I see the following errors:
2023-07-03T09:50:40.305Z ERROR failed to run finalizer {"controller": "perconaservermongodbbackup-controller", "object": {"name":"cron-cloud-documentdb-20230703092500-xppp2","namespace":"documentdb40"}, "namespace": "documentdb40", "name": "cron-cloud-documentdb-20230703092500-xppp2", "reconcileID": "0c3ea5d8-0d7a-4930-a6b9-2e12b99da437", "finalizer": "delete-backup", "error": "unable to delete: backup is a base for '2023-07-03T09:26:32 - 2023-07-03T09:41:04'", "errorVerbose": "unable to delete: backup is a base for '2023-07-03T09:26:32 - 2023-07-03T09:41:04'\ngithub.com/percona/percona-backup-mongodb/pbm.(*PBM).probeDelete\n\t/go/pkg/mod/github.com/percona/percona-backup-mongodb@v0.0.0-20221024072933-3ec38a5fc670/pbm/delete.go:63\ngithub.com/percona/percona-backup-mongodb/pbm.(*PBM).DeleteBackup\n\t/go/pkg/mod/github.com/percona/percona-backup-mongodb@v0.0.0-20221024072933-3ec38a5fc670/pbm/delete.go:29\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodbbackup.(*ReconcilePerconaServerMongoDBBackup).checkFinalizers\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodbbackup/perconaservermongodbbackup_controller.go:366\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodbbackup.(*ReconcilePerconaServerMongoDBBackup).Reconcile\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodbbackup/perconaservermongodbbackup_controller.go:177\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:235\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594"} github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodbbackup.(*ReconcilePerconaServerMongoDBBackup).checkFinalizers /go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodbbackup/perconaservermongodbbackup_controller.go:368 github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodbbackup.(*ReconcilePerconaServerMongoDBBackup).Reconcile /go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodbbackup/perconaservermongodbbackup_controller.go:177 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:122 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:323 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:274 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:235

Moreover pitr ranges have gaps:

 PITR <on>:
  2023-07-03T05:56:50Z - 2023-07-03T07:50:28Z
  2023-07-03T07:55:59Z - 2023-07-03T08:10:37Z
  2023-07-03T08:25:32Z - 2023-07-03T08:35:31Z
  2023-07-03T09:00:50Z - 2023-07-03T09:05:57Z
  2023-07-03T09:15:52Z - 2023-07-03T09:20:48Z
  2023-07-03T09:26:33Z - 2023-07-03T09:41:04Z

I expected the only one continuous range:

 PITR <on>:
  2023-07-03T05:56:50Z - 2023-07-03T09:41:04Z

At a backup-agent pod I see logs:

2023-07-03T07:53:38.000+0000 E [pitr] streaming oplog: epoch mismatch. Got sleep in {1688370631 9}, woke up in {1688370639 1}. Too old for that stuff.
2023-07-03T07:55:54.000+0000 E [pitr] streaming oplog: get backup start TS: run out of tries

Environment:
percona-server-mongodb-operator:1.14.0
percona-server-mongodb:6.0.4-3
percona-backup-mongodb:2.0.4

P.S. One question aside - what is a proper way to choose specific backup as a base for pitr when I want to make restore?

Thanks!

Hello @SlavaUtesinov ,

thanks for sharing it. Well written and quite easy to understand.

I see that your schedule takes backups every 5 mins. I have a feeling that backups might not even complete when the new one starts and this causes the prob/ Do you have some data in the cluster? If so, how much (GBs)?

I will try to reproduce it this or next week.

Hi, @Sergey_Pronin!

My cluster is absolutely empty. Moreover, if I switch on only scheduled backups with the same rate(cron) but without pitr feature, scheduled backups work as expected and keep parameter is strictly honored.

P.S. I was monitoring the cluster for twenty-four hours to catch this behavior.

Ah, makes sense. Got it now.

We have a feature request to support PITR retention: [K8SPSMDB-801] Add support for PITR retention - Percona JIRA

We will add it in 1.15 release.
But I’m still confused a bit with the chunk gaps.