All Backups Error

Hi,

Hoping someone can help! We’ve set up a MongoDB Sharded Cluster via the MongoDB Operator - all working well, performance is great, and we’re really happy. However - the last thing we need to work out is backups. We’re on AWS, trying to upload to S3.

I’ve followed the guide and done the following:

  • Updated cr.yaml to enable backups and set the storage details
  • Applied a backup-secret.yaml file containing base64-encoded AWS credentials
  • Added a daily backup task in cr.yaml

All of the backups error out, whether triggered manually or by the scheduled task.
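
For reference, the manual backups are just on-demand psmdb-backup resources like this (the same spec that shows up in the last-applied-configuration in the output further down), applied with a plain kubectl apply -f:

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBBackup
metadata:
  name: backup1
  namespace: mongo
spec:
  psmdbCluster: my-cluster-name
  storageName: s3-eu-west-2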

cr.yaml:

  backup:
    enabled: true
    debug: true
    restartOnFailure: true
    image: percona/percona-server-mongodb-operator:1.11.0-backup
    serviceAccountName: percona-server-mongodb-operator
    storages:
      s3-eu-west-2:
        type: s3
        s3:
          bucket: <bucket-name>
          credentialsSecret: my-cluster-name-backup-s3
          region: eu-west-2
    pitr:
      enabled: false
    tasks:
      - name: daily-s3-eu-west
        enabled: true
        schedule: "0 0 * * *"
        keep: 3
        storageName: s3-eu-west-2
        compressionType: gzip

backup-secret.yaml:

apiVersion: v1
kind: Secret
metadata:
  name: my-cluster-name-backup-s3
type: Opaque
data:
  AWS_ACCESS_KEY_ID: <key>
  AWS_SECRET_ACCESS_KEY: <key>

The output of “kubectl get psmdb-backup” is as follows:

NAME                                        CLUSTER           STORAGE        DESTINATION            STATUS   COMPLETED   AGE
backup1                                     my-cluster-name   s3-eu-west-2   2022-03-21T09:23:33Z   error                87m
backup2                                     my-cluster-name   s3-eu-west-2   2022-03-21T10:08:05Z   error                42m
cron-my-cluster-name-20220317000002-d5nwq   my-cluster-name   s3-eu-west-2   2022-03-17T00:00:24Z   error                4d10h
cron-my-cluster-name-20220318000002-tzmhn   my-cluster-name   s3-eu-west-2   2022-03-18T00:00:24Z   error                3d10h
cron-my-cluster-name-20220319000002-lc54j   my-cluster-name   s3-eu-west-2   2022-03-19T00:00:24Z   error                2d10h
cron-my-cluster-name-20220320000001-mcrzw   my-cluster-name   s3-eu-west-2   2022-03-20T00:00:23Z   error                34h
cron-my-cluster-name-20220321000001-2t876   my-cluster-name   s3-eu-west-2   2022-03-21T00:00:23Z   error                10h

And the error I get from all of the backups (from running “kubectl get psmdb-backup -o yaml”) is as follows:

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBBackup
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"psmdb.percona.com/v1","kind":"PerconaServerMongoDBBackup","metadata":{"annotations":{},"name":"backup1","namespace":"mongo"},"spec":{"psmdbCluster":"my-cluster-name","storageName":"s3-eu-west-2"}}
  creationTimestamp: "2022-03-21T09:23:12Z"
  generation: 1
  name: backup1
  namespace: mongo
  resourceVersion: "2189522"
  uid: 89007408-3439-4b52-a406-af71643f7b40
spec:
  psmdbCluster: my-cluster-name
  storageName: s3-eu-west-2
status:
  azure:
    credentialsSecret: ""
  destination: "2022-03-21T09:23:33Z"
  error: starting deadline exceeded
  lastTransition: "2022-03-21T09:23:34Z"
  pbmName: "2022-03-21T09:23:33Z"
  s3:
    bucket: <bucket>
    credentialsSecret: my-cluster-name-backup-s3
    region: eu-west-2
  start: "2022-03-21T09:23:34Z"
  state: error
  storageName: s3-eu-west-2

Can anyone help me diagnose?

Hi @Geo !
It seems this issue can happen in some environments, especially when sharding is used, so I have opened a ticket here: [K8SPSMDB-660] backup error - starting deadline exceeded - Percona JIRA
I recommend watching it to see when the fix is released.

Thanks for reporting!

Thanks @Tomislav_Plavcic :slight_smile:

Will keep an eye on that issue. If there’s anything we can do to mitigate it in the meantime, even manually, it’d really help! I’m a bit concerned we’ve got a production server not being backed up at the moment :sweat_smile:

Hi @Geo !
Here’s what worked for me. I’d advise trying it on a test cluster first to see if the steps work for you.
What happens in my case is that PBM actually finishes the backup, but because of the bug the psmdb-backup resource is marked as error. So I manually changed the resource status and then did the restore, and it worked for me.

Now, due to this: https://github.com/kubernetes/kubectl/issues/564
you can’t change the status directly with kubectl, but there is a kubectl plugin which allows that: GitHub - ulucinar/kubectl-edit-status: A kubectl plugin for editing /status subresource
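
If you use krew, installing it should look something like this (assuming the plugin is published in the krew index; otherwise grab a release binary from the GitHub page):

kubectl krew install edit-status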

I had a situation like this:

NAME      CLUSTER           STORAGE      DESTINATION            STATUS   COMPLETED   AGE
backup1   my-cluster-name   s3-us-west   2022-03-21T15:34:27Z   ready    6m31s       7m11s
backup2   my-cluster-name   s3-us-west   2022-03-21T15:35:26Z   ready    5m33s       6m13s
backup3   my-cluster-name   s3-us-west   2022-03-21T15:36:22Z   ready    4m36s       5m16s
backup4   my-cluster-name   s3-us-west   2022-03-21T15:39:11Z   error                2m27s

but I wanted to restore from backup4, so I ran: kubectl edit-status psmdb-backup backup4
and then, in the status section:

  • deleted the “error” line
  • changed “state” from “error” to “ready”
  • added a “completed:” line using the same datetime as “lastTransition:”, so in my case completed: "2022-03-21T10:39:55Z"

I then saved, ran the restore, and the restore went fine:
NAME      CLUSTER           STORAGE      DESTINATION            STATUS   COMPLETED   AGE
backup1   my-cluster-name   s3-us-west   2022-03-21T15:34:27Z   ready    8m53s       9m33s
backup2   my-cluster-name   s3-us-west   2022-03-21T15:35:26Z   ready    7m55s       8m35s
backup3   my-cluster-name   s3-us-west   2022-03-21T15:36:22Z   ready    6m58s       7m38s
backup4   my-cluster-name   s3-us-west   2022-03-21T15:39:11Z   ready    4m27s       4m49s

NAME       CLUSTER           STATUS   AGE
restore1   my-cluster-name   ready    63s
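
For reference, after the edit the relevant part of backup4’s status ended up roughly like this (all other fields left exactly as they were; the timestamps are from my example and yours will differ):

status:
  completed: "2022-03-21T10:39:55Z"    # added; same datetime as lastTransition
  lastTransition: "2022-03-21T10:39:55Z"
  state: ready                         # changed from error (the error: line was deleted)
  # ...remaining fields untouched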

Now, for this you have to be sure that the PBM backup actually went OK (check the backup-agent container logs, the PBM backup status in the admin.pbm* collections in mongo, and maybe the data on S3).
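
For example, something along these lines (the pod name is from my test cluster, and the exact PBM collection contents can vary by version, so adjust as needed):

# backup-agent logs on one of the replica set pods
kubectl logs my-cluster-name-rs0-0 -c backup-agent

# PBM backup metadata, from a mongo shell connected to the cluster
use admin
db.pbmBackups.find().pretty()

# uploaded files on S3
aws s3 ls s3://<bucket-name>/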

Another option (if you have verified the backup manually) is to ignore the backup resource status and do a restore without the spec.backupName field, using spec.backupSource instead to specify the bucket and storage, as if you were restoring to another cluster which doesn’t have the psmdb-backup resource available. You can try this on the test cluster too.
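
A sketch of such a restore resource, assuming the destination is the s3://<bucket>/<pbm-backup-name> path (the pbm backup name is the pbmName/destination timestamp from the backup status; double-check the fields against your operator version’s docs):

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBRestore
metadata:
  name: restore2
spec:
  clusterName: my-cluster-name
  backupSource:
    destination: s3://<bucket-name>/<pbm-backup-name>
    s3:
      credentialsSecret: my-cluster-name-backup-s3
      region: us-west-2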

I hope this helps! The ticket is currently targeted at the next release, so hopefully it will be fixed soon.

Thanks @Tomislav_Plavcic for all your help on this!

As it turns out… I was being stupid :man_facepalming: Your solution pushed me to go look at the pbmBackups collection in Mongo (I didn’t know that existed before!), which revealed the issue: my AWS creds had line breaks at the end.
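
For anyone else who lands here: if you build the secret by hand, make sure the values go through base64 without a trailing newline, e.g. something like:

echo -n '<access-key-id>' | base64
echo -n '<secret-access-key>' | base64

Or skip the manual encoding entirely and let kubectl do it:

kubectl create secret generic my-cluster-name-backup-s3 \
  --from-literal=AWS_ACCESS_KEY_ID='<access-key-id>' \
  --from-literal=AWS_SECRET_ACCESS_KEY='<secret-access-key>'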

Problem exists between keyboard and chair :smile:
