Throttling error - MongoDB restore in Kubernetes

I am attempting to restore a large backup (20 GB+) into a Percona Server for MongoDB installation (replica set size of 3). The restore begins, restores some collection data, and then fails after about 30 seconds with this error:

“check cluster for restore dump done: convergeCluster: lost shard rs0, last beat ts: 1619797459”

I have allocated about 5 GB of memory per rs0 replica, but this appears to be a surge/burst issue during the large restore. Any recommendations? Can the restore bandwidth be throttled?


Hello @Vic_Gunter,

thank you for submitting this!
Judging by the name of the ReplicaSet (rs0), I assume you used the Percona Operator to deploy the cluster on Kubernetes. Is that correct?

  1. If so, could you please share your cr.yaml?
  2. Where do you see the error exactly?
  3. Are you trying to recover from a psmdb-backup object in k8s (i.e., with a restore manifest like the sketch below), or in some other manual way?
  4. Anything else that we should know about the cluster/backup/data to reproduce the same issue?
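
For question 3: the operator-driven way is to apply a PerconaServerMongoDBRestore object. A minimal sketch for reference (the object, cluster, and backup names here are placeholders, not taken from your cluster):

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBRestore
metadata:
  name: restore1
spec:
  clusterName: my-cluster-name # name of the PerconaServerMongoDB resource
  backupName: backup1 # name of the psmdb-backup object to restore from

After applying it with kubectl apply, you can follow its progress with kubectl get psmdb-restore.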

Yes - installed via the Percona Operator instructions.

  1. What is the best method to share a file here? Uploading a file with a .yaml extension was denied.
  2. The error is in the pbmRestores MongoDB collection, located at System -> admin -> Collections -> pbmRestores (a query to pull the document is below this list).
  3. I am recovering from a successful backup to external S3 storage (the backup is about 20 GB in size).
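
For reference, the full error document can be pulled from the mongo shell with something like this (a sketch; how you connect depends on your users secret and service names):

// run against the admin database; the sort grabs the newest restore document
use admin
db.pbmRestores.find().sort({_id: -1}).limit(1).pretty()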

Contents of the cr.yaml file:

apiVersion: psmdb.percona.com/v1-7-0
kind: PerconaServerMongoDB
metadata:
  name: og-percona-cluster # VG: Global change
#  finalizers:
#    - delete-psmdb-pvc
spec:
  #  platform: openshift
  #  clusterServiceDNSSuffix: svc.cluster.local
  #  pause: true
  crVersion: 1.7.0
  image: percona/percona-server-mongodb:4.4.3-5
  imagePullPolicy: Always
  #  imagePullSecrets:
  #    - name: private-registry-credentials
  #  runUid: 1001
  allowUnsafeConfigurations: false
  updateStrategy: SmartUpdate
  upgradeOptions:
    versionServiceEndpoint: https://check.percona.com
    apply: recommended
    schedule: "0 2 * * *"
  secrets:
    users: og-percona-cluster-secrets
  pmm:
    enabled: true
    image: percona/pmm-client:2.12.0
    serverHost: percona-monitor-service
  #    mongodParams: --environment=ENVIRONMENT
  #    mongosParams: --environment=ENVIRONMENT
  replsets:
    - name: rs0
      size: 3
      # storage: # VG: Uncommented this entire storage section
      #   engine: wiredTiger
      #   inMemory:
      #     engineConfig:
      #       inMemorySizeRatio: 0.9
      #   wiredTiger:
      #     engineConfig:
      #       cacheSizeRatio: 0.5
      #       directoryForIndexes: false
      #       journalCompressor: snappy
      #     collectionConfig:
      #       blockCompressor: snappy
      #     indexConfig:
      #       prefixCompression: true
      affinity:
        antiAffinityTopologyKey: "kubernetes.io/hostname"
      #      advanced:
      #        nodeAffinity:
      #          requiredDuringSchedulingIgnoredDuringExecution:
      #            nodeSelectorTerms:
      #            - matchExpressions:
      #              - key: kubernetes.io/e2e-az-name
      #                operator: In
      #                values:
      #                - e2e-az1
      #                - e2e-az2
      #    tolerations:
      #    - key: "node.alpha.kubernetes.io/unreachable"
      #      operator: "Exists"
      #      effect: "NoExecute"
      #      tolerationSeconds: 6000
      #    priorityClassName: high-priority
      #    annotations:
      #      iam.amazonaws.com/role: role-arn
      #    labels:
      #      rack: rack-22
      #    nodeSelector:
      #      disktype: ssd
      #    livenessProbe:
      #      failureThreshold: 4
      #      initialDelaySeconds: 60
      #      periodSeconds: 30
      #      successThreshold: 1
      #      timeoutSeconds: 5
      #      startupDelaySeconds: 7200
      #    runtimeClassName: image-rc
      #    sidecars:
      #    - image: busybox
      #      command: ["/bin/sh"]
      #      args: ["-c", "while true; do echo echo $(date -u) 'test' >> /dev/null; sleep 5;done"]
      #      name: rs-sidecar-1
      podDisruptionBudget:
        maxUnavailable: 1
      #      minAvailable: 0
      expose:
        enabled: false
        exposeType: LoadBalancer
      #      loadBalancerSourceRanges:
      #        - 10.0.0.0/8
      #      serviceAnnotations:
      #        service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
      arbiter:
        enabled: false
        size: 1
        affinity:
          antiAffinityTopologyKey: "kubernetes.io/hostname"
      #        advanced:
      #          nodeAffinity:
      #            requiredDuringSchedulingIgnoredDuringExecution:
      #              nodeSelectorTerms:
      #              - matchExpressions:
      #                - key: kubernetes.io/e2e-az-name
      #                  operator: In
      #                  values:
      #                  - e2e-az1
      #                  - e2e-az2
      #      tolerations:
      #      - key: "node.alpha.kubernetes.io/unreachable"
      #        operator: "Exists"
      #        effect: "NoExecute"
      #        tolerationSeconds: 6000
      #      priorityClassName: high-priority
      #      annotations:
      #        iam.amazonaws.com/role: role-arn
      #      labels:
      #        rack: rack-22
      #      nodeSelector:
      #        disktype: ssd
      #    schedulerName: "default"
      resources:
        limits:
          cpu: "300m"
          memory: "5G" # VG
        requests:
          cpu: "300m"
          memory: "500m" # VG
      volumeSpec:
        #      emptyDir: {}
        #      hostPath:
        #        path: /data
        #        type: Directory
        persistentVolumeClaim:
          #        storageClassName: standard
          #        accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 80Gi # VG

  sharding:
    enabled: true

    configsvrReplSet:
      size: 3
      affinity:
        antiAffinityTopologyKey: "kubernetes.io/hostname"
      #        advanced:
      #          nodeAffinity:
      #            requiredDuringSchedulingIgnoredDuringExecution:
      #              nodeSelectorTerms:
      #              - matchExpressions:
      #                - key: kubernetes.io/e2e-az-name
      #                  operator: In
      #                  values:
      #                  - e2e-az1
      #                  - e2e-az2
      #      tolerations:
      #      - key: "node.alpha.kubernetes.io/unreachable"
      #        operator: "Exists"
      #        effect: "NoExecute"
      #        tolerationSeconds: 6000
      #      priorityClassName: high-priority
      #      annotations:
      #        iam.amazonaws.com/role: role-arn
      #      labels:
      #        rack: rack-22
      #      nodeSelector:
      #        disktype: ssd
      #      storage:
      #        engine: wiredTiger
      #        wiredTiger:
      #          engineConfig:
      #            cacheSizeRatio: 0.5
      #            directoryForIndexes: false
      #            journalCompressor: snappy
      #          collectionConfig:
      #            blockCompressor: snappy
      #          indexConfig:
      #            prefixCompression: true
      #      runtimeClassName: image-rc
      #      sidecars:
      #      - image: busybox
      #        command: ["/bin/sh"]
      #        args: ["-c", "while true; do echo echo $(date -u) 'test' >> /dev/null; sleep 5;done"]
      #        name: rs-sidecar-1
      podDisruptionBudget:
        maxUnavailable: 1
      resources:
        limits:
          cpu: "300m"
          memory: "0.5G"
        requests:
          cpu: "300m"
          memory: "0.5G"
      volumeSpec:
        #       emptyDir: {}
        #       hostPath:
        #         path: /data
        #         type: Directory
        persistentVolumeClaim:
          #          storageClassName: standard
          #          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 3Gi

    mongos:
      size: 3
      affinity:
        antiAffinityTopologyKey: "kubernetes.io/hostname"
      #        advanced:
      #          nodeAffinity:
      #            requiredDuringSchedulingIgnoredDuringExecution:
      #              nodeSelectorTerms:
      #              - matchExpressions:
      #                - key: kubernetes.io/e2e-az-name
      #                  operator: In
      #                  values:
      #                  - e2e-az1
      #                  - e2e-az2
      #      tolerations:
      #      - key: "node.alpha.kubernetes.io/unreachable"
      #        operator: "Exists"
      #        effect: "NoExecute"
      #        tolerationSeconds: 6000
      #      priorityClassName: high-priority
      #      annotations:
      #        iam.amazonaws.com/role: role-arn
      #      labels:
      #        rack: rack-22
      #      nodeSelector:
      #        disktype: ssd
      #      runtimeClassName: image-rc
      #      sidecars:
      #      - image: busybox
      #        command: ["/bin/sh"]
      #        args: ["-c", "while true; do echo echo $(date -u) 'test' >> /dev/null; sleep 5;done"]
      #        name: rs-sidecar-1
      podDisruptionBudget:
        maxUnavailable: 1
      resources:
        limits:
          cpu: "300m"
          memory: "0.5G"
        requests:
          cpu: "300m"
          memory: "0.5G"
      expose:
        exposeType: ClusterIP
  #        loadBalancerSourceRanges:
  #          - 10.0.0.0/8
  #        serviceAnnotations:
  #          service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
  #      auditLog:
  #        destination: file
  #        format: BSON
  #        filter: '{}'

  mongod:
    net:
      port: 27017
      hostPort: 0
    security:
      redactClientLogData: false
      enableEncryption: true
      encryptionKeySecret: og-percona-cluster-mongodb-encryption-key
      encryptionCipherMode: AES256-CBC
    setParameter:
      ttlMonitorSleepSecs: 60
      wiredTigerConcurrentReadTransactions: 128
      wiredTigerConcurrentWriteTransactions: 128
    storage:
      engine: wiredTiger
      inMemory:
        engineConfig:
          inMemorySizeRatio: 0.9
      wiredTiger:
        engineConfig:
          cacheSizeRatio: 0.5
          directoryForIndexes: false
          journalCompressor: snappy
        collectionConfig:
          blockCompressor: snappy
        indexConfig:
          prefixCompression: true
    operationProfiling:
      mode: slowOp
      slowOpThresholdMs: 100
      rateLimit: 100
  #    auditLog:
  #      destination: file
  #      format: BSON
  #      filter: '{}'

  backup: # VG: Setup backups and resources
    enabled: true
    restartOnFailure: true
    image: percona/percona-server-mongodb-operator:1.7.0-backup
    serviceAccountName: percona-server-mongodb-operator
    resources:
      limits:
        cpu: "300m"
        memory: "0.5G"
      requests:
        cpu: "300m"
        memory: "0.5G"
    storages:
      #      s3-us-west:
      #        type: s3
      #        s3:
      #          bucket: S3-BACKUP-BUCKET-NAME-HERE
      #          credentialsSecret: og-percona-cluster-backup-s3
      #          region: us-west-2
      #      minio:
      #        type: s3
      #        s3:
      #          bucket: MINIO-BACKUP-BUCKET-NAME-HERE
      #          region: us-east-1
      #          credentialsSecret: og-percona-cluster-backup-minio
      #          endpointUrl: http://minio.psmdb.svc.cluster.local:9000/minio/
      s3-us-east-2:
        type: s3
        s3:
          bucket: newsreader-backup
          region: us-east-2
          credentialsSecret: og-percona-cluster-backup-s3
          endpointUrl: s3.us-east-2.wasabisys.com
    tasks:
      #      - name: daily-s3-us-west
      #        enabled: true
      #        schedule: "0 0 * * *"
      #        keep: 3
      #        storageName: s3-us-west
      #        compressionType: gzip
      - name: weekly-s3-us-east-2
        enabled: true
        schedule: "0 0 * * 0" # At 00:00 on Sunday - see https://crontab.guru/
        keep: 3
        storageName: s3-us-east-2
        compressionType: gzip

@Vic_Gunter there are a few things that I see:

  1. I have never tested how Wasabi works, but I assume it is 100% S3-compatible. It might be worth trying another S3-compatible storage.
  2. We use Percona Backup for MongoDB to take backups and run restores in the Operator. Could you please check if there is anything suspicious in the agent logs?

kubectl logs --tail=2000 og-percona-cluster-rs0-0 -c backup-agent

  3. In your monitoring system, do you see the container hitting the memory limits? 5 GB should be enough for most cases. (If you do not have monitoring in place, a quick spot check is below.)
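
For the spot check, this shows per-container memory usage on the pod (it assumes metrics-server is installed in the cluster):

kubectl top pod og-percona-cluster-rs0-0 --containers

That breaks usage down by container, so you can see whether it is mongod or backup-agent that approaches the limit during the restore.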