I am attempting to restore a large backup (20 GB+) into a Percona Server for MongoDB installation (a replica set of size 3). The restore begins, restores some collection data, and then fails after about 30 seconds with this error:
“check cluster for restore dump done: convergeCluster: lost shard rs0, last beat ts: 1619797459”
I have allocated about 5G of memory per rs0 replica, but this looks like a memory surge/burst issue during the large restore. Any recommendations? Can the restore bandwidth be throttled?
Hello @Vic_Gunter ,
thank you for submitting this!
Judging by the name of the ReplicaSet (rs0), I assume you use Percona Operator to deploy the cluster on Kubernetes. Is it correct?
- If so, could you please share your cr.yaml?
- Where exactly do you see the error?
- Are you trying to recover from a psmdb-backup object in Kubernetes, or in some other manual way? (See the example commands below.)
- Is there anything else we should know about the cluster/backup/data to reproduce the same issue?
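For example, if the restore was started from the Kubernetes side, the backup and restore objects can be listed via their CRD short names (a quick sketch, assuming the standard psmdb-backup / psmdb-restore short names and the cluster's namespace):

kubectl get psmdb-backup
kubectl get psmdb-restore
# status and events of the failed restore, if such an object exists
kubectl describe psmdb-restore <restore-name>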
Yes - installed via the Percona operator instructions.
- What is the best way to share a file here? Uploading a file with the .yaml extension was denied.
- The error is recorded in the restore metadata collection in MongoDB, located here: System -> admin -> Collections -> pbmRestores (a query for it is shown below).
- I am recovering from a successful backup to external S3 storage (the backup is about 20 GB in size).
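The full error document can also be pulled from the mongo shell with something like this (a sketch only; it connects to one of the rs0 pods with the clusterAdmin user from the cluster secrets, password left as a placeholder):

# newest document in the admin.pbmRestores collection
kubectl exec og-percona-cluster-rs0-0 -c mongod -- \
  mongo admin -u clusterAdmin -p '<PASSWORD>' --authenticationDatabase admin \
  --eval 'db.pbmRestores.find().sort({$natural: -1}).limit(1).forEach(printjson)'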
Contents of the cr.yaml file:
apiVersion: psmdb.percona.com/v1-7-0
kind: PerconaServerMongoDB
metadata:
name: og-percona-cluster # VG: Global change
# finalizers:
# - delete-psmdb-pvc
spec:
# platform: openshift
# clusterServiceDNSSuffix: svc.cluster.local
# pause: true
crVersion: 1.7.0
image: percona/percona-server-mongodb:4.4.3-5
imagePullPolicy: Always
# imagePullSecrets:
# - name: private-registry-credentials
# runUid: 1001
allowUnsafeConfigurations: false
updateStrategy: SmartUpdate
upgradeOptions:
versionServiceEndpoint: https://check.percona.com
apply: recommended
schedule: "0 2 * * *"
secrets:
users: og-percona-cluster-secrets
pmm:
enabled: true
image: percona/pmm-client:2.12.0
serverHost: percona-monitor-service
# mongodParams: --environment=ENVIRONMENT
# mongosParams: --environment=ENVIRONMENT
replsets:
- name: rs0
size: 3
# storage: # VG: Uncommented this entire storage section
# engine: wiredTiger
# inMemory:
# engineConfig:
# inMemorySizeRatio: 0.9
# wiredTiger:
# engineConfig:
# cacheSizeRatio: 0.5
# directoryForIndexes: false
# journalCompressor: snappy
# collectionConfig:
# blockCompressor: snappy
# indexConfig:
# prefixCompression: true
affinity:
antiAffinityTopologyKey: "kubernetes.io/hostname"
# advanced:
# nodeAffinity:
# requiredDuringSchedulingIgnoredDuringExecution:
# nodeSelectorTerms:
# - matchExpressions:
# - key: kubernetes.io/e2e-az-name
# operator: In
# values:
# - e2e-az1
# - e2e-az2
# tolerations:
# - key: "node.alpha.kubernetes.io/unreachable"
# operator: "Exists"
# effect: "NoExecute"
# tolerationSeconds: 6000
# priorityClassName: high-priority
# annotations:
# iam.amazonaws.com/role: role-arn
# labels:
# rack: rack-22
# nodeSelector:
# disktype: ssd
# livenessProbe:
# failureThreshold: 4
# initialDelaySeconds: 60
# periodSeconds: 30
# successThreshold: 1
# timeoutSeconds: 5
# startupDelaySeconds: 7200
# runtimeClassName: image-rc
# sidecars:
# - image: busybox
# command: ["/bin/sh"]
# args: ["-c", "while true; do echo echo $(date -u) 'test' >> /dev/null; sleep 5;done"]
# name: rs-sidecar-1
podDisruptionBudget:
maxUnavailable: 1
# minAvailable: 0
expose:
enabled: false
exposeType: LoadBalancer
# loadBalancerSourceRanges:
# - 10.0.0.0/8
# serviceAnnotations:
# service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
arbiter:
enabled: false
size: 1
affinity:
antiAffinityTopologyKey: "kubernetes.io/hostname"
# advanced:
# nodeAffinity:
# requiredDuringSchedulingIgnoredDuringExecution:
# nodeSelectorTerms:
# - matchExpressions:
# - key: kubernetes.io/e2e-az-name
# operator: In
# values:
# - e2e-az1
# - e2e-az2
# tolerations:
# - key: "node.alpha.kubernetes.io/unreachable"
# operator: "Exists"
# effect: "NoExecute"
# tolerationSeconds: 6000
# priorityClassName: high-priority
# annotations:
# iam.amazonaws.com/role: role-arn
# labels:
# rack: rack-22
# nodeSelector:
# disktype: ssd
# schedulerName: "default"
resources:
limits:
cpu: "300m"
memory: "5G" # VG
requests:
cpu: "300m"
memory: "500m" # VG
volumeSpec:
# emptyDir: {}
# hostPath:
# path: /data
# type: Directory
persistentVolumeClaim:
# storageClassName: standard
# accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 80Gi # VG
sharding:
enabled: true
configsvrReplSet:
size: 3
affinity:
antiAffinityTopologyKey: "kubernetes.io/hostname"
# advanced:
# nodeAffinity:
# requiredDuringSchedulingIgnoredDuringExecution:
# nodeSelectorTerms:
# - matchExpressions:
# - key: kubernetes.io/e2e-az-name
# operator: In
# values:
# - e2e-az1
# - e2e-az2
# tolerations:
# - key: "node.alpha.kubernetes.io/unreachable"
# operator: "Exists"
# effect: "NoExecute"
# tolerationSeconds: 6000
# priorityClassName: high-priority
# annotations:
# iam.amazonaws.com/role: role-arn
# labels:
# rack: rack-22
# nodeSelector:
# disktype: ssd
# storage:
# engine: wiredTiger
# wiredTiger:
# engineConfig:
# cacheSizeRatio: 0.5
# directoryForIndexes: false
# journalCompressor: snappy
# collectionConfig:
# blockCompressor: snappy
# indexConfig:
# prefixCompression: true
# runtimeClassName: image-rc
# sidecars:
# - image: busybox
# command: ["/bin/sh"]
# args: ["-c", "while true; do echo echo $(date -u) 'test' >> /dev/null; sleep 5;done"]
# name: rs-sidecar-1
podDisruptionBudget:
maxUnavailable: 1
resources:
limits:
cpu: "300m"
memory: "0.5G"
requests:
cpu: "300m"
memory: "0.5G"
volumeSpec:
# emptyDir: {}
# hostPath:
# path: /data
# type: Directory
persistentVolumeClaim:
# storageClassName: standard
# accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 3Gi
mongos:
size: 3
affinity:
antiAffinityTopologyKey: "kubernetes.io/hostname"
# advanced:
# nodeAffinity:
# requiredDuringSchedulingIgnoredDuringExecution:
# nodeSelectorTerms:
# - matchExpressions:
# - key: kubernetes.io/e2e-az-name
# operator: In
# values:
# - e2e-az1
# - e2e-az2
# tolerations:
# - key: "node.alpha.kubernetes.io/unreachable"
# operator: "Exists"
# effect: "NoExecute"
# tolerationSeconds: 6000
# priorityClassName: high-priority
# annotations:
# iam.amazonaws.com/role: role-arn
# labels:
# rack: rack-22
# nodeSelector:
# disktype: ssd
# runtimeClassName: image-rc
# sidecars:
# - image: busybox
# command: ["/bin/sh"]
# args: ["-c", "while true; do echo echo $(date -u) 'test' >> /dev/null; sleep 5;done"]
# name: rs-sidecar-1
podDisruptionBudget:
maxUnavailable: 1
resources:
limits:
cpu: "300m"
memory: "0.5G"
requests:
cpu: "300m"
memory: "0.5G"
expose:
exposeType: ClusterIP
# loadBalancerSourceRanges:
# - 10.0.0.0/8
# serviceAnnotations:
# service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
# auditLog:
# destination: file
# format: BSON
# filter: '{}'
mongod:
net:
port: 27017
hostPort: 0
security:
redactClientLogData: false
enableEncryption: true
encryptionKeySecret: og-percona-cluster-mongodb-encryption-key
encryptionCipherMode: AES256-CBC
setParameter:
ttlMonitorSleepSecs: 60
wiredTigerConcurrentReadTransactions: 128
wiredTigerConcurrentWriteTransactions: 128
storage:
engine: wiredTiger
inMemory:
engineConfig:
inMemorySizeRatio: 0.9
wiredTiger:
engineConfig:
cacheSizeRatio: 0.5
directoryForIndexes: false
journalCompressor: snappy
collectionConfig:
blockCompressor: snappy
indexConfig:
prefixCompression: true
operationProfiling:
mode: slowOp
slowOpThresholdMs: 100
rateLimit: 100
# auditLog:
# destination: file
# format: BSON
# filter: '{}'
backup: # VG: Setup backups and resources
enabled: true
restartOnFailure: true
image: percona/percona-server-mongodb-operator:1.7.0-backup
serviceAccountName: percona-server-mongodb-operator
resources:
limits:
cpu: "300m"
memory: "0.5G"
requests:
cpu: "300m"
memory: "0.5G"
storages:
# s3-us-west:
# type: s3
# s3:
# bucket: S3-BACKUP-BUCKET-NAME-HERE
# credentialsSecret: og-percona-cluster-backup-s3
# region: us-west-2
# minio:
# type: s3
# s3:
# bucket: MINIO-BACKUP-BUCKET-NAME-HERE
# region: us-east-1
# credentialsSecret: og-percona-cluster-backup-minio
# endpointUrl: http://minio.psmdb.svc.cluster.local:9000/minio/
s3-us-east-2:
type: s3
s3:
bucket: newsreader-backup
region: us-east-2
credentialsSecret: og-percona-cluster-backup-s3
endpointUrl: s3.us-east-2.wasabisys.com
tasks:
# - name: daily-s3-us-west
# enabled: true
# schedule: "0 0 * * *"
# keep: 3
# storageName: s3-us-west
# compressionType: gzip
- name: weekly-s3-us-east-2
enabled: true
schedule: "0 0 * * 0" # At 00:00 on Sunday - see https://crontab.guru/
keep: 3
storageName: s3-us-east-2
compressionType: gzip
@Vic_Gunter there are a few things that I see:
- I have never tested how Wasabi works, but I assume it is 100% S3-compatible. It might be worth trying another S3-compatible storage.
- We use Percona Backup for MongoDB (PBM) to take backups and restore them in the Operator. Could you please check if there is anything suspicious in its logs?
kubectl logs --tail=2000 og-percona-cluster-rs0-0 -c backup-agent
- In your monitoring system, do you see the container hitting its memory limit? 5 GB should be enough for most cases. (A couple of quick checks are sketched below.)
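If it helps, here are the quick checks (a sketch; pod names are taken from your cr.yaml, and pbm is called from inside the backup-agent container, where it is already configured to reach the cluster):

# PBM's own view of the agents, the configured storage, and the last backup/restore
kubectl exec og-percona-cluster-rs0-0 -c backup-agent -- pbm status

# current memory usage of the rs0 pods vs. the 5G limit (requires metrics-server)
kubectl top pod og-percona-cluster-rs0-0 og-percona-cluster-rs0-1 og-percona-cluster-rs0-2

If the rs0 containers are close to the limit or getting OOM-killed during the restore, raising resources.limits.memory for the rs0 replset in cr.yaml and letting the pods restart would be the first thing to try.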