Hello everyone, I’m facing an issue with backups on k8s. Backups generally work, but every so often a scheduled backup fails to release its lock, and from that point on every subsequent scheduled backup fails. Here is an example of the logs:
2024-06-15T12:30:28.000+0000 I [backup/2024-06-15T12:30:21Z] mongodump finished, waiting for the oplog
2024-06-15T12:30:31.000+0000 I [backup/2024-06-15T12:30:21Z] created chunk 2024-06-15T12:30:22 - 2024-06-15T12:30:29
2024-06-15T12:30:31.000+0000 I [backup/2024-06-15T12:30:21Z] dropping tmp collections
2024-06-15T12:30:33.000+0000 I [backup/2024-06-15T12:30:21Z] backup finished
2024-06-15T12:30:33.000+0000 D [backup/2024-06-15T12:30:21Z] releasing lock
2024-06-15T12:40:42.000+0000 D [pitr] start_catchup
2024-06-15T12:40:42.000+0000 I [pitr] streaming started from 2024-06-15 12:30:29 +0000 UTC / 1718454629
2024-06-15T12:40:42.000+0000 D [pitr] start_ok
2024-06-15T12:50:44.000+0000 I [pitr] created chunk 2024-06-15T12:30:29 - 2024-06-15T12:50:42. Next chunk creation scheduled to begin at ~2024-06-15 13:00:44.215192394 +0000 UTC m=+11916.676303292
2024-06-15T13:00:43.000+0000 I [pitr] created chunk 2024-06-15T12:50:42 - 2024-06-15T13:00:42. Next chunk creation scheduled to begin at ~2024-06-15 13:10:43.915229371 +0000 UTC m=+12516.376340169
2024-06-15T13:10:43.000+0000 I [pitr] created chunk 2024-06-15T13:00:42 - 2024-06-15T13:10:42. Next chunk creation scheduled to begin at ~2024-06-15 13:20:43.938139835 +0000 UTC m=+13116.399250632
2024-06-15T13:20:42.000+0000 I [pitr] created chunk 2024-06-15T13:10:42 - 2024-06-15T13:20:40. Next chunk creation scheduled to begin at ~2024-06-15 13:30:42.322789525 +0000 UTC m=+13714.783900319
2024-06-15T13:30:43.000+0000 I [pitr] created chunk 2024-06-15T13:20:40 - 2024-06-15T13:30:42. Next chunk creation scheduled to begin at ~2024-06-15 13:40:43.905256692 +0000 UTC m=+14316.366367686
2024-06-15T13:40:44.000+0000 I [pitr] created chunk 2024-06-15T13:30:42 - 2024-06-15T13:40:42. Next chunk creation scheduled to begin at ~2024-06-15 13:50:44.157787508 +0000 UTC m=+14916.618898386
2024-06-15T13:50:43.000+0000 I [pitr] created chunk 2024-06-15T13:40:42 - 2024-06-15T13:50:42. Next chunk creation scheduled to begin at ~2024-06-15 14:00:43.938935662 +0000 UTC m=+15516.400046575
2024-06-15T14:00:43.000+0000 I [pitr] created chunk 2024-06-15T13:50:42 - 2024-06-15T14:00:42. Next chunk creation scheduled to begin at ~2024-06-15 14:10:43.888778162 +0000 UTC m=+16116.349889003
2024-06-15T14:10:43.000+0000 I [pitr] created chunk 2024-06-15T14:00:42 - 2024-06-15T14:10:42. Next chunk creation scheduled to begin at ~2024-06-15 14:20:43.940505153 +0000 UTC m=+16716.401616356
2024-06-15T14:20:43.000+0000 I [pitr] created chunk 2024-06-15T14:10:42 - 2024-06-15T14:20:42. Next chunk creation scheduled to begin at ~2024-06-15 14:30:43.922878874 +0000 UTC m=+17316.383989766
2024-06-15T14:30:44.000+0000 I [pitr] created chunk 2024-06-15T14:20:42 - 2024-06-15T14:30:42. Next chunk creation scheduled to begin at ~2024-06-15 14:40:44.151584244 +0000 UTC m=+17916.612695513
2024-06-15T14:40:44.000+0000 I [pitr] created chunk 2024-06-15T14:30:42 - 2024-06-15T14:40:42. Next chunk creation scheduled to begin at ~2024-06-15 14:50:44.041599793 +0000 UTC m=+18516.502710588
2024-06-15T14:50:43.000+0000 I [pitr] created chunk 2024-06-15T14:40:42 - 2024-06-15T14:50:42. Next chunk creation scheduled to begin at ~2024-06-15 15:00:43.924400731 +0000 UTC m=+19116.385511604
2024-06-15T15:00:43.000+0000 I [pitr] created chunk 2024-06-15T14:50:42 - 2024-06-15T15:00:42. Next chunk creation scheduled to begin at ~2024-06-15 15:10:43.89528388 +0000 UTC m=+19716.356394756
2024-06-15T15:10:43.000+0000 I [pitr] created chunk 2024-06-15T15:00:42 - 2024-06-15T15:10:42. Next chunk creation scheduled to begin at ~2024-06-15 15:20:43.883844751 +0000 UTC m=+20316.344955610
2024-06-15T15:20:44.000+0000 I [pitr] created chunk 2024-06-15T15:10:42 - 2024-06-15T15:20:42. Next chunk creation scheduled to begin at ~2024-06-15 15:30:44.115105297 +0000 UTC m=+20916.576216092
2024-06-15T15:30:42.000+0000 E [pitr] streaming oplog: epoch mismatch. Got sleep in {1718454910 4}, woke up in {1718465430 2}. Too old for that stuff.
2024-06-15T15:30:42.000+0000 I got command backup [name: 2024-06-15T15:30:42Z, compression: gzip (level: default)] <ts: 1718465442>
2024-06-15T15:30:42.000+0000 I got epoch {1718465430 2}
2024-06-15T15:30:43.000+0000 D [backup/2024-06-15T15:30:42Z] skip after nomination, probably started by another node
2024-06-15T15:30:43.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 666db3a2ac3d7cb8b24a024d]
2024-06-15T18:33:52.000+0000 I got command backup [name: 2024-06-15T18:33:51Z, compression: gzip (level: default)] <ts: 1718476431>
2024-06-15T18:33:52.000+0000 I got epoch {1718476420 2}
2024-06-15T21:33:31.000+0000 I got command backup [name: 2024-06-15T21:33:30Z, compression: gzip (level: default)] <ts: 1718487210>
2024-06-15T21:33:31.000+0000 I got epoch {1718487198 3}
2024-06-15T21:33:31.000+0000 D [backup/2024-06-15T21:33:30Z] get lock: another operation is running: Snapshot backup '666dde8fac3d7cb8b24a3aaa'
2024-06-15T21:33:31.000+0000 D [backup/2024-06-15T21:33:30Z] skip: lock not acquired
2024-06-16T00:01:03.000+0000 I got command backup [name: 2024-06-16T00:01:03Z, compression: gzip (level: default)] <ts: 1718496063>
2024-06-16T00:01:03.000+0000 I got epoch {1718496052 4}
2024-06-16T00:01:03.000+0000 D [backup/2024-06-16T00:01:03Z] get lock: another operation is running: Snapshot backup '666dde8fac3d7cb8b24a3aaa'
2024-06-16T00:01:03.000+0000 D [backup/2024-06-16T00:01:03Z] skip: lock not acquired
2024-06-16T00:31:45.000+0000 I got command backup [name: 2024-06-16T00:31:45Z, compression: gzip (level: default)] <ts: 1718497905>
2024-06-16T00:31:45.000+0000 I got epoch {1718497893 2}
2024-06-16T00:31:45.000+0000 D [backup/2024-06-16T00:31:45Z] get lock: another operation is running: Snapshot backup '666dde8fac3d7cb8b24a3aaa'
2024-06-16T00:31:45.000+0000 D [backup/2024-06-16T00:31:45Z] skip: lock not acquired
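From what I understand, PBM keeps its op locks in the admin.pbmLock collection, so the stale lock should be visible (and, as a last resort, removable) directly on a replica set member. A minimal sketch, assuming the operator's default mongod container name and omitting auth/TLS flags:

# Inspect the lock PBM still considers held (opid 666dde8fac3d7cb8b24a3aaa above)
kubectl -n payroll exec percona-payroll-payroll-rs-0 -c mongod -- \
  mongosh --quiet --eval 'db.getSiblingDB("admin").pbmLock.find().forEach(printjson)'

# Last resort, and only while no backup is actually running: drop the stale lock document
kubectl -n payroll exec percona-payroll-payroll-rs-0 -c mongod -- \
  mongosh --quiet --eval 'db.getSiblingDB("admin").pbmLock.deleteMany({})'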
Logs from the other two nodes:
2024-06-15T18:33:52.000+0000 I got command backup [name: 2024-06-15T18:33:51Z, compression: gzip (level: default)] <ts: 1718476431>
2024-06-15T18:33:52.000+0000 I got epoch {1718476420 2}
2024-06-15T18:33:52.000+0000 D [backup/2024-06-15T18:33:51Z] init backup meta
2024-06-15T18:33:52.000+0000 D [backup/2024-06-15T18:33:51Z] nomination list for payroll-rs: [[percona-payroll-payroll-rs-0.percona-payroll-payroll-rs.payroll.svc.cluster.local:27017 percona-payroll-payroll-rs-2.percona-payroll-payroll-rs.payroll.svc.cluster.local:27017] [percona-payroll-payroll-rs-1.percona-payroll-payroll-rs.payroll.svc.cluster.local:27017]]
2024-06-15T18:33:52.000+0000 D [backup/2024-06-15T18:33:51Z] nomination payroll-rs, set candidates [percona-payroll-payroll-rs-0.percona-payroll-payroll-rs.payroll.svc.cluster.local:27017 percona-payroll-payroll-rs-2.percona-payroll-payroll-rs.payroll.svc.cluster.local:27017]
2024-06-15T18:33:57.000+0000 D [backup/2024-06-15T18:33:51Z] bcp nomination: payroll-rs won by percona-payroll-payroll-rs-2.percona-payroll-payroll-rs.payroll.svc.cluster.local:27017
2024-06-15T18:33:57.000+0000 D [backup/2024-06-15T18:33:51Z] skip after nomination, probably started by another node
2024-06-15T18:34:01.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 666dde8fac3d7cb8b24a3aaa]
2024-06-15T18:34:16.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 666dde8fac3d7cb8b24a3aaa]
2024-06-15T18:34:31.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 666dde8fac3d7cb8b24a3aaa]
2024-06-15T18:34:46.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 666dde8fac3d7cb8b24a3aaa]
2024-06-15T18:33:52.000+0000 I got command backup [name: 2024-06-15T18:33:51Z, compression: gzip (level: default)] <ts: 1718476431>
2024-06-15T18:33:52.000+0000 I got epoch {1718476420 2}
2024-06-15T18:33:52.000+0000 I [pitr] got wake_up signal
2024-06-15T18:33:52.000+0000 I [pitr] wake_up for bcp 666dde8fac3d7cb8b24a3aaa
2024-06-15T18:33:53.000+0000 D [backup/2024-06-15T18:33:51Z] skip after nomination, probably started by another node
2024-06-15T18:34:06.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 666dde8fac3d7cb8b24a3aaa]
2024-06-15T18:34:21.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 666dde8fac3d7cb8b24a3aaa]
2024-06-15T18:34:25.000+0000 E [pitr] streaming oplog: get backup start TS: run out of tries
2024-06-15T18:34:36.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 666dde8fac3d7cb8b24a3aaa]
2024-06-15T18:34:51.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 666dde8fac3d7cb8b24a3aaa]
2024-06-15T18:35:06.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 666dde8fac3d7cb8b24a3aaa]
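For completeness, this is how I pull PBM's own view of agents, locks, and backups, via the backup-agent sidecar the operator injects (pod name taken from the logs above):

kubectl -n payroll exec percona-payroll-payroll-rs-0 -c backup-agent -- pbm status
kubectl -n payroll exec percona-payroll-payroll-rs-0 -c backup-agent -- pbm list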
Here is my values.yaml:
fullnameOverride: percona-test
pause: false
unmanaged: false
clusterServiceDNSMode: "Internal"
crVersion: 1.16.0
multiCluster:
  enabled: false
updateStrategy: RollingUpdate
upgradeOptions:
  apply: disabled
image:
  repository: percona/percona-server-mongodb
  tag: 7.0.8
tls:
  mode: requireTLS
  certValidityDuration: 876000h
  allowInvalidCertificates: false
secrets:
  encryptionKey: test-rs
  users: test-rs
pmm:
  enabled: true
  serverHost: pmm.pmm.svc.cluster.local
  image:
    tag: 2.42.0
  resources:
    limits:
      memory: 300Mi
replsets:
  rs0:
    name: test-rs
    size: 3
    affinity:
      antiAffinityTopologyKey: "kubernetes.io/hostname"
    podDisruptionBudget:
      maxUnavailable: 1
    tolerations:
      - effect: NoSchedule
        key: stateful
        operator: Exists
    nodeSelector:
      node-role.kubernetes.io/storage: "true"
    configuration: |
      net:
        tls:
          allowConnectionsWithoutCertificates: true
    expose:
      enabled: false
    resources:
      limits:
        memory: "300Mi"
      requests:
        cpu: "50m"
        memory: "300Mi"
    volumeSpec:
      pvc:
        storageClassName: vsphere-csi-xfs
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 3Gi
    nonvoting:
      enabled: false
    arbiter:
      enabled: false
sharding:
  enabled: false
backup:
  enabled: true
  image:
    repository: percona/percona-backup-mongodb
    tag: 2.5.0
  # podSecurityContext: {}
  # containerSecurityContext: {}
  resources:
    limits:
      memory: 2000Mi
    requests:
      cpu: 10m
      memory: 60Mi
  storages:
    minio:
      type: s3
      s3:
        region: dit
        bucket: percona-mongodb-dev
        credentialsSecret: test-rs
        retryer:
          numMaxRetries: 3
          minRetryDelay: 10ms
          maxRetryDelay: 5m
        endpointUrl: https://s3.dev.krd
        prefix: "test"
        insecureSkipTLSVerify: false
  pitr:
    enabled: true
    oplogOnly: false
    oplogSpanMin: 10
    compressionType: gzip
    compressionLevel: 6
  configuration:
    restoreOptions:
      batchSize: 500
      numInsertionWorkers: 10
      numDownloadWorkers: 4
      maxDownloadBufferMb: 0
      downloadChunkMb: 32
  tasks:
    - name: minio-3h
      enabled: true
      schedule: "30 */3 * * *"
      keep: 3
      storageName: minio
      compressionType: gzip
    - name: minio-12AM
      enabled: true
      schedule: "0 0 * * *"
      keep: 3
      storageName: minio
      compressionType: gzip
    - name: minio-weekly
      enabled: true
      schedule: "0 3 */7 * *"
      keep: 4
      storageName: minio
      compressionType: gzip
    - name: minio-monthly
      enabled: true
      schedule: "0 4 */30 * *"
      keep: 3
      storageName: minio
      compressionType: gzip
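To double-check that the operator actually pushed this storage/PITR config into PBM, I can dump the effective config from the same backup-agent sidecar:

kubectl -n payroll exec percona-payroll-payroll-rs-0 -c backup-agent -- pbm config --list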
Redeploying the StatefulSet usually resolves this, but I'd rather prevent the issue than keep restarting pods. Any idea what could be causing it?
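For reference, the restart I currently use as a workaround (StatefulSet name taken from the logs above):

kubectl -n payroll rollout restart statefulset percona-payroll-payroll-rs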