Percona MongoDB operator backup error

Description:

Many backups end in an error state.

Steps to Reproduce:

Configure backups to S3 storage.

Version:

all

Logs:


Destination: s3://accel-webapp-dev-elog-backup/2024-02-21T00:00:21Z
Error: some of pbm-agents were lost during the backup
Last Transition: 2024-02-21T00:00:52Z
Pbm Name: 2024-02-21T00:00:21Z
Pbm Pod: elog-plus-cluster-rs0-0
Replset Names: rs0

Additional Information:

I have a 3-node replica set, and each backup agent shows these logs:

2024-02-21T12:00:21.000+0000 D [backup/2024-02-21T12:00:21Z] init backup meta
2024-02-21T12:00:21.000+0000 D [backup/2024-02-21T12:00:21Z] nomination list for rs0: [[elog-plus-cluster-rs0-1.elog-plus-cluster-rs0.elog-plus.svc.cluster.local:27017 elog-plus-cluster-rs0-2.elog-plus-c
2024-02-21T12:00:21.000+0000 D [backup/2024-02-21T12:00:21Z] nomination rs0, set candidates [elog-plus-cluster-rs0-1.elog-plus-cluster-rs0.elog-plus.svc.cluster.local:27017 elog-plus-cluster-rs0-2.elog-p
2024-02-21T12:00:22.000+0000 D [backup/2024-02-21T12:00:21Z] skip after nomination, probably started by another node
2024-02-21T12:00:26.000+0000 D [pitr] set pitr span to 20m0s
2024-02-21T12:00:26.000+0000 D [backup/2024-02-21T12:00:21Z] bcp nomination: rs0 won by elog-plus-cluster-rs0-1.elog-plus-cluster-rs0.elog-plus.svc.cluster.local:27017


2024-02-21T12:00:28.399+0000 Mux close namespace elogs.mongockChangeLog
2024-02-21T12:00:28.399+0000 done dumping admin.pbmRRoles (2 documents)
2024-02-21T12:00:28.399+0000 Mux close namespace admin.pbmRRoles
2024-02-21T12:00:28.399+0000 done dumping admin.pbmPITRChunks (358 documents)
2024-02-21T12:00:28.399+0000 Mux close namespace admin.pbmPITRChunks
2024/02/21 12:00:28 [entrypoint] pbm-agent exited with code -1
2024/02/21 12:00:28 [entrypoint] restart in 5 sec
2024/02/21 12:00:33 [entrypoint] starting pbm-agent

This is my MongoDB deployment for the operator:

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: elog-plus-cluster
  finalizers:
    - delete-psmdb-pods-in-order
spec:
  crVersion: 1.15.0
  image: percona/percona-server-mongodb:6.0.12-9
  imagePullPolicy: Always
  allowUnsafeConfigurations: false
  updateStrategy: SmartUpdate
  upgradeOptions:
    versionServiceEndpoint: https://check.percona.com
    apply: Disabled
    schedule: "0 2 * * *"
    setFCV: false
  secrets:
    users: mongodb-secret
    encryptionKey: elog-plus-cluster-encryption-key
#    vault: my-cluster-name-vault
  pmm:
    enabled: false
    image: percona/pmm-client:2.39.0
    serverHost: monitoring-service
#    mongodParams: --environment=ENVIRONMENT
#    mongosParams: --environment=ENVIRONMENT
  replsets:

  - name: rs0
    size: 3
    affinity:
      antiAffinityTopologyKey: "kubernetes.io/hostname"
    annotations:
      prometheus.io/scrape: 'true'
      prometheus.io/port: '9216'
      prometheus.io/path: '/metrics'

    sidecars:
    - image: percona/mongodb_exporter:2.37.0
      name: mongodb-exporter
      args: ["--compatible-mode", "--discovering-mode", "--collector.diagnosticdata", "--collector.replicasetstatus","--collector.dbstats", "--collector.topmetrics", "--collector.indexstats", "--mongodb.uri=$(MONGODB_URI)", "--web.listen-address=$(POD_IP):9216"]
      env:
      - name: EXPORTER_USER
        valueFrom:
          secretKeyRef:
            name: mongodb-secret
            key: MONGODB_CLUSTER_MONITOR_USER
      - name: EXPORTER_PASS
        valueFrom:
          secretKeyRef:
            name: mongodb-secret
            key: MONGODB_CLUSTER_MONITOR_PASSWORD
      - name: POD_IP
        valueFrom:
          fieldRef:
            fieldPath: status.podIP
      - name: MONGODB_URI
        value: "mongodb://$(EXPORTER_USER):$(EXPORTER_PASS)@$(POD_IP)/?replicaSet=rs0&authMechanism=SCRAM-SHA-256"
    podDisruptionBudget:
      maxUnavailable: 1
#      minAvailable: 0
    expose:
      enabled: false
      exposeType: ClusterIP
    resources:
      limits:
        cpu: "2"
        memory: "2G"
      requests:
        cpu: "300m"
        memory: "0.5G"
    volumeSpec:
      persistentVolumeClaim:
        resources:
          requests:
            storage: 100Gi

    nonvoting:
      enabled: false
      size: 3
      affinity:
        antiAffinityTopologyKey: "kubernetes.io/hostname"
      podDisruptionBudget:
        maxUnavailable: 1
#        minAvailable: 0
      resources:
        limits:
          cpu: "300m"
          memory: "0.5G"
        requests:
          cpu: "300m"
          memory: "0.5G"
      volumeSpec:
        persistentVolumeClaim:
          resources:
            requests:
              storage: 10Gi
    arbiter:
      enabled: false
      size: 1
      affinity:
        antiAffinityTopologyKey: "kubernetes.io/hostname"

  sharding:
    enabled: false
    configsvrReplSet:
      size: 3
      affinity:
        antiAffinityTopologyKey: "kubernetes.io/hostname"
      podDisruptionBudget:
        maxUnavailable: 1
      expose:
        enabled: false
        exposeType: ClusterIP
      resources:
        limits:
          cpu: "300m"
          memory: "0.5G"
        requests:
          cpu: "300m"
          memory: "0.5G"
      volumeSpec:
        persistentVolumeClaim:
          resources:
            requests:
              storage: 3Gi

    mongos:
      size: 3
      affinity:
        antiAffinityTopologyKey: "kubernetes.io/hostname"
      podDisruptionBudget:
        maxUnavailable: 1
      resources:
        limits:
          cpu: "300m"
          memory: "0.5G"
        requests:
          cpu: "300m"
          memory: "0.5G"
      expose:
        exposeType: ClusterIP

  backup:
    enabled: true
    image: percona/percona-backup-mongodb:2.3.1
    serviceAccountName: percona-server-mongodb-operator
    resources:
      limits:
        cpu: "300m"
        memory: "0.5G"
      requests:
        cpu: "300m"
        memory: "0.5G"
    storages:
      s3-tid:
        type: s3
        s3:
          bucket: accel-webapp-dev-elog-backup
          credentialsSecret: s3-backup-secret
          region: us-west-1
          prefix: ""
          uploadPartSize: 10485760
          maxUploadParts: 10000
          storageClass: STANDARD
          endpointUrl: https://s3dfrgw.slac.stanford.edu
    pitr:
      enabled: true
      oplogSpanMin: 20
      compressionType: gzip
      compressionLevel: 6
    tasks:
      - name: daily-s3-tid
        enabled: true
        schedule: "0 */12 * * *"
        keep: 3
        storageName: s3-tid
        compressionType: gzip
        compressionLevel: 6
#      - name: weekly-s3-us-west
#        enabled: false
#        schedule: "0 0 * * 0"
#        keep: 5
#        storageName: s3-us-west
#        compressionType: gzip
#        compressionLevel: 6
#      - name: weekly-s3-us-west-physical
#        enabled: false
#        schedule: "0 5 * * 0"
#        keep: 5
#        type: physical
#        storageName: s3-us-west
#        compressionType: gzip
#        compressionLevel: 6

This is the status of one of the backups that ended in error:

Status:
 Destination:      s3://accel-webapp-dev-elog-backup/2024-02-21T12:00:21Z
 Error:            some of pbm-agents were lost during the backup
 Last Transition:  2024-02-21T12:01:01Z
 Pbm Name:         2024-02-21T12:00:21Z
 Pbm Pod:          elog-plus-cluster-rs0-2
 Replset Names:
  rs0
 s3:
  Bucket:              accel-webapp-dev-elog-backup
  Credentials Secret:  s3-backup-secret
  Endpoint URL:        https://s3dfrgw.slac.stanford.edu
  Max Upload Parts:    10000
  Region:              us-west-1
  Server Side Encryption:
  Storage Class:       STANDARD
  Upload Part Size:    10485760

I got this error on the agent that performs the backup:

2024-02-21T18:25:10.000+0000 I got command backup [name: 2024-02-21T18:25:09Z, compression: gzip (level: default)] <ts: 1708539909>
2024-02-21T18:25:10.000+0000 I got epoch {1708539898 8}
2024-02-21T18:25:10.000+0000 I [backup/2024-02-21T18:25:09Z] backup started
2024-02-21T18:25:13.000+0000 D [backup/2024-02-21T18:25:09Z] wait for tmp users {1708539913 13}
2024-02-21T18:25:13.601+0000    Setting num cpus to 60
2024/02/21 18:25:13 [entrypoint] `pbm-agent` exited with code -1
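
In both agent log excerpts the entrypoint reports pbm-agent exiting with code -1 a few seconds after the backup starts. To check whether the same pattern shows up in the logs of all three agents, something like the following Python sketch could be run over the saved log text. It is only an illustration: the function name find_agent_exits is hypothetical, the regular expressions assume nothing beyond the line formats visible in the excerpts above, and it assumes both timestamp styles are in the same time zone.

```python
import re
from datetime import datetime

# One pattern for the two line fragments of interest, so that entries fused
# onto a single physical line are still found in their original order.
TOKEN_RE = re.compile(
    r"\[backup/(?P<backup>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z)\]"
    r"|(?P<exit_ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[entrypoint\] "
    r"`?pbm-agent`? exited with code (?P<code>-?\d+)"
)


def find_agent_exits(log_text):
    """Return (backup_name, exit_time, exit_code, seconds_after_backup_name).

    backup_name is the most recent "[backup/<name>]" tag seen before the exit
    line, or None if the agent exited before any backup tag appeared.
    """
    events = []
    last_backup = None
    for match in TOKEN_RE.finditer(log_text):
        if match.group("backup"):
            last_backup = match.group("backup")
            continue
        exit_ts = match.group("exit_ts")
        delay = None
        if last_backup is not None:
            # Assumption: the backup name and the entrypoint timestamp use
            # the same time zone (the excerpts suggest UTC for both).
            started = datetime.strptime(last_backup, "%Y-%m-%dT%H:%M:%SZ")
            exited = datetime.strptime(exit_ts, "%Y/%m/%d %H:%M:%S")
            delay = (exited - started).total_seconds()
        events.append((last_backup, exit_ts, int(match.group("code")), delay))
    return events


# Sample input copied from the last excerpt above.
sample = """\
2024-02-21T18:25:10.000+0000 I [backup/2024-02-21T18:25:09Z] backup started
2024-02-21T18:25:13.000+0000 D [backup/2024-02-21T18:25:09Z] wait for tmp users {1708539913 13}
2024-02-21T18:25:13.601+0000    Setting num cpus to 60
2024/02/21 18:25:13 [entrypoint] `pbm-agent` exited with code -1
"""

for event in find_agent_exits(sample):
    print(event)
```

The script only lists when each agent exited relative to the backup it was working on. It does not explain why the agent exited.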