Percona MongoDB in crash loop

Description:

The primary node of the replica set has crashed for some reason and is unable to start. On startup it scans a collection to build an index, then receives signal 15 (SIGTERM) and is killed, after which it shuts down and repeats the whole process again.

We are running percona/percona-server-mongodb:6.0.9-7.
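
Since the pod runs under Kubernetes, I assume the SIGTERM comes from the kubelet (pod deletion, eviction, or a failed liveness probe) rather than from mongod itself. Roughly this kind of check (just a sketch; the pod name matches our rs0-0 member) shows the container's last termination reason and any retained events:

# Last termination reason of the mongod container (OOMKilled, Error, etc.)
kubectl describe pod perconamongodbcluster-rs0-0 -n perconamongodb | grep -A 8 'Last State'

# Kubernetes events for that pod, if any are still retained
kubectl get events -n perconamongodb --field-selector involvedObject.name=perconamongodbcluster-rs0-0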

Here are the mongod logs from the failing node:

{"t":{"$date":"2024-04-18T07:02:56.003+00:00"},"s":"I",  "c":"-",        "id":51773,   "ctx":"initandlisten","msg":"progress meter","attr":{"name":"Index Build: scanning collection","done":127049100,"total":234295627,"percent":54}}
{"t":{"$date":"2024-04-18T07:02:59.003+00:00"},"s":"I",  "c":"-",        "id":51773,   "ctx":"initandlisten","msg":"progress meter","attr":{"name":"Index Build: scanning collection","done":127904300,"total":234295627,"percent":54}}
{"t":{"$date":"2024-04-18T07:03:02.003+00:00"},"s":"I",  "c":"-",        "id":51773,   "ctx":"initandlisten","msg":"progress meter","attr":{"name":"Index Build: scanning collection","done":129831800,"total":234295627,"percent":55}}
{"t":{"$date":"2024-04-18T07:03:03.453+00:00"},"s":"I",  "c":"CONTROL",  "id":23377,   "ctx":"SignalHandler","msg":"Received signal","attr":{"signal":15,"error":"Terminated"}}
{"t":{"$date":"2024-04-18T07:03:03.453+00:00"},"s":"I",  "c":"CONTROL",  "id":23378,   "ctx":"SignalHandler","msg":"Signal was sent by kill(2)","attr":{"pid":0,"uid":0}}
{"t":{"$date":"2024-04-18T07:03:03.453+00:00"},"s":"I",  "c":"CONTROL",  "id":23381,   "ctx":"SignalHandler","msg":"will terminate after current cmd ends"}
{"t":{"$date":"2024-04-18T07:03:03.453+00:00"},"s":"I",  "c":"REPL",     "id":4784900, "ctx":"SignalHandler","msg":"Stepping down the ReplicationCoordinator for shutdown","attr":{"waitTimeMillis":15000}}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I",  "c":"REPL",     "id":4794602, "ctx":"SignalHandler","msg":"Attempting to enter quiesce mode"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I",  "c":"-",        "id":6371601, "ctx":"SignalHandler","msg":"Shutting down the FLE Crud thread pool"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I",  "c":"COMMAND",  "id":4784901, "ctx":"SignalHandler","msg":"Shutting down the MirrorMaestro"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I",  "c":"SHARDING", "id":4784902, "ctx":"SignalHandler","msg":"Shutting down the WaitForMajorityService"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I",  "c":"NETWORK",  "id":20562,   "ctx":"SignalHandler","msg":"Shutdown: going to close listening sockets"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I",  "c":"NETWORK",  "id":4784905, "ctx":"SignalHandler","msg":"Shutting down the global connection pool"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I",  "c":"CONTROL",  "id":4784906, "ctx":"SignalHandler","msg":"Shutting down the FlowControlTicketholder"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I",  "c":"-",        "id":20520,   "ctx":"SignalHandler","msg":"Stopping further Flow Control ticket acquisitions."}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I",  "c":"REPL",     "id":4784907, "ctx":"SignalHandler","msg":"Shutting down the replica set node executor"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I",  "c":"CONTROL",  "id":4784908, "ctx":"SignalHandler","msg":"Shutting down the PeriodicThreadToAbortExpiredTransactions"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I",  "c":"REPL",     "id":4784909, "ctx":"SignalHandler","msg":"Shutting down the ReplicationCoordinator"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I",  "c":"REPL",     "id":5074000, "ctx":"SignalHandler","msg":"Shutting down the replica set aware services."}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I",  "c":"REPL",     "id":5123006, "ctx":"SignalHandler","msg":"Shutting down PrimaryOnlyService","attr":{"service":"TenantMigrationDonorService","numInstances":0,"numOperationContexts":0}}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I",  "c":"REPL",     "id":5123006, "ctx":"SignalHandler","msg":"Shutting down PrimaryOnlyService","attr":{"service":"ShardSplitDonorService","numInstances":0,"numOperationContexts":0}}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I",  "c":"REPL",     "id":5123006, "ctx":"SignalHandler","msg":"Shutting down PrimaryOnlyService","attr":{"service":"TenantMigrationRecipientService","numInstances":0,"numOperationContexts":0}}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I",  "c":"REPL",     "id":21328,   "ctx":"SignalHandler","msg":"Shutting down replication subsystems"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"W",  "c":"REPL",     "id":21409,   "ctx":"SignalHandler","msg":"ReplicationCoordinatorImpl::shutdown() called before startup() finished. Shutting down without cleaning up the replication system"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I",  "c":"SHARDING", "id":4784910, "ctx":"SignalHandler","msg":"Shutting down the ShardingInitializationMongoD"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I",  "c":"REPL",     "id":4784911, "ctx":"SignalHandler","msg":"Enqueuing the ReplicationStateTransitionLock for shutdown"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I",  "c":"-",        "id":4784912, "ctx":"SignalHandler","msg":"Killing all operations for shutdown"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I",  "c":"-",        "id":4695300, "ctx":"SignalHandler","msg":"Interrupted all currently running operations","attr":{"opsKilled":4}}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I",  "c":"TENANT_M", "id":5093807, "ctx":"SignalHandler","msg":"Shutting down all TenantMigrationAccessBlockers on global shutdown"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I",  "c":"COMMAND",  "id":4784913, "ctx":"SignalHandler","msg":"Shutting down all open transactions"}
{"t":{"$date":"2024-04-18T07:03:03.455+00:00"},"s":"I",  "c":"REPL",     "id":4784914, "ctx":"SignalHandler","msg":"Acquiring the ReplicationStateTransitionLock for shutdown"}
{"t":{"$date":"2024-04-18T07:03:04.333+00:00"},"s":"I",  "c":"INDEX",    "id":4984704, "ctx":"initandlisten","msg":"Index build: collection scan stopped","attr":{"buildUUID":null,"collectionUUID":{"uuid":{"$uuid":"578cce8a-3d2f-49c3-8e7a-32c0447b3ffd"}},"totalRecords":131103284,"durationMillis":234000,"phase":"collection scan","collectionScanPosition":"131103284","readSource":"kNoTimestamp","error":{"code":11600,"codeName":"InterruptedAtShutdown","errmsg":"interrupted at shutdown"}}}
{"t":{"$date":"2024-04-18T07:03:04.333+00:00"},"s":"I",  "c":"STORAGE",  "id":22206,   "ctx":"initandlisten","msg":"Deferring table drop for index","attr":{"index":"_id_","namespace":"namespace.collection","uuid":{"uuid":{"$uuid":"578cce8a-3d2f-49c3-8e7a-32c0447b3ffd"}},"ident":"index-0-1349749316917450711","commitTimestamp":{"$timestamp":{"t":0,"i":0}}}}
{"t":{"$date":"2024-04-18T07:03:04.506+00:00"},"s":"E",  "c":"STORAGE",  "id":21021,   "ctx":"initandlisten","msg":"Could not build an _id index on collection","attr":{"namespace":"namespace.collection","uuid":{"uuid":{"$uuid":"578cce8a-3d2f-49c3-8e7a-32c0447b3ffd"}},"error":{"code":11600,"codeName":"InterruptedAtShutdown","errmsg":"collection scan stopped. totalRecords: 131103284; durationMillis: 234000ms; phase: collection scan; collectionScanPosition: RecordId(131103284); readSource: kNoTimestamp :: caused by :: interrupted at shutdown"}}}
{"t":{"$date":"2024-04-18T07:03:04.507+00:00"},"s":"F",  "c":"CONTROL",  "id":20573,   "ctx":"initandlisten","msg":"Wrong mongod version","attr":{"error":"UPGRADE PROBLEM: The data files need to be fully upgraded to version 4.4 before attempting a binary upgrade; see https://docs.mongodb.com/master/release-notes/5.0/#upgrade-procedures for more details."}}
{"t":{"$date":"2024-04-18T07:03:04.507+00:00"},"s":"I",  "c":"CONTROL",  "id":23139,   "ctx":"initandlisten","msg":"Conflicting exit code at shutdown","attr":{"originalExitCode":0,"newExitCode":62}}
{"t":{"$date":"2024-04-18T07:03:04.507+00:00"},"s":"I",  "c":"INDEX",    "id":4784915, "ctx":"SignalHandler","msg":"Shutting down the IndexBuildsCoordinator"}
{"t":{"$date":"2024-04-18T07:03:04.507+00:00"},"s":"I",  "c":"NETWORK",  "id":4784918, "ctx":"SignalHandler","msg":"Shutting down the ReplicaSetMonitor"}
{"t":{"$date":"2024-04-18T07:03:04.507+00:00"},"s":"I",  "c":"SHARDING", "id":4784921, "ctx":"SignalHandler","msg":"Shutting down the MigrationUtilExecutor"}
{"t":{"$date":"2024-04-18T07:03:04.507+00:00"},"s":"I",  "c":"ASIO",     "id":22582,   "ctx":"MigrationUtil-TaskExecutor","msg":"Killing all outstanding egress activity."}
{"t":{"$date":"2024-04-18T07:03:04.507+00:00"},"s":"I",  "c":"COMMAND",  "id":4784923, "ctx":"SignalHandler","msg":"Shutting down the ServiceEntryPoint"}
{"t":{"$date":"2024-04-18T07:03:04.507+00:00"},"s":"I",  "c":"CONTROL",  "id":4784925, "ctx":"SignalHandler","msg":"Shutting down free monitoring"}
{"t":{"$date":"2024-04-18T07:03:04.507+00:00"},"s":"I",  "c":"CONTROL",  "id":4784927, "ctx":"SignalHandler","msg":"Shutting down the HealthLog"}
{"t":{"$date":"2024-04-18T07:03:04.507+00:00"},"s":"I",  "c":"CONTROL",  "id":4784928, "ctx":"SignalHandler","msg":"Shutting down the TTL monitor"}
{"t":{"$date":"2024-04-18T07:03:04.507+00:00"},"s":"I",  "c":"CONTROL",  "id":6278511, "ctx":"SignalHandler","msg":"Shutting down the Change Stream Expired Pre-images Remover"}
{"t":{"$date":"2024-04-18T07:03:04.507+00:00"},"s":"I",  "c":"CONTROL",  "id":4784929, "ctx":"SignalHandler","msg":"Acquiring the global lock for shutdown"}
{"t":{"$date":"2024-04-18T07:03:04.507+00:00"},"s":"I",  "c":"CONTROL",  "id":4784930, "ctx":"SignalHandler","msg":"Shutting down the storage engine"}
{"t":{"$date":"2024-04-18T07:03:04.508+00:00"},"s":"I",  "c":"STORAGE",  "id":22320,   "ctx":"SignalHandler","msg":"Shutting down journal flusher thread"}
{"t":{"$date":"2024-04-18T07:03:04.508+00:00"},"s":"I",  "c":"STORAGE",  "id":22321,   "ctx":"SignalHandler","msg":"Finished shutting down journal flusher thread"}
{"t":{"$date":"2024-04-18T07:03:04.508+00:00"},"s":"I",  "c":"STORAGE",  "id":22322,   "ctx":"SignalHandler","msg":"Shutting down checkpoint thread"}
{"t":{"$date":"2024-04-18T07:03:04.508+00:00"},"s":"I",  "c":"STORAGE",  "id":22323,   "ctx":"SignalHandler","msg":"Finished shutting down checkpoint thread"}
{"t":{"$date":"2024-04-18T07:03:04.508+00:00"},"s":"I",  "c":"STORAGE",  "id":20282,   "ctx":"SignalHandler","msg":"Deregistering all the collections"}
{"t":{"$date":"2024-04-18T07:03:04.508+00:00"},"s":"I",  "c":"STORAGE",  "id":22372,   "ctx":"OplogVisibilityThread","msg":"Oplog visibility thread shutting down."}
{"t":{"$date":"2024-04-18T07:03:04.508+00:00"},"s":"I",  "c":"STORAGE",  "id":22317,   "ctx":"SignalHandler","msg":"WiredTigerKVEngine shutting down"}
{"t":{"$date":"2024-04-18T07:03:04.508+00:00"},"s":"I",  "c":"STORAGE",  "id":22318,   "ctx":"SignalHandler","msg":"Shutting down session sweeper thread"}
{"t":{"$date":"2024-04-18T07:03:04.508+00:00"},"s":"I",  "c":"STORAGE",  "id":22319,   "ctx":"SignalHandler","msg":"Finished shutting down session sweeper thread"}
{"t":{"$date":"2024-04-18T07:03:04.515+00:00"},"s":"I",  "c":"STORAGE",  "id":4795902, "ctx":"SignalHandler","msg":"Closing WiredTiger","attr":{"closeConfig":"leak_memory=true,"}}
{"t":{"$date":"2024-04-18T07:03:04.886+00:00"},"s":"I",  "c":"STORAGE",  "id":4795901, "ctx":"SignalHandler","msg":"WiredTiger closed","attr":{"durationMillis":371}}
{"t":{"$date":"2024-04-18T07:03:05.156+00:00"},"s":"I",  "c":"STORAGE",  "id":22279,   "ctx":"SignalHandler","msg":"shutdown: removing fs lock..."}
{"t":{"$date":"2024-04-18T07:03:05.169+00:00"},"s":"I",  "c":"-",        "id":4784931, "ctx":"SignalHandler","msg":"Dropping the scope cache for shutdown"}
{"t":{"$date":"2024-04-18T07:03:05.169+00:00"},"s":"I",  "c":"CONTROL",  "id":20565,   "ctx":"SignalHandler","msg":"Now exiting"}
{"t":{"$date":"2024-04-18T07:03:05.170+00:00"},"s":"I",  "c":"CONTROL",  "id":23138,   "ctx":"SignalHandler","msg":"Shutting down","attr":{"exitCode":0}}

As you can see from the logs, there is one interesting line:

{"t":{"$date":"2024-04-18T07:03:04.507+00:00"},"s":"F",  "c":"CONTROL",  "id":20573,   "ctx":"initandlisten","msg":"Wrong mongod version","attr":{"error":"UPGRADE PROBLEM: The data files need to be fully upgraded to version 4.4 before attempting a binary upgrade; see https://docs.mongodb.com/master/release-notes/5.0/#upgrade-procedures for more details."}}

I am not sure why we get this error, but perhaps the database was in the process of upgrading before the crash?
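
As a side check (a sketch; run from one of the healthy pods, auth/TLS flags omitted), this shows which feature compatibility version the surviving members report:

# On a fully upgraded 6.0 cluster this should print featureCompatibilityVersion: { version: '6.0' }
mongosh --eval 'db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 })'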

Here is also the output of kubectl get psmdb perconamongodbcluster -o yaml -n perconamongodb:

  - lastTransitionTime: "2024-04-18T02:07:46Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2024-04-18T02:08:02Z"
    status: "True"
    type: ready
  - lastTransitionTime: "2024-04-18T05:14:29Z"
    message: 'create pbm object: create PBM connection to perconamongodbcluster-rs0-0.perconamongodbcluster-rs0.perconamongodb.svc.cluster.local:27017,perconamongodbcluster-rs0-1.perconamongodbcluster-rs0.perconamongodb.svc.cluster.local:27017,perconamongodbcluster-rs0-2.perconamongodbcluster-rs0.perconamongodb.svc.cluster.local:27017:
      setup a new backups db: ensure cmd collection: connection(perconamongodbcluster-rs0-0.perconamongodbcluster-rs0.perconamongodb.svc.cluster.local:27017[-269065])
      socket was unexpectedly closed: EOF'
    reason: ErrorReconcile
    status: "True"
    type: error
  - lastTransitionTime: "2024-04-18T05:14:38Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2024-04-18T05:17:12Z"
    message: 'rs0: ready'
    reason: RSReady
    status: "True"
    type: ready
  - lastTransitionTime: "2024-04-18T05:17:12Z"
    status: "True"
    type: initializing
  host: perconamongodbcluster-rs0.perconamongodb.svc.cluster.local
  mongoImage: percona/percona-server-mongodb:6.0.9-7
  mongoVersion: 6.0.9-7
  observedGeneration: 64
  pmmVersion: 2.39.0
  ready: 2
  replsets:
    rs0:
      initialized: true
      ready: 2
      size: 3
      status: initializing
  size: 3
  state: initializing

This contains an interesting state transition that I was not able to figure out. Maybe you could help me understand why this happened? This is the exact time when the problems with the database started.

Hi @Gvidas_Pranauskas, first of all we need to understand what was done with your DB cluster.
Did you perform a major update? Or did you maybe restore a backup to your PSMDB cluster? Could you please provide more detailed information?

Hi @Slava_Sarzhan, in fact I did not change anything related to the version. I did not try to perform a major update of the database.

Interestingly enough, I have just checked the history of backups:

NAME                                         CLUSTER                 STORAGE      DESTINATION                                   TYPE       STATUS   COMPLETED   AGE
backup-2024-03-03                            perconamongodbcluster   azure-blob   azure://mongodb-backup/2024-03-03T02:24:42Z   physical   ready    46d         46d
cron-perconamongodbcl-20240309000000-ddst6   perconamongodbcluster   azure-blob   azure://mongodb-backup/2024-03-09T00:00:21Z   logical    ready    40d         40d
cron-perconamongodbcl-20240316000000-2gpk8   perconamongodbcluster   azure-blob   azure://mongodb-backup/2024-03-16T00:00:21Z   logical    error                33d
cron-perconamongodbcl-20240323000000-g9mhn   perconamongodbcluster   azure-blob   azure://mongodb-backup/2024-03-23T00:00:21Z   logical    error                26d
cron-perconamongodbcl-20240330050000-g8qtl   perconamongodbcluster   azure-blob   azure://mongodb-backup/2024-03-30T05:00:21Z   logical    ready    18d         19d
cron-perconamongodbcl-20240401000000-5s8l7   perconamongodbcluster   azure-blob   azure://mongodb-backup/2024-04-01T00:00:21Z   physical   ready    17d         17d
cron-perconamongodbcl-20240406050000-sn4zg   perconamongodbcluster   azure-blob                                                            error                12d
cron-perconamongodbcl-20240413050000-kjgp5   perconamongodbcluster   azure-blob                                                            error                5d2h
initial-data-backup                          perconamongodbcluster   azure-blob   azure://mongodb-backup/2024-02-18T14:41:54Z   physical   ready    59d         59d
logical-backup-2024-03-23                    perconamongodbcluster   azure-blob   azure://mongodb-backup/2024-03-23T14:15:53Z   logical    ready    25d         25d

I have never restored a backup on my PSMDB cluster, so I have never had a need to check the backup history. However, it seems that the backups are not always successful; I have no idea why they might have failed.
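
To dig into why they failed, something along these lines should show the error PBM recorded (a sketch; the backup object name is taken from the list above, and I assume the usual backup-agent sidecar container name):

# Error message stored on the failed backup object
kubectl get psmdb-backup cron-perconamongodbcl-20240413050000-kjgp5 -n perconamongodb -o yaml

# PBM agent logs from one of the replica set pods
kubectl logs perconamongodbcluster-rs0-1 -c backup-agent -n perconamongodb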

@Gvidas_Pranauskas Was only one pod restarted, or were all pods restarted in your cluster?

Also, try to check your restoration objects to be sure that you did not have any restores:

kubectl get psmdb-restore

Try to check the events to understand what is going on with your deployment:
kubectl get events --sort-by=.metadata.creationTimestamp

Only one pod, which was initially the primary, got restarted. The two remaining pods took over - one of them became primary - and keep running fine, serving all operations.
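
For completeness, here is roughly how I confirm the member states from one of the healthy pods (a sketch; container name and auth/TLS flags may need adjusting for our setup):

kubectl exec -it perconamongodbcluster-rs0-1 -n perconamongodb -c mongod -- \
  mongosh --quiet --eval 'rs.status().members.forEach(m => print(m.name, m.stateStr))'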

I have checked the restoration objects:

NAME       CLUSTER                 STATUS   AGE
restore1   perconamongodbcluster   error    46d

Now that I think about it, that restore was done a while ago, just as a test case. Could it have interfered somehow?
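
The error recorded for that restore should still be visible on the object itself (a sketch):

kubectl get psmdb-restore restore1 -n perconamongodb -o yaml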

Regarding the events, nothing is returned in the namespace of the Percona MongoDB deployment:

kubectl events -n perconamongodb
No events found in perconamongodb namespace.

Overall, it seems that the database somehow crashed and is not coming back up. My plan is to delete the PVC of the failed node, copy the data from the working nodes, and start the database again. However, this problem keeps recurring for us; I have gone through this process more than 5 times already. The database does not seem to be stable when 3 nodes are running in a Primary-Secondary-Secondary configuration. Whenever one node goes down, the two remaining nodes keep running fine and without any problems.

Might this be related to the heavy load we put on MongoDB? Each node runs on a 32GB RAM instance, and it gets pretty close to the memory limit. Our dataset is also quite big; some collections have more than 10 million documents. Should I try to disable SmartUpdate?
This is the current configuration we are using for update and upgrade options:

updateStrategy: SmartUpdate
upgradeOptions:
  versionServiceEndpoint: https://check.percona.com
  apply: Recommended
  schedule: "0 2 * * *"
  setFCV: false
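
On the memory question above: with cacheSizeRatio: 0.8 against a 25G limit, the WiredTiger cache alone can take most of the container's memory. A rough way to compare cache usage against its configured maximum (a sketch; the field names come from serverStatus, auth/TLS flags omitted):

kubectl exec -it perconamongodbcluster-rs0-1 -n perconamongodb -c mongod -- \
  mongosh --quiet --eval 'const c = db.serverStatus().wiredTiger.cache;
    print("bytes currently in cache :", c["bytes currently in the cache"]);
    print("maximum bytes configured :", c["maximum bytes configured"])'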

Can you provide the full CR for us? Also, please get the PSMDB image that is used by the pods that were not restarted:

kubectl get pods <pod_name> -o yaml
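
A shorter variant that prints the image of every pod in the namespace in one go (just a sketch):

kubectl get pods -n perconamongodb \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'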

And please check the operator's logs.

PSMDB image version of the pod that has been running without interruption:

image: percona/percona-server-mongodb:6.0.9-7

full CR:

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: perconamongodbcluster
  finalizers:
    - delete-psmdb-pods-in-order
spec:
  pause: false
  pmm:
    enabled: true
    image: percona/pmm-client:2.41.1
    serverHost: monitoring-service
  crVersion: 1.15.0
  image: percona/percona-server-mongodb:6.0.9-7
  imagePullPolicy: Always
  tls:
    certValidityDuration: 2160h
  allowUnsafeConfigurations: false
  updateStrategy: SmartUpdate
  upgradeOptions:
    versionServiceEndpoint: https://check.percona.com
    apply: Recommended
    schedule: "0 2 * * *"
    setFCV: false
  secrets:
    users: perconamongodbcluster-secrets
    encryptionKey: perconamongodbcluster-mongodb-encryption-key
  replsets:
  - name: rs0
    size: 3
    topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: percona-server-mongodb
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
    nodeSelector:
      agentpool: databaseiops
    affinity:
      antiAffinityTopologyKey: "kubernetes.io/hostname"
    tolerations:
    - key: "node.alpha.kubernetes.io/unreachable"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 6000
    storage:
      engine: wiredTiger
      wiredTiger:
        engineConfig:
          cacheSizeRatio: 0.8
    podDisruptionBudget:
      maxUnavailable: 1
    expose:
      enabled: false
    resources:
      limits:
        cpu: "2500m"
        memory: "25G"
      requests:
        cpu: "1000m"
        memory: "10G"
    livenessProbe:
      exec:
        command:
          - /opt/percona/mongodb-healthcheck
          - k8s
          - liveness
          - '--ssl'
          - '--sslInsecure'
          - '--sslCAFile'
          - /etc/mongodb-ssl/ca.crt
          - '--sslPEMKeyFile'
          - /tmp/tls.pem
          - '--startupDelaySeconds'
          - '7200'
      initialDelaySeconds: 120
      timeoutSeconds: 80
      periodSeconds: 120
      successThreshold: 1
      failureThreshold: 32
    readinessProbe:
      exec:
        command:
          - /opt/percona/mongodb-healthcheck
          - k8s
          - readiness
          - '--component'
          - mongod
      initialDelaySeconds: 90
      timeoutSeconds: 60
      periodSeconds: 30
      successThreshold: 1
      failureThreshold: 24
    volumeSpec:
      persistentVolumeClaim:
        storageClassName: mongodb-premium-storageclass
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 998Gi
  backup:
    enabled: true
    image: percona/percona-backup-mongodb:2.3.1
    serviceAccountName: percona-server-mongodb-operator
    resources:
      limits:
        cpu: "1500m"
        memory: "2G"
      requests:
        cpu: "500m"
        memory: "0.5G"
    storages:
      azure-blob:
        type: azure
        azure:
          container: mongodb-backup
          credentialsSecret: perconamongodb-cluster-azure-secrets
    tasks:
    - name: "monthly-backup"
      enabled: true
      schedule: "0 0 1 * *"
      keep: 3
      type: physical
      storageName: azure-blob
    - name: "weekly-backup"
      enabled: true
      schedule: "0 5 * * 6"
      keep: 4
      type: logical
      storageName: azure-blob
    pitr:
      enabled: true

In the operator logs, this can be found:

2024-04-18T09:05:02.412Z        INFO    StatefulSet is changed, starting smart update   {"controller": "psmdb-controller", "object": {"name":"perconamongodbcluster","namespace":"perconamongodb"}, "namespace": "perconamongodb", "name": "perconamongodbcluster", "reconcileID": "42d9d6a6-6862-42d2-9ae6-4aea4d2d13da", "name": "perconamongodbcluster-rs0"}
2024-04-18T09:05:02.412Z        INFO    can't start/continue 'SmartUpdate': waiting for all replicas are ready  {"controller": "psmdb-controller", "object": {"name":"perconamongodbcluster","namespace":"perconamongodb"}, "namespace": "perconamongodb", "name": "perconamongodbcluster", "reconcileID": "42d9d6a6-6862-42d2-9ae6-4aea4d2d13da"}
2024-04-18T09:05:02.503Z        INFO    StatefulSet is not up to date   {"controller": "psmdb-controller", "object": {"name":"perconamongodbcluster","namespace":"perconamongodb"}, "namespace": "perconamongodb", "name": "perconamongodbcluster", "reconcileID": "42d9d6a6-6862-42d2-9ae6-4aea4d2d13da", "sts": "perconamongodbcluster-rs0"}
2024-04-18T09:05:02.503Z        INFO    StatefulSet is not up to date   {"controller": "psmdb-controller", "object": {"name":"perconamongodbcluster","namespace":"perconamongodb"}, "namespace": "perconamongodb", "name": "perconamongodbcluster", "reconcileID": "42d9d6a6-6862-42d2-9ae6-4aea4d2d13da", "sts": "perconamongodbcluster-rs0"}

I have edited the StatefulSet YAML directly, trying to increase the livenessProbe and readinessProbe thresholds, since I guessed the problem was there. Perhaps this could have made the StatefulSet not up to date. But still, it correctly reports the rs0-0 pod as not healthy.
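
To see whether my manual edits left the StatefulSet with a pending revision, something like this should work (a sketch; names taken from the operator log above):

# If updateRevision differs from currentRevision, the rollout has not finished
kubectl get sts perconamongodbcluster-rs0 -n perconamongodb \
  -o jsonpath='current={.status.currentRevision}{"\n"}update={.status.updateRevision}{"\n"}'

kubectl rollout status statefulset/perconamongodbcluster-rs0 -n perconamongodb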

EDIT: I had been thinking about changing the database image to help debug what's going on, but I never did that. The CR I copied above contained this change, but it was never applied; I have removed it with this edit.

The operator logs do not report the rs0-0 pod as unhealthy, but rather the rs0 StatefulSet as being out of date - my bad.

Did you create this DB cluster using 6.0.9-7 from the beginning? Because this message is really strange:

{"t":{"$date":"2024-04-18T07:03:04.507+00:00"},"s":"F",  "c":"CONTROL",  "id":20573,   "ctx":"initandlisten","msg":"Wrong mongod version","attr":{"error":"UPGRADE PROBLEM: The data files need to be fully upgraded to version 4.4 before attempting a binary upgrade; see https://docs.mongodb.com/master/release-notes/5.0/#upgrade-procedures for more details."}}

It suggests that you may even have had PSMDB 4.2 at some point, and that the update was not completed normally. Do you know anything about it?

I cloned this repo locally at the v1.15.0 tag and took the configuration as-is from the deploy/cr.yaml file: percona-server-mongodb-operator/deploy/cr.yaml at d938d4bc38bcc40a7cefc6627945550893fd17d8 · percona/percona-server-mongodb-operator · GitHub

I updated just the secrets, the cluster name, and the backups (Azure Blob Storage).

Hi,
do you perhaps have any ideas about what might have gone wrong?
I am interested in why MongoDB restarts itself in the first place, and why, after the restart, it fails to build the index and just enters a crash loop.