Unable to Start MongoDB After Resources Update, Possible Data Corruption

Description:

We’ve been using the Percona Operator for MongoDB for many months now with success: no issues, and the backup/restore process working well. Yesterday we tried to increase the resources of our MongoDB server, but once the configuration change was applied, MongoDB simply fails to start.

We’re currently trying to restore a backup, but this issue is really worrying if we’re to continue using the operator. I hope this was some error on our part.

We also tried booting the server up without sharding, with no luck, and of course tried letting it run for 15–20 minutes, also without luck.
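To spell out the "without sharding" attempt: that was just toggling the sharding flag in the same CR (sketch below, using the psmdb.percona.com/v1 layout shown in the full config further down):

spec:
  sharding:
    enabled: false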

Steps to Reproduce:

This has never happened before; this is the first time we’re seeing it. We’ve updated resources and scaled up the disk many times without any issues. I’ve attached our configuration YAML (the change was to increase CPU from 4 vCPUs to 6, and memory from 8 GB to 12 GB).

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: mongodb
  namespace: mongodb
spec:
  crVersion: 1.19.0
  image: percona/percona-server-mongodb:8.0.4-1-multi
  tls:
    mode: disabled
  unsafeFlags:
    tls: true
    replsetSize: true
    mongosSize: true
  upgradeOptions:
    apply: disabled
    schedule: "0 2 * * *"
  secrets:
    users: mongodb
  replsets:
  - name: rs0
    size: 1
    affinity:
      antiAffinityTopologyKey: 'none'
    # if using an even number of nodes, set arbiter to true
    arbiter:
      enabled: false
      size: 1
    configuration: |
      replication:
        # ~25 GB
        oplogSizeMB: 25800
      operationProfiling:
        slowOpThresholdMs: 10000
        mode: slowOp
        rateLimit: 100
    podDisruptionBudget:
      maxUnavailable: 1
    resources:
      limits:
        cpu: '6'
        memory: '12Gi'
      requests:
        cpu: '4'
        memory: '8Gi'
    volumeSpec:
      persistentVolumeClaim:
        resources:
          requests:
            storage: 550Gi
        storageClassName: gp2-xfs
  sharding:
    enabled: true

    configsvrReplSet:
      size: 1
      resources:
        limits:
          cpu: '2'
          memory: 2048Mi
        requests:
          cpu: '1'
          memory: 1536Mi
      volumeSpec:
        persistentVolumeClaim:
          resources:
            requests:
              storage: 25Gi
          storageClassName: gp2-xfs

    mongos:
      size: 1
  users:
  - name: <user>
    db: <db>
    passwordSecretRef:
      name: password
      key: password
    roles:
      - name: dbOwner
        db: <db>
      - name: read
        db: local
      - name: read
        db: config
      - name: clusterMonitor
        db: admin
      - name: readAnyDatabase
        db: admin
  backup:
    enabled: false
    image: percona/percona-backup-mongodb:2.8.0-multi
    pitr:
      enabled: true
      oplogOnly: true
    storages:
      s3-bkp:
        type: s3
        s3:
          bucket: <bucket-name>
          region: ap-southeast-1
          credentialsSecret: <name>
    resources:
      limits:
        cpu: 1
        memory: 2Gi
      requests:
        cpu: 500m
        memory: 1Gi
    tasks:
    - name: "backup-task"
      enabled: true
      # At 0 minutes past the hour, every 12 hours, starting at 10:00 AM
      # https://crontab.cronhub.io
      schedule: "0 10/12 * * *"
      keep: 8
      type: physical
      storageName: s3-bkp

Version:

1.19.0

Logs:

mongo-logs.txt (653.1 KB)

Hi @Adhiraj_Singh, I think you have a lot of data, and you need to increase the termination grace period via the terminationGracePeriodSeconds option to be safe the next time you restart your cluster. Please see the related Jira ticket.
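For reference, here is a minimal sketch of where that option goes in the CR (terminationGracePeriodSeconds is available for replsets, and also for configsvrReplSet and mongos under sharding; the 300-second value below is illustrative, not a recommendation):

replsets:
- name: rs0
  # seconds Kubernetes waits for a clean mongod shutdown before force-killing the pod;
  # larger datasets generally need longer to flush and checkpoint on shutdown
  terminationGracePeriodSeconds: 300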

I see, thank you. What value would you recommend for a 0.5 TB dataset? Also, what is the default value?