PSMDB operator fails to complete smart update

Description:

The PSMDB operator fails to complete a smart update, and 3 mongos pods and 2 cfg pods are left in a non-ready state.

Steps to Reproduce:

It happens on random occasions, when the operator wants to run a smart update.
Operator Logs:

2023-10-04T15:01:37.167Z	INFO	StatefulSet is not up to date	{"controller": "psmdb-controller", "object": {"name":"psmdb-db","namespace":"percona"}, "namespace": "percona", "name": "psmdb-db", "reconcileID": "73af19f4-7e82-4805-a3c9-9b56a741a0e0", "sts": "psmdb-db-mongos"}
2023-10-04T15:01:42.401Z	INFO	StatefulSet is changed, starting smart update	{"controller": "psmdb-controller", "object": {"name":"psmdb-db","namespace":"percona"}, "namespace": "percona", "name": "psmdb-db", "reconcileID": "b60264f8-fc2c-4ead-af8e-5eb22bc253e3", "name": "psmdb-db-mongos"}
2023-10-04T15:01:42.401Z	INFO	can't start/continue 'SmartUpdate': waiting for all replicas are ready	{"controller": "psmdb-controller", "object": {"name":"psmdb-db","namespace":"percona"}, "namespace": "percona", "name": "psmdb-db", "reconcileID": "b60264f8-fc2c-4ead-af8e-5eb22bc253e3"}
2023-10-04T15:01:42.401Z	INFO	StatefulSet is not up to date	{"controller": "psmdb-controller", "object": {"name":"psmdb-db","namespace":"percona"}, "namespace": "percona", "name": "psmdb-db", "reconcileID": "b60264f8-fc2c-4ead-af8e-5eb22bc253e3", "sts": "psmdb-db-mongos"}
2023-10-04T15:02:24.096Z	INFO	balancer enabled	{"controller": "psmdb-controller", "object": {"name":"psmdb-db","namespace":"percona"}, "namespace": "percona", "name": "psmdb-db", "reconcileID": "c05b58d0-058a-4bf1-9c3a-4ca866f1c154"}
2023-10-04T15:02:24.098Z	INFO	Cluster state changed	{"controller": "psmdb-controller", "object": {"name":"psmdb-db","namespace":"percona"}, "namespace": "percona", "name": "psmdb-db", "reconcileID": "c05b58d0-058a-4bf1-9c3a-4ca866f1c154", "previous": "initializing", "current": "ready"}

Version:

[Insert the version number of the software]

Logs:

2023-10-04T16:50:31+02:00 2023-10-04T14:50:31.704Z INFO doing step down... {"controller": "psmdb-controller", "object": {"name":"psmdb-db","namespace":"percona"}, "namespace": "percona", "name": "psmdb-db", "reconcileID": "b8484a79-d1d9-4ffd-843f-501082f88774", "force": false}

2023-10-04T16:50:56+02:00 2023-10-04T14:50:56.508Z ERROR Reconciler error {"controller": "psmdb-controller", "object": {"name":"psmdb-db","namespace":"percona"}, "namespace": "percona", "name": "psmdb-db", "reconcileID": "b8484a79-d1d9-4ffd-843f-501082f88774", "error": "reconcile StatefulSet for cfg: failed to run smartUpdate: failed to do step down: replSetStepDown: connection pool for psmdb-db-cfg-0.psmdb-db-cfg.percona.svc.cluster.local:27017 was cleared because another operation failed with: (KeyNotFound) No keys found for HMAC that is valid for time: { ts: Timestamp(1696431014, 2) } with id: 7241585499830222852", "errorVerbose": "reconcile StatefulSet for cfg: failed to run smartUpdate: failed to do step down: replSetStepDown: connection pool for psmdb-db-cfg-0.psmdb-db-cfg.percona.svc.cluster.local:27017 was cleared because another operation failed with: (KeyNotFound) No keys found for HMAC that is valid for time: { ts: Timestamp(1696431014, 2) } with id: 7241585499830222852\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).Reconcile\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/psmdb_controller.go:412\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:235\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594"}

Expected Result:

The PSMDB database functions normally.

Actual Result:

The PSMDB cluster and database are in a not-ready state.

This is the same issue, caused by the operator's smart update failure.

@Slavisa_Milojkovic hey, sorry. Missed this forum thread. Which version of the operator is that?

Smart Update not only manages software upgrades; it is also a way to safely rotate/restart/etc. the Pods (one by one).

Can you please share your custom resource manifest?

@Sergey_Pronin thanks for your reply. The operator is 1.14.1 and the CR is 1.14.0.

I understand, but why is it restarting the pods when nothing has changed (I’m not even sure when it will run), and smart update is disabled, as you can see in the values?

Here’s the values file:

# Default values for psmdb-cluster.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

# Platform type: kubernetes, openshift
# platform: kubernetes

# Cluster DNS Suffix
# clusterServiceDNSSuffix: svc.cluster.local
# clusterServiceDNSMode: "Internal"

finalizers:
## Set this if you want that operator deletes the primary pod last
  - delete-psmdb-pods-in-order
## Set this if you want to delete database persistent volumes on cluster deletion
#  - delete-psmdb-pvc

nameOverride: ""
fullnameOverride: ""

crVersion: 1.14.0
pause: false
unmanaged: false
allowUnsafeConfigurations: false
# ignoreAnnotations:
#   - service.beta.kubernetes.io/aws-load-balancer-backend-protocol
# ignoreLabels:
#   - rack
multiCluster:
  enabled: false
  # DNSSuffix: svc.clusterset.local
updateStrategy: SmartUpdate
upgradeOptions:
  versionServiceEndpoint: https://check.percona.com
  apply: disabled
  schedule: "0 2 * * *"
  setFCV: false

image:
  repository: percona/percona-server-mongodb
  tag: 6.0.4-3

imagePullPolicy: Always
# imagePullSecrets: []
# initImage:
#   repository: percona/percona-server-mongodb-operator
#   tag: 1.14.0
# initContainerSecurityContext: {}
# tls:
#   # 90 days in hours
#   certValidityDuration: 2160h
secrets:
  users: psmdb-users
  encryptionKey: my-vault
  # If you set users secret here the operator will use existing one or generate random values
  # If not set the operator generates the default secret with name <cluster_name>-secrets
  # users: my-cluster-name-secrets
  # encryptionKey: my-cluster-name-mongodb-encryption-key

pmm:
  enabled: true
  image:
    repository: percona/pmm-client
    tag: 2.35.0
  serverHost: monitoring-service

replsets:
  - name: rs0
    size: 3
    # externalNodes:
    # - host: 34.124.76.90
    # - host: 34.124.76.91
    #   port: 27017
    #   votes: 0
    #   priority: 0
    # - host: 34.124.76.92
    configuration: |
      security:
        enableEncryption: true
        encryptionCipherMode: "AES256-CBC"
        vault:
          serverName: vault.XXXX.XX
          port: 443
          tokenFile: /etc/mongodb-encryption/token
          secret: XXXX/rs0
          disableTLSForTesting: false
    #   operationProfiling:
    #     mode: slowOp
    #   systemLog:
    #     verbosity: 1
    antiAffinityTopologyKey: "kubernetes.io/hostname"
    # tolerations: []
    priorityClass: "system-node-critical"
    # annotations: {}
    # labels: {}
    # nodeSelector: {}
    # livenessProbe:
    #   failureThreshold: 4
    #   initialDelaySeconds: 60
    #   periodSeconds: 30
    #   timeoutSeconds: 10
    #   startupDelaySeconds: 7200
    # readinessProbe:
    #   failureThreshold: 8
    #   initialDelaySeconds: 10
    #   periodSeconds: 3
    #   successThreshold: 1
    #   timeoutSeconds: 2
    # runtimeClassName: image-rc
    # storage:
    #   engine: wiredTiger
    #   wiredTiger:
    #     engineConfig:
    #       cacheSizeRatio: 0.5
    #       directoryForIndexes: false
    #       journalCompressor: snappy
    #     collectionConfig:
    #       blockCompressor: snappy
    #     indexConfig:
    #       prefixCompression: true
    #   inMemory:
    #     engineConfig:
    #        inMemorySizeRatio: 0.5
    # sidecars:
    # - image: busybox
    #   command: ["/bin/sh"]
    #   args: ["-c", "while true; do echo echo $(date -u) 'test' >> /dev/null; sleep 5;done"]
    #   name: rs-sidecar-1
    #   volumeMounts:
    #     - mountPath: /volume1
    #       name: sidecar-volume-claim
    #     - mountPath: /secret
    #       name: sidecar-secret
    #     - mountPath: /configmap
    #       name: sidecar-config
    # sidecarVolumes:
    # - name: sidecar-secret
    #   secret:
    #     secretName: mysecret
    # - name: sidecar-config
    #   configMap:
    #     name: myconfigmap
    # sidecarPVCs:
    # - apiVersion: v1
    #   kind: PersistentVolumeClaim
    #   metadata:
    #     name: sidecar-volume-claim
    #   spec:
    #     resources:
    #       requests:
    #         storage: 1Gi
    #     volumeMode: Filesystem
    #     accessModes:
    #       - ReadWriteOnce
    podDisruptionBudget:
      maxUnavailable: 1
    expose:
      enabled: false
      exposeType: ClusterIP
      # loadBalancerSourceRanges:
      #   - 10.0.0.0/8
      # serviceAnnotations:
      #   service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
      # serviceLabels: 
      #   some-label: some-key
    nonvoting:
      enabled: false
      # podSecurityContext: {}
      # containerSecurityContext: {}
      size: 3
      configuration: |
        security:
          enableEncryption: true
          encryptionCipherMode: "AES256-CBC"
          vault:
            serverName: vault.XXXXX.XX
            port: 443
            tokenFile: /etc/mongodb-encryption/token
            secret: XXXXX/rs0
            disableTLSForTesting: false
      #   operationProfiling:
      #     mode: slowOp
      #   systemLog:
      #     verbosity: 1
      antiAffinityTopologyKey: "kubernetes.io/hostname"
      # tolerations: []
      # priorityClass: ""
      # annotations: {}
      # labels: {}
      # nodeSelector: {}
      podDisruptionBudget:
        maxUnavailable: 1
      resources:
        limits:
          cpu: "300m"
          memory: "0.5G"
        requests:
          cpu: "300m"
          memory: "0.5G"
      volumeSpec:
        # emptyDir: {}
        # hostPath:
        #   path: /data
        pvc:
          # annotations:
          #   volume.beta.kubernetes.io/storage-class: example-hostpath
          # labels:
          #   rack: rack-22
          # storageClassName: standard
          # accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 3Gi
    arbiter:
      enabled: false
      size: 1
      antiAffinityTopologyKey: "kubernetes.io/hostname"
      # tolerations: []
      # priorityClass: ""
      # annotations: {}
      # labels: {}
      # nodeSelector: {}
    # schedulerName: ""
    resources:
      limits:
        cpu: "1000m"
        memory: "1G"
      requests:
        cpu: "300m"
        memory: "0.5G"
    volumeSpec:
      # emptyDir: {}
      # hostPath:
      #   path: /data
      pvc:
        # annotations:
        #   volume.beta.kubernetes.io/storage-class: example-hostpath
        # labels:
        #   rack: rack-22
        # storageClassName: standard
        # accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 3Gi

sharding:
  enabled: true

  configrs:
    size: 3
    # externalNodes:
    # - host: 34.124.76.90
    # - host: 34.124.76.91
    #   port: 27017
    #   votes: 0
    #   priority: 0
    # - host: 34.124.76.92
    configuration: |
      security:
        enableEncryption: true
        encryptionCipherMode: "AES256-CBC"
        vault:
          serverName: vault.XXXX.XX
          port: 443
          tokenFile: /etc/mongodb-encryption/token
          secret: XXXX/cfg
          disableTLSForTesting: false
    #   operationProfiling:
    #     mode: slowOp
    #   systemLog:
    #     verbosity: 1
    antiAffinityTopologyKey: "kubernetes.io/hostname"
    # tolerations: []
    priorityClass: "system-node-critical"
    # annotations: {}
    # labels: {}
    # nodeSelector: {}
    # livenessProbe: {}
    # readinessProbe: {}
    # runtimeClassName: image-rc
    # sidecars:
    # - image: busybox
    #   command: ["/bin/sh"]
    #   args: ["-c", "while true; do echo echo $(date -u) 'test' >> /dev/null; sleep 5;done"]
    #   name: rs-sidecar-1
    #   volumeMounts:
    #     - mountPath: /volume1
    #       name: sidecar-volume-claim
    # sidecarPVCs: []
    # sidecarVolumes: []
    podDisruptionBudget:
      maxUnavailable: 1
    expose:
      enabled: false
      exposeType: ClusterIP
      # loadBalancerSourceRanges:
      #   - 10.0.0.0/8
      # serviceAnnotations:
      #   service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
      # serviceLabels: 
      #   some-label: some-key
    resources:
      limits:
        cpu: "300m"
        memory: "0.5G"
      requests:
        cpu: "300m"
        memory: "0.5G"
    volumeSpec:
      # emptyDir: {}
      # hostPath:
      #   path: /data
      #   type: Directory
      pvc:
        # annotations:
        #   volume.beta.kubernetes.io/storage-class: example-hostpath
        # labels:
        #   rack: rack-22
        storageClassName: "nfs-client"
        accessModes: [ "ReadWriteMany" ]
        resources:
          requests:
            storage: 3Gi

  mongos:
    size: 3
    # configuration: |
    #   systemLog:
    #     verbosity: 1
    antiAffinityTopologyKey: "kubernetes.io/hostname"
    # tolerations: []
    priorityClass: "system-node-critical"
    # annotations: {}
    # labels: {}
    # nodeSelector: {}
    # livenessProbe: {}
    # readinessProbe: {}
    # runtimeClassName: image-rc
    # sidecars:
    # - image: busybox
    #   command: ["/bin/sh"]
    #   args: ["-c", "while true; do echo echo $(date -u) 'test' >> /dev/null; sleep 5;done"]
    #   name: rs-sidecar-1
    #   volumeMounts:
    #     - mountPath: /volume1
    #       name: sidecar-volume-claim
    # sidecarPVCs: []
    # sidecarVolumes: []
    podDisruptionBudget:
      maxUnavailable: 1
    resources:
      limits:
        cpu: "300m"
        memory: "0.5G"
      requests:
        cpu: "300m"
        memory: "0.5G"
    expose:
      exposeType: ClusterIP
      # servicePerPod: true
      # loadBalancerSourceRanges:
      #   - 10.0.0.0/8
      # serviceAnnotations:
      #   service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
      # serviceLabels: 
      #   some-label: some-key
    # auditLog:
    #   destination: file
    #   format: BSON
    #   filter: '{}'

backup:
  enabled: true
  image:
    repository: percona/percona-backup-mongodb
    tag: 2.0.4
  serviceAccountName: percona-server-mongodb-operator
  # annotations:
  #   iam.amazonaws.com/role: role-arn
  # resources:
  #   limits:
  #     cpu: "300m"
  #     memory: "0.5G"
  #   requests:
  #     cpu: "300m"
  #     memory: "0.5G"
  storages:
    # s3-us-west:
    #   type: s3
    #   s3:
    #     bucket: S3-BACKUP-BUCKET-NAME-HERE
    #     credentialsSecret: my-cluster-name-backup-s3
    #     region: us-west-2
    #     prefix: ""
    #     uploadPartSize: 10485760
    #     maxUploadParts: 10000
    #     storageClass: STANDARD
    #     insecureSkipTLSVerify: false
    minio:
      type: s3
      s3:
        bucket: xzxxxx-percona
        region: us-west-1
        credentialsSecret: xxxxxx-minio
        endpointUrl: https://xxx.xxx.xxx:9000/
        prefix: "prod_"
    #   azure-blob:
    #     type: azure
    #     azure:
    #       container: CONTAINER-NAME
    #       prefix: PREFIX-NAME
    #       credentialsSecret: SECRET-NAME
  pitr:
    enabled: true
    oplogSpanMin: 10
    compressionType: gzip
    compressionLevel: 6
  tasks:
  - name: daily-minio
    enabled: true
    schedule: "0 0 * * *"
    keep: 10
    storageName: minio
    compressionType: gzip
  # - name: weekly-s3-us-west
  #   enabled: false
  #   schedule: "0 0 * * 0"
  #   keep: 5
  #   storageName: s3-us-west
  #   compressionType: gzip
  # - name: weekly-s3-us-west-physical
  #   enabled: false
  #   schedule: "0 5 * * 0"
  #   keep: 5
  #   type: physical
  #   storageName: s3-us-west
  #   compressionType: gzip
  #   compressionLevel: 6

Hi @Slavisa_Milojkovic!

Smart update is not disabled in your custom resource because you have:

updateStrategy: SmartUpdate

If you have problems with SmartUpdate you can set this to:

updateStrategy: RollingUpdate

and see how it goes.
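
If you deployed with the Helm chart, applying that would look something like this (the release name psmdb-db and the chart percona/psmdb-db are just my assumption based on your object names, so use whatever you actually installed with):

helm upgrade psmdb-db percona/psmdb-db -n percona -f values.yaml --set updateStrategy=RollingUpdate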

But in general you should monitor your CR and StatefulSets to see if something is changing and what is triggering the update (you can check the generation field in those objects; the number should increase on every spec change).
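For example, something like this (object names based on your logs and replset name, adjust if yours differ) will show whether the spec generation is actually moving:

kubectl get psmdb psmdb-db -n percona -o jsonpath='{.metadata.generation}{"\n"}'
kubectl get sts psmdb-db-rs0 psmdb-db-cfg psmdb-db-mongos -n percona \
  -o custom-columns=NAME:.metadata.name,GENERATION:.metadata.generation,OBSERVED:.status.observedGeneration

If a generation keeps increasing while you are not changing anything yourself, then something (Helm, a pipeline, a recreated secret, ...) is modifying the spec, and that is what makes SmartUpdate kick in.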
I don’t see anything specific in your CR except the use of HashiCorp Vault for encryption keys.
The problem is that your issue happens randomly, so even if I try to reproduce it with your CR I cannot be certain of anything.
So please try to monitor for changes; there might be a bug in the SmartUpdate logic.
Thanks!

Hello @Tomislav_Plavcic,
sorry, I confused the update strategy with the upgrade options…

I’ll try your suggestion and try to figure out what’s causing this behavior. This also happens with the PXC operator on our clusters, but there it restarts the pods successfully and there are no issues.
Vault secrets and encryption at rest have worked fine until now, so I don’t think they have anything to do with smart update.
The only other difference from the default values is that I increased the mongos replicas from 2 to 3.

So if I change the update strategy, the operator will still recover the cluster if something happens, I suppose?

The operator should work normally with the RollingUpdate strategy, but in that case it will not restart the pods in the optimal order (primary last), and on a database upgrade it will not upgrade the components in the optimal order: cfg, mongod, then mongos (if I’m not mistaken). It should still work, but I don’t suggest trying it straight away in production (first try it in a test environment and, if it solves your issues, then apply it to production).
The SmartUpdate strategy is preferred, and we mostly test with it enabled.

Do you maybe have some password/secret rotation going on (in general, but also for Vault)?
Also, are your SSL secrets maybe being rotated? That can cause the operator to restart the pods.
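
For example (psmdb-users and my-vault are the secret names from your values; psmdb-db-ssl and psmdb-db-ssl-internal are my guess at the default TLS secret names for your cluster, so double-check them):

kubectl get secrets -n percona --sort-by=.metadata.creationTimestamp
kubectl get secret psmdb-users my-vault psmdb-db-ssl psmdb-db-ssl-internal -n percona \
  -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp

If one of those timestamps lines up with the moment the operator started a SmartUpdate, that rotation is most likely the trigger.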

About the error message seen in the logs, there is a thread here: Cache Reader No keys found for HMAC that is valid for time - #7 by joaorpsoares
So in general I think the operator is having trouble connecting to and issuing commands against your cluster because of this error, so why it happens needs to be investigated further.
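
If you want to dig into that HMAC error itself, the cluster-time signing keys live in the admin.system.keys collection on the config replica set, so something like this (the clusterAdmin credentials come from your psmdb-users secret; depending on your TLS setup and image you may need extra connection options or the legacy mongo shell) should show whether the keys exist and what their validity windows are:

kubectl exec -it psmdb-db-cfg-0 -n percona -c mongod -- \
  mongosh -u clusterAdmin -p '<CLUSTER_ADMIN_PASSWORD>' --authenticationDatabase admin \
  --eval 'db.getSiblingDB("admin").system.keys.find().forEach(printjson)'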


Hello @Tomislav_Plavcic, yes, I do rotate Vault tokens weekly (the k8s secret is deleted and recreated by a pipeline), but usually a smart update is not triggered and there are no issues. SSL certs are managed by cert-manager, I think; I haven’t configured anything outside the Helm charts.
I’ll try to find out more when it happens again and will report back. If the issue becomes more regular, I’ll try RollingUpdate instead. Thank you
