ReplicaSet host unreachable

Hello, we are having some issues with a psmdb replica set. Since we updated the operator and psmdb to 1.14, this has happened a couple of times on 2 different Kubernetes clusters.
At random, a replica set node reports a connection error, then we see the "msg":"SSL peer certificate validation failed","attr":{"reason":"self signed certificate"} error, and all 3 nodes go down. The pods keep restarting until we kill all 3 pods and the replica set is restarted node by node.
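
For reference, this is roughly what we run to bring the replica set back, deleting the rs0 pods one at a time and waiting for each to become ready before moving on (a minimal sketch; pod names assume the default <cluster>-rs0-<ordinal> naming and may differ in other setups):

for i in 0 1 2; do
  kubectl delete pod psmdb-db-rs0-$i -n percona
  # give the statefulset a moment to recreate the pod before waiting on it
  sleep 10
  kubectl wait --for=condition=Ready pod/psmdb-db-rs0-$i -n percona --timeout=300s
done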

These are the first errors reported by ReplicaSetMonitor-TaskExecutor when it happened:

The operator states that the cluster was changed and throws this error:


Thank you for your help

Hi @Slavisa_Milojkovic,

Does the issue happen on its own or after a change to the custom resource? And could you please share the full operator logs so we can investigate?
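
If it helps, something like this should capture the full operator log (assuming the operator deployment is named psmdb-operator and runs in the percona namespace; adjust to your install):

kubectl logs deployment/psmdb-operator -n percona --since=48h > psmdb-operator.log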

It happened 2 times on its own; no changes were made to psmdb. After the helm update it worked fine for a few days, and then this happened out of the blue.

These are the operator logs from when I restarted the replica set rs0 pods.

I attached the latest operator log.
psmdb-operator-b85979d4f-4tt4j.log (37.4 KB)

If you can share the operator logs from when the problem happened, it'd be much more helpful.

Also, could you please share your cr.yaml (kubectl get psmdb psmdb-db -o yaml) and helm values.yaml?
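
For example (release and namespace names are taken from your labels above, adjust if they differ):

kubectl get psmdb psmdb-db -n percona -o yaml > cr.yaml
helm get values psmdb-db -n percona > values.yaml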

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"psmdb.percona.com/v1","kind":"PerconaServerMongoDB"}
    meta.helm.sh/release-name: psmdb-db
    meta.helm.sh/release-namespace: percona
  creationTimestamp: "2023-02-03T11:06:26Z"
  finalizers:
  - delete-psmdb-pods-in-order
  generation: 4
  labels:
    app.kubernetes.io/instance: psmdb-db
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: psmdb-db
    app.kubernetes.io/version: 1.14.0
    helm.sh/chart: psmdb-db-1.14.0
  name: psmdb-db
  namespace: percona
  resourceVersion: "45983449"
  uid: 5114271a-317c-4bc9-8abb-ca08ba45750f
spec:
  backup:
    enabled: true
    image: percona/percona-backup-mongodb:2.0.4
    pitr:
      compressionLevel: 6
      compressionType: gzip
      enabled: true
      oplogSpanMin: 10
    serviceAccountName: percona-server-mongodb-operator
    storages:
      minio:
        s3:
          bucket: k8s-pd-prod-percona
          credentialsSecret: psmdb-backup-minio
          endpointUrl: https://os-us-west-1.webprovise.io:9000/
          prefix: prod_
          region: us-west-1
        type: s3
    tasks:
    - compressionType: gzip
      enabled: true
      keep: 10
      name: daily-minio
      schedule: 0 0 * * *
      storageName: minio
  crVersion: 1.14.0
  image: percona/percona-server-mongodb:6.0.4-3
  imagePullPolicy: Always
  multiCluster:
    enabled: false
  pause: false
  pmm:
    enabled: false
    image: percona/pmm-client:2.35.0
    serverHost: monitoring-service
  replsets:
  - affinity:
      antiAffinityTopologyKey: kubernetes.io/hostname
    arbiter:
      affinity:
        antiAffinityTopologyKey: kubernetes.io/hostname
      enabled: false
      size: 1
    expose:
      enabled: false
      exposeType: ClusterIP
    name: rs0
    nonvoting:
      affinity:
        antiAffinityTopologyKey: kubernetes.io/hostname
      enabled: false
      podDisruptionBudget:
        maxUnavailable: 1
      resources:
        limits:
          cpu: 300m
          memory: 0.5G
        requests:
          cpu: 300m
          memory: 0.5G
      size: 3
      volumeSpec:
        persistentVolumeClaim:
          resources:
            requests:
              storage: 3Gi
    podDisruptionBudget:
      maxUnavailable: 1
    priorityClassName: system-node-critical
    resources:
      limits:
        cpu: 1000m
        memory: 1G
      requests:
        cpu: 300m
        memory: 0.5G
    size: 3
    volumeSpec:
      persistentVolumeClaim:
        resources:
          requests:
            storage: 3Gi
  secrets:
    encryptionKey: psmdb-mongodb-encryption-key
    users: psmdb-users
  sharding:
    configsvrReplSet:
      affinity:
        antiAffinityTopologyKey: kubernetes.io/hostname
      expose:
        enabled: false
        exposeType: ClusterIP
      podDisruptionBudget:
        maxUnavailable: 1
      priorityClassName: system-node-critical
      resources:
        limits:
          cpu: 300m
          memory: 0.5G
        requests:
          cpu: 300m
          memory: 0.5G
      size: 3
      volumeSpec:
        persistentVolumeClaim:
          accessModes:
          - ReadWriteMany
          resources:
            requests:
              storage: 3Gi
          storageClassName: nfs-client
    enabled: true
    mongos:
      affinity:
        antiAffinityTopologyKey: kubernetes.io/hostname
      expose:
        exposeType: ClusterIP
      podDisruptionBudget:
        maxUnavailable: 1
      priorityClassName: system-node-critical
      resources:
        limits:
          cpu: 300m
          memory: 0.5G
        requests:
          cpu: 300m
          memory: 0.5G
      size: 3
  unmanaged: false
  updateStrategy: SmartUpdate
  upgradeOptions:
    apply: disabled
    schedule: 0 2 * * *
    setFCV: false
    versionServiceEndpoint: https://check.percona.com
status:
  conditions:
  - lastTransitionTime: "2023-04-10T01:58:30Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T02:04:00Z"
    reason: MongosReady
    status: "True"
    type: ready
  - lastTransitionTime: "2023-04-10T02:04:00Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T02:09:59Z"
    reason: MongosReady
    status: "True"
    type: ready
  - lastTransitionTime: "2023-04-10T02:09:59Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T02:15:31Z"
    reason: MongosReady
    status: "True"
    type: ready
  - lastTransitionTime: "2023-04-10T02:15:31Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T02:38:31Z"
    reason: MongosReady
    status: "True"
    type: ready
  - lastTransitionTime: "2023-04-10T02:38:31Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T06:26:53Z"
    reason: MongosReady
    status: "True"
    type: ready
  - lastTransitionTime: "2023-04-10T06:26:53Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T06:43:35Z"
    reason: MongosReady
    status: "True"
    type: ready
  - lastTransitionTime: "2023-04-10T06:43:35Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T06:49:17Z"
    reason: MongosReady
    status: "True"
    type: ready
  - lastTransitionTime: "2023-04-10T06:49:17Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T07:13:16Z"
    reason: MongosReady
    status: "True"
    type: ready
  - lastTransitionTime: "2023-04-10T07:13:16Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T07:13:26Z"
    message: 'rs0: ready'
    reason: RSReady
    status: "True"
    type: ready
  - lastTransitionTime: "2023-04-10T07:13:26Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T07:13:32Z"
    status: "True"
    type: ready
  host: psmdb-db-mongos.percona.svc.cluster.local
  mongoImage: percona/percona-server-mongodb:6.0.4-3
  mongoVersion: 6.0.4-3
  mongos:
    ready: 3
    size: 3
    status: ready
  observedGeneration: 4
  ready: 9
  replsets:
    cfg:
      initialized: true
      ready: 3
      size: 3
      status: ready
    rs0:
      added_as_shard: true
      initialized: true
      ready: 3
      size: 3
      status: ready
  size: 9
  state: ready

I’ve updated the first post with operator logs from when the issue started.

I only see symptoms in the logs (rs pods are not ready) but nothing that could be the cause. Could you please provide:

  1. Your Kubernetes platform and version
  2. Full logs from each replica set pod from the time the problem happened
  3. Full logs from the operator pod from the time the problem happened

Also, do you have anything else running on the Kubernetes cluster? Did you see any other problems? Is there any chance the Kubernetes cluster had problems of its own (e.g. network issues)?
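
For the pod logs, something along these lines should work; --previous captures the log of the container instance that ran before the restart (container and pod names assume the operator defaults):

kubectl version -o yaml
for i in 0 1 2; do
  kubectl logs psmdb-db-rs0-$i -c mongod -n percona > rs0-$i.log
  # the previous log only exists if that pod actually restarted
  kubectl logs psmdb-db-rs0-$i -c mongod -n percona --previous > rs0-$i-previous.log || true
done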

Here is the values file. We are using sealed secrets for the creds and encryption key; could that be the problem?

psmdb-values.txt (10.3 KB)

We are using sealed secrets for the creds and encryption key; could that be the problem?

I doubt it. At least it doesn’t explain pods becoming unready randomly.

I see you’re using NFS for persistent volumes. Is there any chance you lose connectivity to your NFS server, even for a short period?

That could be possible. Most errors are connection issues. I'll monitor it further and get back to you.


Hello, I still have this issue. I narrowed it down, and it seems that the operator can't recover the cluster once it starts the smart update. Can this update be disabled, and how does it detect that a statefulset is not up to date?
When this happens, 2 psmdb-cfg pods are restarted and the third one is not (its age is different), and all 3 mongos pods are showing a readiness error.

When I delete the third cfg pod, the operator continues restarting the rs0 pods one by one, then the mongos pods one by one. It seems to me that it starts restarting the cfg pods and fails on the third one; once I delete that pod and it is recreated, the operator continues restarting the rest of the psmdb pods. Please check the attached log and help me fix this issue if possible. Thank you.

psmdb-operator-6f9bf9585c-4xnpc.log (54.3 KB)
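
In the meantime, if I understand the docs correctly, the smart update logic can be avoided by switching spec.updateStrategy from SmartUpdate to RollingUpdate (or OnDelete), for example:

kubectl patch psmdb psmdb-db -n percona --type merge -p '{"spec":{"updateStrategy":"RollingUpdate"}}'

Though I'd prefer to keep SmartUpdate if the underlying problem with the stuck cfg pod can be fixed.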