ReplicaSet host unreachable

Hello, we are having some issues with a psmdb replica set. Since we updated the operator and psmdb to 1.14, this has happened a couple of times on 2 different Kubernetes clusters.
At random, a replica set node reports a connection error, then we see the "msg":"SSL peer certificate validation failed","attr":{"reason":"self signed certificate"} error, and all 3 nodes go down. The pods keep restarting until we kill all 3 pods and the replica set is restarted node by node.
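
For reference, this is roughly what we run to bring the replica set back, deleting the rs0 pods one at a time and waiting for each to become ready before moving on (a minimal sketch; pod names assume the default <cluster>-rs0-<ordinal> naming and may differ in other setups):

for i in 0 1 2; do
  kubectl delete pod psmdb-db-rs0-$i -n percona
  # give the statefulset a moment to recreate the pod before waiting on it
  sleep 10
  kubectl wait --for=condition=Ready pod/psmdb-db-rs0-$i -n percona --timeout=300s
done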

These are the first errors reported by ReplicaSetMonitor-TaskExecutor when it happened:

The operator states that the cluster was changed and throws this error:


Thank you for your help

Hi @Slavisa_Milojkovic,

Does the issue happen on its own or after a change to the custom resource? And could you please share the full operator logs so we can investigate?
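
If it helps, something like this should capture the full operator log (assuming the operator deployment is named psmdb-operator and runs in the percona namespace; adjust to your install):

kubectl logs deployment/psmdb-operator -n percona --since=48h > psmdb-operator.log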

It happened 2 times on its own; no changes were made to psmdb. After the helm update it worked fine for a few days, and then this happened out of the blue.

These are the operator logs from when I restarted the replica set rs0 pods.

I attached the latest operator log.
psmdb-operator-b85979d4f-4tt4j.log (37.4 KB)

If you can share the operator logs from when the problem happened, it'd be much more helpful.

Also, could you please share your cr.yaml (kubectl get psmdb psmdb-db -o yaml) and helm values.yaml?
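
For example (release and namespace names are taken from your labels above, adjust if they differ):

kubectl get psmdb psmdb-db -n percona -o yaml > cr.yaml
helm get values psmdb-db -n percona > values.yaml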

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"psmdb.percona.com/v1","kind":"PerconaServerMongoDB"}
    meta.helm.sh/release-name: psmdb-db
    meta.helm.sh/release-namespace: percona
  creationTimestamp: "2023-02-03T11:06:26Z"
  finalizers:
  - delete-psmdb-pods-in-order
  generation: 4
  labels:
    app.kubernetes.io/instance: psmdb-db
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: psmdb-db
    app.kubernetes.io/version: 1.14.0
    helm.sh/chart: psmdb-db-1.14.0
  name: psmdb-db
  namespace: percona
  resourceVersion: "45983449"
  uid: 5114271a-317c-4bc9-8abb-ca08ba45750f
spec:
  backup:
    enabled: true
    image: percona/percona-backup-mongodb:2.0.4
    pitr:
      compressionLevel: 6
      compressionType: gzip
      enabled: true
      oplogSpanMin: 10
    serviceAccountName: percona-server-mongodb-operator
    storages:
      minio:
        s3:
          bucket: k8s-pd-prod-percona
          credentialsSecret: psmdb-backup-minio
          endpointUrl: https://os-us-west-1.webprovise.io:9000/
          prefix: prod_
          region: us-west-1
        type: s3
    tasks:
    - compressionType: gzip
      enabled: true
      keep: 10
      name: daily-minio
      schedule: 0 0 * * *
      storageName: minio
  crVersion: 1.14.0
  image: percona/percona-server-mongodb:6.0.4-3
  imagePullPolicy: Always
  multiCluster:
    enabled: false
  pause: false
  pmm:
    enabled: false
    image: percona/pmm-client:2.35.0
    serverHost: monitoring-service
  replsets:
  - affinity:
      antiAffinityTopologyKey: kubernetes.io/hostname
    arbiter:
      affinity:
        antiAffinityTopologyKey: kubernetes.io/hostname
      enabled: false
      size: 1
    expose:
      enabled: false
      exposeType: ClusterIP
    name: rs0
    nonvoting:
      affinity:
        antiAffinityTopologyKey: kubernetes.io/hostname
      enabled: false
      podDisruptionBudget:
        maxUnavailable: 1
      resources:
        limits:
          cpu: 300m
          memory: 0.5G
        requests:
          cpu: 300m
          memory: 0.5G
      size: 3
      volumeSpec:
        persistentVolumeClaim:
          resources:
            requests:
              storage: 3Gi
    podDisruptionBudget:
      maxUnavailable: 1
    priorityClassName: system-node-critical
    resources:
      limits:
        cpu: 1000m
        memory: 1G
      requests:
        cpu: 300m
        memory: 0.5G
    size: 3
    volumeSpec:
      persistentVolumeClaim:
        resources:
          requests:
            storage: 3Gi
  secrets:
    encryptionKey: psmdb-mongodb-encryption-key
    users: psmdb-users
  sharding:
    configsvrReplSet:
      affinity:
        antiAffinityTopologyKey: kubernetes.io/hostname
      expose:
        enabled: false
        exposeType: ClusterIP
      podDisruptionBudget:
        maxUnavailable: 1
      priorityClassName: system-node-critical
      resources:
        limits:
          cpu: 300m
          memory: 0.5G
        requests:
          cpu: 300m
          memory: 0.5G
      size: 3
      volumeSpec:
        persistentVolumeClaim:
          accessModes:
          - ReadWriteMany
          resources:
            requests:
              storage: 3Gi
          storageClassName: nfs-client
    enabled: true
    mongos:
      affinity:
        antiAffinityTopologyKey: kubernetes.io/hostname
      expose:
        exposeType: ClusterIP
      podDisruptionBudget:
        maxUnavailable: 1
      priorityClassName: system-node-critical
      resources:
        limits:
          cpu: 300m
          memory: 0.5G
        requests:
          cpu: 300m
          memory: 0.5G
      size: 3
  unmanaged: false
  updateStrategy: SmartUpdate
  upgradeOptions:
    apply: disabled
    schedule: 0 2 * * *
    setFCV: false
    versionServiceEndpoint: https://check.percona.com
status:
  conditions:
  - lastTransitionTime: "2023-04-10T01:58:30Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T02:04:00Z"
    reason: MongosReady
    status: "True"
    type: ready
  - lastTransitionTime: "2023-04-10T02:04:00Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T02:09:59Z"
    reason: MongosReady
    status: "True"
    type: ready
  - lastTransitionTime: "2023-04-10T02:09:59Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T02:15:31Z"
    reason: MongosReady
    status: "True"
    type: ready
  - lastTransitionTime: "2023-04-10T02:15:31Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T02:38:31Z"
    reason: MongosReady
    status: "True"
    type: ready
  - lastTransitionTime: "2023-04-10T02:38:31Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T06:26:53Z"
    reason: MongosReady
    status: "True"
    type: ready
  - lastTransitionTime: "2023-04-10T06:26:53Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T06:43:35Z"
    reason: MongosReady
    status: "True"
    type: ready
  - lastTransitionTime: "2023-04-10T06:43:35Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T06:49:17Z"
    reason: MongosReady
    status: "True"
    type: ready
  - lastTransitionTime: "2023-04-10T06:49:17Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T07:13:16Z"
    reason: MongosReady
    status: "True"
    type: ready
  - lastTransitionTime: "2023-04-10T07:13:16Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T07:13:26Z"
    message: 'rs0: ready'
    reason: RSReady
    status: "True"
    type: ready
  - lastTransitionTime: "2023-04-10T07:13:26Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T07:13:32Z"
    status: "True"
    type: ready
  host: psmdb-db-mongos.percona.svc.cluster.local
  mongoImage: percona/percona-server-mongodb:6.0.4-3
  mongoVersion: 6.0.4-3
  mongos:
    ready: 3
    size: 3
    status: ready
  observedGeneration: 4
  ready: 9
  replsets:
    cfg:
      initialized: true
      ready: 3
      size: 3
      status: ready
    rs0:
      added_as_shard: true
      initialized: true
      ready: 3
      size: 3
      status: ready
  size: 9
  state: ready

I’ve updated the first post with operator logs from when the issue started.

I only see symptoms in the logs (rs pods are not ready) but nothing that could be the cause. Could you please provide:

  1. Your Kubernetes platform and version
  2. Full logs from each replica set pod from the time the problem happened
  3. Full logs from the operator pod from the time the problem happened

Also, do you have anything else running on the Kubernetes cluster? Did you see any other problems? Is there any chance the Kubernetes cluster had problems of its own (e.g. network issues)?
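
For the pod logs, something along these lines should work; --previous captures the log of the container instance that ran before the restart (container and pod names assume the operator defaults):

kubectl version -o yaml
for i in 0 1 2; do
  kubectl logs psmdb-db-rs0-$i -c mongod -n percona > rs0-$i.log
  # the previous log only exists if that pod actually restarted
  kubectl logs psmdb-db-rs0-$i -c mongod -n percona --previous > rs0-$i-previous.log || true
done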

Here is the values file. We are using sealed secrets for the creds and encryption key; could that be the problem?

psmdb-values.txt (10.3 KB)

We are using sealed secrets for the creds and encryption key; could that be the problem?

I doubt it. At least it doesn’t explain pods becoming unready randomly.

I see you’re using NFS for persistent volumes. Is there any chance you lose connectivity to your NFS server, even for a short period?

That could be possible. Most errors are connection issues. I'll monitor it further and get back to you.


Hello, I still have this issue. I narrowed it down, and it seems that the operator can't recover the cluster once it starts the smart update. Can this update be disabled, and how does it detect that a statefulset is not up to date?
When this happens, 2 psmdb-cfg pods are restarted and the third one is not (its age is different), and all 3 mongos pods are showing a readiness error.

When I delete the third cfg pod, the operator continues restarting the rs0 pods one by one, then the mongos pods one by one. It seems to me that it starts restarting the cfg pods and fails on the third one; once I delete that pod and it is recreated, the operator continues restarting the rest of the psmdb pods. Please check the attached log and help me fix this issue if possible. Thank you.

psmdb-operator-6f9bf9585c-4xnpc.log (54.3 KB)
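
In the meantime, if I understand the docs correctly, the smart update logic can be avoided by switching spec.updateStrategy from SmartUpdate to RollingUpdate (or OnDelete), for example:

kubectl patch psmdb psmdb-db -n percona --type merge -p '{"spec":{"updateStrategy":"RollingUpdate"}}'

Though I'd prefer to keep SmartUpdate if the underlying problem with the stuck cfg pod can be fixed.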