Pods occasionally fail readiness checks, I can't find out why, but the cluster otherwise works?

I have a 3-machine cluster (8c/16t, 64 GB RAM, 2x250 GB SSDs in a backup RAID). Not much is running on it; resource usage rarely exceeds 10%.

The problem I’m having is that some pods very often fail the readiness check, and I can’t connect to them from the outside world. Internally, though, other pods from my HTTP services read and write to the cluster without issues.

When I run kubectl describe on the pod, I see:

  Warning                 Unhealthy  26m (x14404 over 14d)                             kubelet  (combined from similar events): Readiness probe failed: 2025-01-07T13:44:43.445Z  INFO  Running mongodb-healthcheck  {"commit": "badcbc6fc9c8c590e73f98ab757c9ec7cf2b7935", "branch": "release-1-18-0"}
2025-01-07T13:44:43.445Z  INFO       Running Kubernetes readiness check for component  {"component": "mongod"}
2025-01-07T13:44:43.445Z  DEBUG      MongodReadinessCheck                              Connecting to localhost:27017
2025-01-07T13:44:43.446Z  ERROR      Failed to perform check                           {"error": "member failed Kubernetes readiness check: dial: dial tcp [::1]:27017: connect: connection refused", "errorVerbose": "dial tcp [::1]:27017: connect: connection refused\ndial\ngithub.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/healthcheck.MongodReadinessCheck\n\t/go/src/github.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/healthcheck/readiness.go:38\ngithub.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/tool.(*App).Run\n\t/go/src/github.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/tool/tool.go:114\nmain.main\n\t/go/src/github.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/main.go:67\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:271\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695\nmember failed Kubernetes readiness check"}
main.main
  /go/src/github.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/main.go:68
runtime.main
           /usr/local/go/src/runtime/proc.go:271
  Warning  BackOff  115s (x88085 over 14d)  kubelet  Back-off restarting failed container mongod in pod mongo-cluster-rs0-0_mongo(16427173-16c0-450a-bb99-f2014a46cc4f)

Any ideas on why this might be happening and why it’s not recovering from it?

It seems that some other pods also suffer from restarts but eventually recover fine:

NAME                                               READY   STATUS    RESTARTS           AGE
percona-server-mongodb-operator-7f7764cd57-xldlm   1/1     Running   2 (22d ago)        32d
mongo-cluster-rs0-0                                1/2     Running   3542 (5m16s ago)   24d
mongo-cluster-rs0-1                                2/2     Running   4 (22d ago)        24d
mongo-cluster-rs0-2                                2/2     Running   0                  24d

I have not been able to find any clues in the logs, so any suggestions are appreciated.

Versions:

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: mongo-cluster
  finalizers:
spec:
  clusterServiceDNSMode: "External"
  crVersion: 1.18.0
  image: percona/percona-server-mongodb:7.0.14
  imagePullPolicy: Always
  allowUnsafeConfigurations: false
  updateStrategy: SmartUpdate
  upgradeOptions:
    versionServiceEndpoint: https://check.percona.com
    apply: disabled
    schedule: "0 2 * * *"
    setFCV: false
  secrets:
    users: mongo-cluster-secrets
    encryptionKey: mongo-cluster-mongodb-encryption-key
  # tls:
  #   mode: preferTLS
  pmm:
    enabled: false
    image: percona/pmm-client:2.43.2
    serverHost: monitoring-service

@owlee

Yes, it’s failing the readiness check. Did you verify connectivity by connecting directly to the pod (cluster-rs0-0) and checking the error logs/configuration, etc.? Did you get anything in kubectl logs mongo-cluster-rs0-0?

2025-01-07T13:44:43.446Z  ERROR      Failed to perform check                           {"error": "member failed Kubernetes readiness check: dial: dial tcp [::1]:27017: connect: connection refused", "errorVerbose": "dial tcp [::1]:27017: connect: connection refused\ndial\ngithub.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/healthcheck.MongodReadinessCheck\n\t/go/src/github.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/healthcheck/readiness.go:38\ngithub.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/tool.(*App).Run\n\t/go/src/github.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/tool/tool.go:114\nmain.main\n\t/go/src/github.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/main.go:67\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:271\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695\nmember failed Kubernetes readiness check"}

...

  Warning  BackOff  115s (x88085 over 14d)  kubelet  Back-off restarting failed container mongod in pod mongo-cluster-rs0-0_mongo(16427173-16c0-450a-bb99-f2014a46cc4f)
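To see why mongod itself is exiting, it can also help to pull the logs of the last crashed container (a sketch, assuming the pod runs in the mongo namespace):

kubectl logs mongo-cluster-rs0-0 -c mongod -n mongo
kubectl logs mongo-cluster-rs0-0 -c mongod -n mongo --previous   # output from the previous (crashed) container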

Inside the pod, you can verify the mongod process and other information as below:

kubectl exec -it mongo-cluster-rs0-0 -- bash
bash> ps aux | grep mongod

This helps in getting the resource-related information:
kubectl top pod mongo-cluster-rs0-0
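
Node-level pressure and recent warning events can be checked in a similar way (a sketch, assuming the mongo namespace):

kubectl top nodes
kubectl get events -n mongo --sort-by=.lastTimestamp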

Have you also tried deleting the pod/PVC? This will re-initialize the pod.
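For example (a sketch, assuming the PVC follows the mongod-data-<pod> naming used by the operator; note that deleting the PVC wipes that member’s local data, so it will resync from the other replica set members):

kubectl delete pvc mongod-data-mongo-cluster-rs0-0 -n mongo   # stays Terminating while the pod still mounts it
kubectl delete pod mongo-cluster-rs0-0 -n mongo               # the StatefulSet recreates the pod with a fresh PVC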

I see you are using v1.18.0 of the Percona MongoDB Operator. Is that a customized image or the original one?

Can you please share the full output of kubectl describe pod mongo-cluster-rs0-0 and, if possible, the deployment file?
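Something like this should capture everything needed (a sketch, assuming the mongo namespace and that the cluster is managed through the psmdb custom resource):

kubectl describe pod mongo-cluster-rs0-0 -n mongo
kubectl get psmdb mongo-cluster -n mongo -o yaml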

@anil.joshi Sorry for the late reply; I was waiting for the error to happen again.

After my post I had the idea to downgrade MongoDB from 8.x/7.x to an older 6.x version. I don’t know the exact versions, but I can look them up if it helps.

Things were good for almost a month and a half after that. Then, 6 days ago, the operator did something routine; I’m not sure if it updated anything, but all the pods now show an age of 6 days. My second replica, however, seems to be stuck in CrashLoopBackOff with no clues in the logs as to why, and it looks pretty similar to what was happening before.

There’s no error when I read the logs via kubectl, but when I try to open a bash shell in the pod I see this:

Defaulted container "mongod" out of: mongod, backup-agent, mongo-init (init)
error: unable to upgrade connection: container not found ("mongod")

This is the top command output for the failing pod:

NAME                      CPU(cores)   MEMORY(bytes)
mongo-cluster-rs0-1       2m           18Mi

The working pod, for example, has this top output:

NAME                      CPU(cores)   MEMORY(bytes)
mongo-cluster-rs0-0       63m          1322Mi

Here’s the full describe of the failing pod:

Name:             mongo-cluster-rs0-1
Namespace:        mongo
Priority:         0
Service Account:  default
Node:             main/116.202.211.253
Start Time:       Tue, 11 Feb 2025 20:28:18 +0000
Labels:           app.kubernetes.io/component=mongod
                  app.kubernetes.io/instance=mongo-cluster
                  app.kubernetes.io/managed-by=percona-server-mongodb-operator
                  app.kubernetes.io/name=percona-server-mongodb
                  app.kubernetes.io/part-of=percona-server-mongodb
                  app.kubernetes.io/replset=rs0
                  apps.kubernetes.io/pod-index=1
                  controller-revision-hash=mongo-cluster-rs0-596d9549fd
                  statefulset.kubernetes.io/pod-name=mongo-cluster-rs0-1
Annotations:      percona.com/ssl-hash: 3f3aad45cbb485dfa31d311d7fdddf19
                  percona.com/ssl-internal-hash: e16c53f3214fedf85baa45d1a1d0ca4b
Status:           Running
IP:               10.42.0.170
IPs:
  IP:           10.42.0.170
Controlled By:  StatefulSet/mongo-cluster-rs0
Init Containers:
  mongo-init:
    Container ID:  containerd://3fea5ebf2588450623c11ab8a813af30529fe50196e074eeaa9cc43147c269b9
    Image:         percona/percona-server-mongodb-operator:1.19.0
    Image ID:      docker.io/percona/percona-server-mongodb-operator@sha256:863f2027ed62e6be6b790647883dfc44620357c47901da92539436c449eff165
    Port:          <none>
    Host Port:     <none>
    Command:
      /init-entrypoint.sh
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 11 Feb 2025 20:28:20 +0000
      Finished:     Tue, 11 Feb 2025 20:28:20 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  2G
    Requests:
      cpu:        500m
      memory:     2G
    Environment:  <none>
    Mounts:
      /data/db from mongod-data (rw)
      /opt/percona from bin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5zfvs (ro)
Containers:
  mongod:
    Container ID:  containerd://aa5fb17076f70267fadb25cb1ac0b56cc8a6909f2f7b3e72d62d1676f57f6e70
    Image:         percona/percona-server-mongodb:6.0.13
    Image ID:      docker.io/percona/percona-server-mongodb@sha256:bea427fee9477742c8c628f55d6a504602d47a0674752caf822bb1990e821b54
    Port:          27017/TCP
    Host Port:     0/TCP
    Command:
      /opt/percona/ps-entry.sh
    Args:
      --bind_ip_all
      --auth
      --dbpath=/data/db
      --port=27017
      --replSet=rs0
      --storageEngine=wiredTiger
      --relaxPermChecks
      --sslAllowInvalidCertificates
      --clusterAuthMode=x509
      --tlsMode=preferTLS
      --enableEncryption
      --encryptionKeyFile=/etc/mongodb-encryption/encryption-key
      --wiredTigerCacheSizeGB=0.43
      --wiredTigerIndexPrefixCompression=true
      --quiet
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 18 Feb 2025 15:21:47 +0000
      Finished:     Tue, 18 Feb 2025 15:22:25 +0000
    Ready:          False
    Restart Count:  354
    Limits:
      cpu:     500m
      memory:  2G
    Requests:
      cpu:      500m
      memory:   2G
    Liveness:   exec [/opt/percona/mongodb-healthcheck k8s liveness --ssl --sslInsecure --sslCAFile /etc/mongodb-ssl/ca.crt --sslPEMKeyFile /tmp/tls.pem --startupDelaySeconds 7200] delay=60s timeout=10s period=30s #success=1 #failure=4
    Readiness:  exec [/opt/percona/mongodb-healthcheck k8s readiness --component mongod] delay=10s timeout=2s period=3s #success=1 #failure=8
    Environment Variables from:
      internal-mongo-cluster-users  Secret  Optional: false
    Environment:
      SERVICE_NAME:     mongo-cluster
      NAMESPACE:        mongo
      MONGODB_PORT:     27017
      MONGODB_REPLSET:  rs0
    Mounts:
      /data/db from mongod-data (rw)
      /etc/mongodb-encryption from mongo-cluster-mongodb-encryption-key (ro)
      /etc/mongodb-secrets from mongo-cluster-mongodb-keyfile (ro)
      /etc/mongodb-ssl from ssl (ro)
      /etc/mongodb-ssl-internal from ssl-internal (ro)
      /etc/users-secret from users-secret-file (rw)
      /opt/percona from bin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5zfvs (ro)
  backup-agent:
    Container ID:  containerd://7255753fe5c0830c1dd61da3b88dd703dc6d737809e4a354ee979269e8aae1c7
    Image:         percona/percona-backup-mongodb:2.7.0
    Image ID:      docker.io/percona/percona-backup-mongodb@sha256:4e29486419f06be69e5ce15490ff46b68cf44958c9ca716fa1eaba17cf32701b
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/percona/pbm-entry.sh
    Args:
      pbm-agent-entrypoint
    State:          Running
      Started:      Tue, 11 Feb 2025 20:28:22 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      PBM_AGENT_MONGODB_USERNAME:  <set to the key 'MONGODB_BACKUP_USER' in secret 'internal-mongo-cluster-users'>      Optional: false
      PBM_AGENT_MONGODB_PASSWORD:  <set to the key 'MONGODB_BACKUP_PASSWORD' in secret 'internal-mongo-cluster-users'>  Optional: false
      PBM_MONGODB_REPLSET:         rs0
      PBM_MONGODB_PORT:            27017
      PBM_AGENT_SIDECAR:           true
      PBM_AGENT_SIDECAR_SLEEP:     5
      POD_NAME:                    mongo-cluster-rs0-1 (v1:metadata.name)
      PBM_MONGODB_URI:             mongodb://$(PBM_AGENT_MONGODB_USERNAME):$(PBM_AGENT_MONGODB_PASSWORD)@$(POD_NAME)
      PBM_AGENT_TLS_ENABLED:       true
    Mounts:
      /data/db from mongod-data (rw)
      /etc/mongodb-ssl from ssl (ro)
      /opt/percona from bin (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5zfvs (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  mongod-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  mongod-data-mongo-cluster-rs0-1
    ReadOnly:   false
  mongo-cluster-mongodb-keyfile:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  mongo-cluster-mongodb-keyfile
    Optional:    false
  bin:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  mongo-cluster-mongodb-encryption-key:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  mongo-cluster-mongodb-encryption-key
    Optional:    false
  ssl:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  mongo-cluster-ssl
    Optional:    false
  ssl-internal:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  mongo-cluster-ssl-internal
    Optional:    true
  users-secret-file:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  internal-mongo-cluster-users
    Optional:    false
  kube-api-access-5zfvs:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Normal   Pulling  29m (x349 over 2d23h)   kubelet  Pulling image "percona/percona-server-mongodb:6.0.13"
  Warning  BackOff  4m17s (x8693 over 33h)  kubelet  Back-off restarting failed container mongod in pod mongo-cluster-rs0-1_mongo(c72dee4a-e500-495c-9f30-d0289280a98f)

@anil.joshi
I’ve tried a few things, including redeployments, but for some reason, even one month later, the crashes continue every few minutes. I think the counter is up to 6000+ restarts, solely for the rs0-1 replica.

@owlee

The pod mongo-cluster-rs0-1 seems to be impacted by an OOMKilled (out-of-memory) issue.

Containers:
  mongod:
    Container ID:  containerd://aa5fb17076f70267fadb25cb1ac0b56cc8a6909f2f7b3e72d62d1676f57f6e70
    Image:         percona/percona-server-mongodb:6.0.13
    Image ID:      docker.io/percona/percona-server-mongodb@sha256:bea427fee9477742c8c628f55d6a504602d47a0674752caf822bb1990e821b54
    Port:          27017/TCP
    Host Port:     0/TCP
    Command:
      /opt/percona/ps-entry.sh
    ...

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
The resources seem to be used at full capacity. Please verify whether the same resources are defined for the other, non-impacted pods as well. You might need to try increased resources (see the sketch after the quoted limits below) and check whether that avoids the issue.

    Limits:
      cpu:     500m
      memory:  2G
    Requests:
      cpu:      500m
      memory:   2G
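
If increasing the limits is the route you take, they are set per replica set in the PerconaServerMongoDB custom resource. A minimal sketch of the relevant section (values are illustrative only, not a sizing recommendation):

spec:
  replsets:
    - name: rs0
      size: 3
      resources:
        limits:
          cpu: "1"
          memory: 4G   # comfortably above the ~1.3Gi observed on the healthy rs0-0
        requests:
          cpu: "1"
          memory: 4G

With updateStrategy: SmartUpdate (as in your CR), the operator should roll the pods one by one to apply the change.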