Pods occasionally fail readiness checks, can't find out why, but the cluster otherwise works?

I have a 3-machine cluster (8c/16t, 64 GB RAM, 2x250 GB SSDs in a backup RAID per machine). There isn't much running on it; resource usage rarely exceeds 10%.

The problem I'm having is that some pods very often fail the readiness check, and I can't connect to them from the outside world. Internally, though, other pods from my HTTP services read from and write to the cluster without issues.
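For context, this is roughly how I check which replicas the service is actually routing to (the service name below is my guess based on the pod names, so adjust it if yours differs):

kubectl get pods -o wide
kubectl get endpoints mongo-cluster-rs0   # a pod that fails readiness should be missing from the endpoint list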

When I run describe on the pod, I see:

  Warning                 Unhealthy  26m (x14404 over 14d)                             kubelet  (combined from similar events): Readiness probe failed: 2025-01-07T13:44:43.445Z  INFO  Running mongodb-healthcheck  {"commit": "badcbc6fc9c8c590e73f98ab757c9ec7cf2b7935", "branch": "release-1-18-0"}
2025-01-07T13:44:43.445Z  INFO       Running Kubernetes readiness check for component  {"component": "mongod"}
2025-01-07T13:44:43.445Z  DEBUG      MongodReadinessCheck                              Connecting to localhost:27017
2025-01-07T13:44:43.446Z  ERROR      Failed to perform check                           {"error": "member failed Kubernetes readiness check: dial: dial tcp [::1]:27017: connect: connection refused", "errorVerbose": "dial tcp [::1]:27017: connect: connection refused\ndial\ngithub.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/healthcheck.MongodReadinessCheck\n\t/go/src/github.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/healthcheck/readiness.go:38\ngithub.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/tool.(*App).Run\n\t/go/src/github.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/tool/tool.go:114\nmain.main\n\t/go/src/github.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/main.go:67\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:271\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695\nmember failed Kubernetes readiness check"}
main.main
  /go/src/github.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/main.go:68
runtime.main
           /usr/local/go/src/runtime/proc.go:271
  Warning  BackOff  115s (x88085 over 14d)  kubelet  Back-off restarting failed container mongod in pod mongo-cluster-rs0-0_mongo(16427173-16c0-450a-bb99-f2014a46cc4f)

Any ideas on why this might be happening and why it's not recovering from it?

It seems that some other pods also suffer from restarts but eventually recover fine:

NAME                                               READY   STATUS    RESTARTS           AGE
percona-server-mongodb-operator-7f7764cd57-xldlm   1/1     Running   2 (22d ago)        32d
mongo-cluster-rs0-0                            1/2     Running   3542 (5m16s ago)   24d
mongo-cluster-rs0-1                            2/2     Running   4 (22d ago)        24d
mongo-cluster-rs0-2                            2/2     Running   0                  24d

I have not been able to find any clues in the logs, so any suggestions are appreciated.
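For reference, this is how I have been pulling the logs (the mongod container name is taken from the BackOff event above; --previous shows the last crashed instance):

kubectl logs mongo-cluster-rs0-0 -c mongod
kubectl logs mongo-cluster-rs0-0 -c mongod --previous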

Versions:

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: mongo-cluster
  finalizers:
spec:
  clusterServiceDNSMode: "External"
  crVersion: 1.18.0
  image: percona/percona-server-mongodb:7.0.14
  imagePullPolicy: Always
  allowUnsafeConfigurations: false
  updateStrategy: SmartUpdate
  upgradeOptions:
    versionServiceEndpoint: https://check.percona.com
    apply: disabled
    schedule: "0 2 * * *"
    setFCV: false
  secrets:
    users: mongo-cluster-secrets
    encryptionKey: mongo-cluster-mongodb-encryption-key
  # tls:
  #   mode: preferTLS
  pmm:
    enabled: false
    image: percona/pmm-client:2.43.2
    serverHost: monitoring-service
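For completeness, the readiness probe the operator actually renders can be read from the generated StatefulSet (the StatefulSet name below is assumed from the pod names):

kubectl get sts mongo-cluster-rs0 -o jsonpath='{.spec.template.spec.containers[?(@.name=="mongod")].readinessProbe}'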

@owlee

Yes, it's failing the readiness check. Did you verify connectivity by connecting to the pod directly (mongo-cluster-rs0-0) and checking the error logs/configuration, etc.? Did you get anything in kubectl logs mongo-cluster-rs0-0?

2025-01-07T13:44:43.446Z  ERROR      Failed to perform check                           {"error": "member failed Kubernetes readiness check: dial: dial tcp [::1]:27017: connect: connection refused", "errorVerbose": "dial tcp [::1]:27017: connect: connection refused\ndial\ngithub.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/healthcheck.MongodReadinessCheck\n\t/go/src/github.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/healthcheck/readiness.go:38\ngithub.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/tool.(*App).Run\n\t/go/src/github.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/tool/tool.go:114\nmain.main\n\t/go/src/github.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/main.go:67\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:271\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695\nmember failed Kubernetes readiness check"}

...

  Warning  BackOff  115s (x88085 over 14d)  kubelet  Back-off restarting failed container mongod in pod mongo-cluster-rs0-0_mongo(16427173-16c0-450a-bb99-f2014a46cc4f)

Inside the pod, you can verify the mongod process and other information as below:

kubectl exec -it mongo-cluster-rs0-0 -c mongod -- bash
bash> ps aux | grep mongod
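From the same shell you can also try the connection the probe is making. A minimal check, assuming mongosh ships in the percona-server-mongodb 7.0 image (the ping command does not require authentication):

bash> mongosh --host 127.0.0.1 --port 27017 --eval 'db.runCommand({ ping: 1 })'

If that is refused as well, mongod is not listening at all and the interesting part will be in its own logs rather than in the probe output.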

This helps in getting the resource-related information:
kubectl top pod mongo-cluster-rs0-0
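It may also be worth checking the node the pod is scheduled on for pressure conditions, even if overall usage looks low (replace <node-name> with the node shown by kubectl get pod -o wide):

kubectl top node
kubectl describe node <node-name> | grep -i pressure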

Have you tried deleting the pod/PVC as well? This will re-initialize the pod.
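For example, something along these lines (the PVC name follows the operator's usual mongod-data-<pod> naming, so confirm it with kubectl get pvc first):

kubectl get pvc
kubectl delete pvc mongod-data-mongo-cluster-rs0-0   # optional: stays Terminating until the pod is gone; forces a full resync
kubectl delete pod mongo-cluster-rs0-0               # the StatefulSet recreates the pod (and a fresh PVC if you deleted it)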

I see you are using v1.18.0 of the Percona Server for MongoDB Operator. Is that a customized image or the original one?
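You can confirm exactly which images the pod is running with:

kubectl get pod mongo-cluster-rs0-0 -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.image}{"\n"}{end}'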

Can you please share the full output of kubectl describe pod mongo-cluster-rs0-0 and the deployment file, if possible?