I have a 3-machine cluster (8c/16t, 64 GB RAM, 2x250 GB SSDs in mirrored RAID per node). Not much is running on it; resource usage rarely exceeds 10%.
The problem I’m having is that some pods frequently fail their readiness checks, so I can’t connect to them from outside the cluster. Internally, however, the pods running my HTTP services read from and write to the database without issues.
When I run kubectl describe on the pod I see:
Warning Unhealthy 26m (x14404 over 14d) kubelet (combined from similar events): Readiness probe failed: 2025-01-07T13:44:43.445Z INFO Running mongodb-healthcheck {"commit": "badcbc6fc9c8c590e73f98ab757c9ec7cf2b7935", "branch": "release-1-18-0"}
2025-01-07T13:44:43.445Z INFO Running Kubernetes readiness check for component {"component": "mongod"}
2025-01-07T13:44:43.445Z DEBUG MongodReadinessCheck Connecting to localhost:27017
2025-01-07T13:44:43.446Z ERROR Failed to perform check {"error": "member failed Kubernetes readiness check: dial: dial tcp [::1]:27017: connect: connection refused", "errorVerbose": "dial tcp [::1]:27017: connect: connection refused\ndial\ngithub.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/healthcheck.MongodReadinessCheck\n\t/go/src/github.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/healthcheck/readiness.go:38\ngithub.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/tool.(*App).Run\n\t/go/src/github.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/tool/tool.go:114\nmain.main\n\t/go/src/github.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/main.go:67\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:271\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695\nmember failed Kubernetes readiness check"}
main.main
/go/src/github.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/main.go:68
runtime.main
/usr/local/go/src/runtime/proc.go:271
Warning BackOff 115s (x88085 over 14d) kubelet Back-off restarting failed container mongod in pod mongo-cluster-rs0-0_mongo(16427173-16c0-450a-bb99-f2014a46cc4f)
Any ideas on why this might be happening and why it’s not recovering from it?
Some other pods also seem to suffer from restarts but eventually recover fine:
NAME READY STATUS RESTARTS AGE
percona-server-mongodb-operator-7f7764cd57-xldlm 1/1 Running 2 (22d ago) 32d
mongo-cluster-rs0-0 1/2 Running 3542 (5m16s ago) 24d
mongo-cluster-rs0-1 2/2 Running 4 (22d ago) 24d
mongo-cluster-rs0-2 2/2 Running 0 24d
I have not been able to find any clues in the logs, so any suggestions are appreciated.
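In case more detail would help, this is roughly how I can gather it (a sketch; the namespace appears to be mongo based on the event above, and I'm assuming mongosh is available in the mongod container):

# Logs from the previous, crash-looped mongod container
kubectl logs mongo-cluster-rs0-0 -c mongod -n mongo --previous

# Check whether mongod is reachable at all from inside the container
kubectl exec -it mongo-cluster-rs0-0 -c mongod -n mongo -- mongosh --quiet --eval 'db.hello()'

# Recent events scoped to just this pod
kubectl get events -n mongo --field-selector involvedObject.name=mongo-cluster-rs0-0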
Versions:
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: mongo-cluster
  finalizers:
spec:
  clusterServiceDNSMode: "External"
  crVersion: 1.18.0
  image: percona/percona-server-mongodb:7.0.14
  imagePullPolicy: Always
  allowUnsafeConfigurations: false
  updateStrategy: SmartUpdate
  upgradeOptions:
    versionServiceEndpoint: https://check.percona.com
    apply: disabled
    schedule: "0 2 * * *"
    setFCV: false
  secrets:
    users: mongo-cluster-secrets
    encryptionKey: mongo-cluster-mongodb-encryption-key
  # tls:
  #   mode: preferTLS
  pmm:
    enabled: false
    image: percona/pmm-client:2.43.2
    serverHost: monitoring-service
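If it helps, here is a sketch of how I can report the operator's and the replica set's view of the cluster (names taken from the spec above; the namespace and mongosh availability are assumptions on my part, and I believe the psmdb short name is registered by the operator's CRD):

# Current state of the custom resource as seen by the operator
kubectl get psmdb mongo-cluster -n mongo

# Replica-set view from a member that is currently healthy
# (db.hello() works without authentication and shows setName, hosts and the primary)
kubectl exec -it mongo-cluster-rs0-2 -c mongod -n mongo -- mongosh --quiet --eval 'db.hello()'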