I have a 3-machine cluster: 8c/16t, 64 GB RAM, 2x250 GB SSDs in a backup RAID. There's not much running on it; resource usage rarely exceeds 10%.
The problem I'm having is that some pods very often fail the readiness check and I can't connect to them from the outside world. Internally, though, some other pods from my HTTP services read and write to the cluster without issues.
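My working assumption is that once the readiness probe fails, the pod gets pulled out of its Service endpoints, which would explain why connections from outside fail while internal traffic that lands on the healthy members keeps working. This is roughly what I check it with (the namespace is mongo, per the pod events below):
kubectl -n mongo get pods -o wide
kubectl -n mongo get endpoints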
When I run describe on the pod, I see:
Warning Unhealthy 26m (x14404 over 14d) kubelet (combined from similar events): Readiness probe failed: 2025-01-07T13:44:43.445Z INFO Running mongodb-healthcheck {"commit": "badcbc6fc9c8c590e73f98ab757c9ec7cf2b7935", "branch": "release-1-18-0"}
2025-01-07T13:44:43.445Z INFO Running Kubernetes readiness check for component {"component": "mongod"}
2025-01-07T13:44:43.445Z DEBUG MongodReadinessCheck Connecting to localhost:27017
2025-01-07T13:44:43.446Z ERROR Failed to perform check {"error": "member failed Kubernetes readiness check: dial: dial tcp [::1]:27017: connect: connection refused", "errorVerbose": "dial tcp [::1]:27017: connect: connection refused\ndial\ngithub.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/healthcheck.MongodReadinessCheck\n\t/go/src/github.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/healthcheck/readiness.go:38\ngithub.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/tool.(*App).Run\n\t/go/src/github.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/tool/tool.go:114\nmain.main\n\t/go/src/github.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/main.go:67\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:271\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695\nmember failed Kubernetes readiness check"}
main.main
/go/src/github.com/percona/percona-server-mongodb-operator/cmd/mongodb-healthcheck/main.go:68
runtime.main
/usr/local/go/src/runtime/proc.go:271
Warning BackOff 115s (x88085 over 14d) kubelet Back-off restarting failed container mongod in pod mongo-cluster-rs0-0_mongo(16427173-16c0-450a-bb99-f2014a46cc4f)
Any ideas on why this might be happening and why it's not recovering from it?
It seems that some other pods also suffer from restarts but eventually recover fine.
Yes, it's failing the readiness check. Did you verify the connectivity by directly connecting to the pod (cluster-rs0-0) and checking the error logs/configuration, etc.? Did you get anything in kubectl logs mongo-cluster-rs0-0?
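Something along these lines should work (names taken from your events; adjust as needed, and the shell binary may differ depending on the image):
kubectl -n mongo logs mongo-cluster-rs0-0 -c mongod --tail=200
kubectl -n mongo exec mongo-cluster-rs0-0 -c mongod -- mongosh --port 27017 --eval 'db.hello()'
The second command checks whether mongod is actually listening on localhost:27017, which is exactly what the readiness probe's "connection refused" suggests it is not.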
@anil.joshi Sorry for the late reply, I was waiting for an error to happen again.
After my post I had the idea to downgrade to an older MongoDB version, from 8.x/7.x to 6.x. I don't know the exact versions, but I can look them up if it helps.
Things were good for almost a month and a half after that. Then, 6 days ago, the operator did something routine; I'm not sure if it updated anything, but all pods now show an age of 6 days. But my 2nd replica seems to be stuck in CrashLoopBackOff with no clues in the logs as to why, and it looks pretty similar to what was happening before.
There's nothing error-related when I read the logs via kubectl, but when I try to open a bash shell in the pod I see this:
Defaulted container "mongod" out of: mongod, backup-agent, mongo-init (init)
error: unable to upgrade connection: container not found ("mongod")
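In case the exact commands matter, this is roughly what produces the message above, plus the --previous variant that should dump the output of the last crashed mongod container (if there is anything in it):
kubectl -n mongo exec -it mongo-cluster-rs0-1 -- bash
kubectl -n mongo logs mongo-cluster-rs0-1 -c mongod --previous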
This is the kubectl top output for the failing pod:
NAME CPU(cores) MEMORY(bytes)
mongo-cluster-rs0-1 2m 18Mi
The working pod, for example, has this top output:
NAME CPU(cores) MEMORY(bytes)
mongo-cluster-rs0-0 63m 1322Mi
Here’s the full describe of the failing pod:
Name: mongo-cluster-rs0-1
Namespace: mongo
Priority: 0
Service Account: default
Node: main/116.202.211.253
Start Time: Tue, 11 Feb 2025 20:28:18 +0000
Labels: app.kubernetes.io/component=mongod
app.kubernetes.io/instance=mongo-cluster
app.kubernetes.io/managed-by=percona-server-mongodb-operator
app.kubernetes.io/name=percona-server-mongodb
app.kubernetes.io/part-of=percona-server-mongodb
app.kubernetes.io/replset=rs0
apps.kubernetes.io/pod-index=1
controller-revision-hash=mongo-cluster-rs0-596d9549fd
statefulset.kubernetes.io/pod-name=mongo-cluster-rs0-1
Annotations: percona.com/ssl-hash: 3f3aad45cbb485dfa31d311d7fdddf19
percona.com/ssl-internal-hash: e16c53f3214fedf85baa45d1a1d0ca4b
Status: Running
IP: 10.42.0.170
IPs:
IP: 10.42.0.170
Controlled By: StatefulSet/mongo-cluster-rs0
Init Containers:
mongo-init:
Container ID: containerd://3fea5ebf2588450623c11ab8a813af30529fe50196e074eeaa9cc43147c269b9
Image: percona/percona-server-mongodb-operator:1.19.0
Image ID: docker.io/percona/percona-server-mongodb-operator@sha256:863f2027ed62e6be6b790647883dfc44620357c47901da92539436c449eff165
Port: <none>
Host Port: <none>
Command:
/init-entrypoint.sh
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 11 Feb 2025 20:28:20 +0000
Finished: Tue, 11 Feb 2025 20:28:20 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 500m
memory: 2G
Requests:
cpu: 500m
memory: 2G
Environment: <none>
Mounts:
/data/db from mongod-data (rw)
/opt/percona from bin (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5zfvs (ro)
Containers:
mongod:
Container ID: containerd://aa5fb17076f70267fadb25cb1ac0b56cc8a6909f2f7b3e72d62d1676f57f6e70
Image: percona/percona-server-mongodb:6.0.13
Image ID: docker.io/percona/percona-server-mongodb@sha256:bea427fee9477742c8c628f55d6a504602d47a0674752caf822bb1990e821b54
Port: 27017/TCP
Host Port: 0/TCP
Command:
/opt/percona/ps-entry.sh
Args:
--bind_ip_all
--auth
--dbpath=/data/db
--port=27017
--replSet=rs0
--storageEngine=wiredTiger
--relaxPermChecks
--sslAllowInvalidCertificates
--clusterAuthMode=x509
--tlsMode=preferTLS
--enableEncryption
--encryptionKeyFile=/etc/mongodb-encryption/encryption-key
--wiredTigerCacheSizeGB=0.43
--wiredTigerIndexPrefixCompression=true
--quiet
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Tue, 18 Feb 2025 15:21:47 +0000
Finished: Tue, 18 Feb 2025 15:22:25 +0000
Ready: False
Restart Count: 354
Limits:
cpu: 500m
memory: 2G
Requests:
cpu: 500m
memory: 2G
Liveness: exec [/opt/percona/mongodb-healthcheck k8s liveness --ssl --sslInsecure --sslCAFile /etc/mongodb-ssl/ca.crt --sslPEMKeyFile /tmp/tls.pem --startupDelaySeconds 7200] delay=60s timeout=10s period=30s #success=1 #failure=4
Readiness: exec [/opt/percona/mongodb-healthcheck k8s readiness --component mongod] delay=10s timeout=2s period=3s #success=1 #failure=8
Environment Variables from:
internal-mongo-cluster-users Secret Optional: false
Environment:
SERVICE_NAME: mongo-cluster
NAMESPACE: mongo
MONGODB_PORT: 27017
MONGODB_REPLSET: rs0
Mounts:
/data/db from mongod-data (rw)
/etc/mongodb-encryption from mongo-cluster-mongodb-encryption-key (ro)
/etc/mongodb-secrets from mongo-cluster-mongodb-keyfile (ro)
/etc/mongodb-ssl from ssl (ro)
/etc/mongodb-ssl-internal from ssl-internal (ro)
/etc/users-secret from users-secret-file (rw)
/opt/percona from bin (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5zfvs (ro)
backup-agent:
Container ID: containerd://7255753fe5c0830c1dd61da3b88dd703dc6d737809e4a354ee979269e8aae1c7
Image: percona/percona-backup-mongodb:2.7.0
Image ID: docker.io/percona/percona-backup-mongodb@sha256:4e29486419f06be69e5ce15490ff46b68cf44958c9ca716fa1eaba17cf32701b
Port: <none>
Host Port: <none>
Command:
/opt/percona/pbm-entry.sh
Args:
pbm-agent-entrypoint
State: Running
Started: Tue, 11 Feb 2025 20:28:22 +0000
Ready: True
Restart Count: 0
Environment:
PBM_AGENT_MONGODB_USERNAME: <set to the key 'MONGODB_BACKUP_USER' in secret 'internal-mongo-cluster-users'> Optional: false
PBM_AGENT_MONGODB_PASSWORD: <set to the key 'MONGODB_BACKUP_PASSWORD' in secret 'internal-mongo-cluster-users'> Optional: false
PBM_MONGODB_REPLSET: rs0
PBM_MONGODB_PORT: 27017
PBM_AGENT_SIDECAR: true
PBM_AGENT_SIDECAR_SLEEP: 5
POD_NAME: mongo-cluster-rs0-1 (v1:metadata.name)
PBM_MONGODB_URI: mongodb://$(PBM_AGENT_MONGODB_USERNAME):$(PBM_AGENT_MONGODB_PASSWORD)@$(POD_NAME)
PBM_AGENT_TLS_ENABLED: true
Mounts:
/data/db from mongod-data (rw)
/etc/mongodb-ssl from ssl (ro)
/opt/percona from bin (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5zfvs (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
mongod-data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: mongod-data-mongo-cluster-rs0-1
ReadOnly: false
mongo-cluster-mongodb-keyfile:
Type: Secret (a volume populated by a Secret)
SecretName: mongo-cluster-mongodb-keyfile
Optional: false
bin:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
mongo-cluster-mongodb-encryption-key:
Type: Secret (a volume populated by a Secret)
SecretName: mongo-cluster-mongodb-encryption-key
Optional: false
ssl:
Type: Secret (a volume populated by a Secret)
SecretName: mongo-cluster-ssl
Optional: false
ssl-internal:
Type: Secret (a volume populated by a Secret)
SecretName: mongo-cluster-ssl-internal
Optional: true
users-secret-file:
Type: Secret (a volume populated by a Secret)
SecretName: internal-mongo-cluster-users
Optional: false
kube-api-access-5zfvs:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulling 29m (x349 over 2d23h) kubelet Pulling image "percona/percona-server-mongodb:6.0.13"
Warning BackOff 4m17s (x8693 over 33h) kubelet Back-off restarting failed container mongod in pod mongo-cluster-rs0-1_mongo(c72dee4a-e500-495c-9f30-d0289280a98f)
@anil.joshi
I've tried a few things, including redeployments, but for some reason, even one month later, the crashes continue every few minutes. I think the counter is up to 6000+ restarts, solely for the one replica, rs0-1.
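For what it's worth, this is how I'm checking the restart count and the last termination reason for the mongod container (the describe above showed OOMKilled / exit code 137):
kubectl -n mongo get pod mongo-cluster-rs0-1 -o jsonpath='{.status.containerStatuses[?(@.name=="mongod")].restartCount}'
kubectl -n mongo get pod mongo-cluster-rs0-1 -o jsonpath='{.status.containerStatuses[?(@.name=="mongod")].lastState.terminated.reason}'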
The container seems to be hitting its resource limits (the last state in your describe shows OOMKilled, exit code 137). Please verify whether the same resources are defined for the other, non-impacted pods as well. You might need to try with increased resources and see if that avoids the issue.
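If you do bump the limits, they are set in the replset section of the PerconaServerMongoDB custom resource; a minimal sketch, assuming the standard cr.yaml layout for operator 1.19 (the new values are just an example to try, not a recommendation):
spec:
  replsets:
    - name: rs0
      resources:
        requests:
          cpu: 500m
          memory: 2G
        limits:
          cpu: "1"
          memory: 4G
The operator appears to derive --wiredTigerCacheSizeGB from the memory limit (your 0.43 lines up with the 2G limit), so the cache size should follow automatically once you apply the change and the statefulset rolls.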