Getting CrashLoopBackOff in rs0 pods when installing to vanilla k8s

Hi! Having got the operator working in minikube and OpenShift, I moved on to vanilla k8s, but unfortunately I can’t get a working db up yet. I followed the tutorial and everything seems to be created correctly, but the pods start failing after a while.
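
For reference, the install steps were roughly the standard ones from the tutorial (typed from my shell history, so the exact file list may be slightly off):

$ kubectl create namespace psmdb
$ kubectl config set-context $(kubectl config current-context) --namespace=psmdb
$ kubectl apply -f deploy/crd.yaml
$ kubectl apply -f deploy/rbac.yaml
$ kubectl apply -f deploy/operator.yaml
$ kubectl apply -f deploy/secrets.yaml
$ kubectl apply -f deploy/cr.yaml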

$ kubectl get pods
NAME                                               READY   STATUS             RESTARTS   AGE
my-cluster-name-rs0-0                              0/1     CrashLoopBackOff   9          34m
my-cluster-name-rs0-1                              1/1     Running            9          34m
my-cluster-name-rs0-2                              1/1     Running            9          34m
percona-server-mongodb-operator-568f85969c-fl8jh   1/1     Running            0          35m

$ kubectl get pods
NAME                                               READY   STATUS             RESTARTS   AGE
my-cluster-name-rs0-0                              0/1     CrashLoopBackOff   9          37m
my-cluster-name-rs0-1                              0/1     CrashLoopBackOff   9          36m
my-cluster-name-rs0-2                              0/1     CrashLoopBackOff   9          36m
percona-server-mongodb-operator-568f85969c-fl8jh   1/1     Running            0          37m

This is on k8s v1.17 as per:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.4", GitCommit:"67d2fcf276fcd9cf743ad4be9a9ef5828adc082f", GitTreeState:"clean", BuildDate:"2019-09-18T14:51:13Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.2", GitCommit:"59603c6e503c87169aea6106f57b9f242f64df89", GitTreeState:"clean", BuildDate:"2020-01-18T23:22:30Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"linux/amd64"}

I am seeing errors in the operator pod:

$ kubectl logs percona-server-mongodb-operator-568f85969c-fl8jh
{"level":"info","ts":1587043326.966063,"logger":"cmd","msg":"Git commit: 44e3cb883501c2adb1614df762317911d7bb16eb Git branch: master"}
{"level":"info","ts":1587043326.9661248,"logger":"cmd","msg":"Go Version: go1.12.17"}
{"level":"info","ts":1587043326.9661362,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1587043326.966145,"logger":"cmd","msg":"operator-sdk Version: v0.3.0"}
{"level":"info","ts":1587043326.966367,"logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":1587043327.1066792,"logger":"cmd","msg":"Registering Components."}
{"level":"info","ts":1587043327.112258,"logger":"controller_psmdb","msg":"server version","platform":"kubernetes","version":"v1.17.2"}
{"level":"info","ts":1587043327.1129541,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"psmdb-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1587043327.1132038,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"perconaservermongodbbackup-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1587043327.1134188,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"perconaservermongodbbackup-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1587043327.1136668,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"perconaservermongodbrestore-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1587043327.1138349,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"perconaservermongodbrestore-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1587043327.1138775,"logger":"cmd","msg":"Starting the Cmd."}
{"level":"info","ts":1587043327.214392,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"perconaservermongodbrestore-controller"}
{"level":"info","ts":1587043327.2144375,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"perconaservermongodbbackup-controller"}
{"level":"info","ts":1587043327.2143924,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"psmdb-controller"}
{"level":"info","ts":1587043327.3160355,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"perconaservermongodbrestore-controller","worker count":1}
{"level":"info","ts":1587043327.3161073,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"perconaservermongodbbackup-controller","worker count":1}
{"level":"info","ts":1587043327.316146,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"psmdb-controller","worker count":1}
{"level":"info","ts":1587043352.4551826,"logger":"controller_psmdb","msg":"Created a new mongo key","Request.Namespace":"psmdb","Request.Name":"my-cluster-name","KeyName":"my-cluster-name-mongodb-keyfile"}
{"level":"info","ts":1587043352.4619968,"logger":"controller_psmdb","msg":"Created a new mongo key","Request.Namespace":"psmdb","Request.Name":"my-cluster-name","KeyName":"my-cluster-name-mongodb-encryption-key"}
{"level":"error","ts":1587043352.7108507,"logger":"controller_psmdb",
"msg":"failed to reconcile cluster",
"Request.Namespace":"psmdb",
"Request.Name":"my-cluster-name",
"replset":"rs0",
"error":"handleReplsetInit:: no mongod containers in running state",
"errorVerbose":"no mongod containers in running state ...}
{"level":"error","ts":1587043352.8449605,"logger":"kubebuilder.controller",
"msg":"Reconciler error","controller":"psmdb-controller",
"request":"psmdb/my-cluster-name",
"error":"reconcile StatefulSet for rs0: update StatefulSet my-cluster-name-rs0: StatefulSet.apps \"my-cluster-name-rs0\" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden",
...}
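
In case anyone wants to dig into the forbidden-update error above, this is roughly how I've been inspecting the live StatefulSet and the operator's events (namespace assumed to be psmdb; output trimmed):

$ kubectl -n psmdb get statefulset my-cluster-name-rs0 -o yaml > sts-live.yaml
$ kubectl -n psmdb get events --sort-by=.lastTimestamp | grep -i statefulset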

Connecting to mongo while the pods are up does work, but only without credentials. Using userAdmin/userAdmin123456 results in “Authentication Denied”. The secrets set from deploy/secrets.yaml are available in the mongo pods as env vars, so it looks like they are being picked up. I wondered if the mongodb-healthcheck wasn’t connecting because the mongo user credentials weren’t being set, and whether that was causing the pods to fail?
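
To rule out the secrets, this is roughly how I've been checking the credentials from inside a running mongod pod (env var names taken from deploy/secrets.yaml, adjust if yours differ):

$ kubectl -n psmdb exec -it my-cluster-name-rs0-1 -- bash
$ env | grep MONGODB_USER_ADMIN
$ mongo "mongodb://${MONGODB_USER_ADMIN_USER}:${MONGODB_USER_ADMIN_PASSWORD}@localhost:27017/admin?ssl=false"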

Some more diagnostics:

$ kubectl describe pod/my-cluster-name-rs0-0
Name:           my-cluster-name-rs0-0
Namespace:      psmdb
Priority:       0
Node:           jkh-test-k8s-worker-2.fyre.ibm.com/10.51.4.169
Start Time:     Thu, 16 Apr 2020 15:20:10 +0100
Labels:         app.kubernetes.io/component=mongod
                app.kubernetes.io/instance=my-cluster-name
                app.kubernetes.io/managed-by=percona-server-mongodb-operator
                app.kubernetes.io/name=percona-server-mongodb
                app.kubernetes.io/part-of=percona-server-mongodb
                app.kubernetes.io/replset=rs0
                controller-revision-hash=my-cluster-name-rs0-78fddd4ffd
                statefulset.kubernetes.io/pod-name=my-cluster-name-rs0-0
Annotations:    percona.com/ssl-hash:
                percona.com/ssl-internal-hash:
Status:         Running
IP:             10.36.0.4
Controlled By:  StatefulSet/my-cluster-name-rs0
Containers:
  mongod:
    Container ID:  docker://4d1de9a2666e35bd547dad1a6c922874b0f7256309f3f13a59a647585d956848
    Image:         percona/percona-server-mongodb-operator:1.4.0-mongod4.2
    Image ID:      docker-pullable://percona/percona-server-mongodb-operator@sha256:d79a68524efb48d06e79e84b50870d1673cdfecc92b043d811e3a76cb0ae05ab
    Port:          27017/TCP
    Host Port:     0/TCP
    Args:
      --bind_ip_all
      --auth
      --dbpath=/data/db
      --port=27017
      --replSet=rs0
      --storageEngine=wiredTiger
      --relaxPermChecks
      --sslAllowInvalidCertificates
      --clusterAuthMode=keyFile
      --keyFile=/etc/mongodb-secrets/mongodb-key
      --slowms=100
      --profile=1
      --rateLimit=100
      --enableEncryption
      --encryptionKeyFile=/etc/mongodb-encryption/encryption-key
      --encryptionCipherMode=AES256-CBC
      --wiredTigerCacheSizeGB=0.25
      --wiredTigerCollectionBlockCompressor=snappy
      --wiredTigerJournalCompressor=snappy
      --wiredTigerIndexPrefixCompression=true
      --setParameter
      ttlMonitorSleepSecs=60
      --setParameter
      wiredTigerConcurrentReadTransactions=128
      --setParameter
      wiredTigerConcurrentWriteTransactions=128
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 17 Apr 2020 00:39:56 +0100
      Finished:     Fri, 17 Apr 2020 00:42:54 +0100
    Ready:          False
    Restart Count:  105
    Limits:
      cpu:     300m
      memory:  500M
    Requests:
      cpu:     300m
      memory:  500M
    Liveness:   exec [mongodb-healthcheck k8s liveness --startupDelaySeconds 7200] delay=60s timeout=5s period=30s #success=1 #failure=4
    Readiness:  tcp-socket :27017 delay=10s timeout=2s period=3s #success=1 #failure=8
    Environment Variables from:
      my-cluster-name-secrets  Secret  Optional: false
    Environment:
      SERVICE_NAME:     my-cluster-name
      NAMESPACE:        psmdb
      MONGODB_PORT:     27017
      MONGODB_REPLSET:  rs0
    Mounts:
      /data/db from mongod-data (rw)
      /etc/mongodb-encryption from my-cluster-name-mongodb-encryption-key (ro)
      /etc/mongodb-secrets from my-cluster-name-mongodb-keyfile (ro)
      /etc/mongodb-ssl from ssl (ro)
      /etc/mongodb-ssl-internal from ssl-internal (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-m9pt6 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  mongod-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  mongod-data-my-cluster-name-rs0-0
    ReadOnly:   false
  my-cluster-name-mongodb-keyfile:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  my-cluster-name-mongodb-keyfile
    Optional:    false
  my-cluster-name-mongodb-encryption-key:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  my-cluster-name-mongodb-encryption-key
    Optional:    false
  ssl:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  my-cluster-name-ssl
    Optional:    true
  ssl-internal:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  my-cluster-name-ssl-internal
    Optional:    true
  default-token-m9pt6:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-m9pt6
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                    From                                         Message
  ----     ------     ----                   ----                                         -------
  Warning  Unhealthy  6m29s (x412 over 9h)   kubelet, jkh-test-k8s-worker-2.fyre.ibm.com  (combined from similar events): Liveness probe failed: 2020-04-16 23:41:22.538 main.go:74 INFO ssl connection error: no reachable servers
                                                                                          2020-04-16 23:41:22.539 main.go:81 FATAL Error connecting to mongodb: no reachable servers
  Warning  BackOff    2m34s (x1212 over 9h)  kubelet, jkh-test-k8s-worker-2.fyre.ibm.com  Back-off restarting failed container
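
Since the mongod container's last state is Terminated with exit code 0, it looks like the kubelet is restarting it after the liveness probe failures rather than mongod crashing on its own. The previous container's output can be pulled with something like:

$ kubectl -n psmdb logs my-cluster-name-rs0-0 -c mongod --previous
$ kubectl -n psmdb logs my-cluster-name-rs0-0 -c mongod --previous | grep -i -E 'auth|repl|error'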

Hi
Could you please provide us with the CR in YAML format? We need it for testing.

Hi Ivan, attached. I basically turned off backup, commented out the resource requests, and allowed unsafe configs.

cr.yaml.txt (4.82 KB)
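
For anyone following along without the attachment, the changes amount to roughly this in cr.yaml (a trimmed approximation, not the exact attached file):

apiVersion: psmdb.percona.com/v1-4-0
kind: PerconaServerMongoDB
metadata:
  name: my-cluster-name
spec:
  allowUnsafeConfigurations: true
  replsets:
  - name: rs0
    size: 3
    # resources:
    #   requests:
    #     cpu: 300m
    #     memory: 0.5G
  backup:
    enabled: false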

Also, I tested with k8s v1.15.11 and it works fine, so this is potentially an issue with v1.17?

Hi James
We’ve tried to reproduce the case on GKE 1.17 and found unexpected operator container behavior; however, minikube 1.17 works OK.
At the moment 1.17 is not on the supported platforms list. Please use one of the supported platforms if possible, since 1.17 is not stable yet.

I get the same behaviour when trying the operator on DigitalOcean Kubernetes (DOKS) v1.18. Strangely, it all works fine until I try bad credentials; then the pods drop off and stay in CrashLoopBackOff for some time (5-10 mins), after which the replica set is rebuilt and logging in with good credentials works fine again.