CrashLoopBackOff when installing psmdb-operator v1.15.0

Description:

Hi all!

I was trying to deploy the Percona Operator for MongoDB using the Helm chart, following the install instructions.

Unfortunately I am not able to keep the operator running for long. A few minutes after deployment the operator throws an error and restarts.

Steps to Reproduce:

The command I used for the installation was:

helm install mongodb-psmdb-operator percona/psmdb-operator -n mongodb-temp --version 1.15.0 -f percona.yml 

with the custom contents of percona.yml being:

logLevel: DEBUG
resources:
  limits:
    cpu: 1
    memory: 1Gi
  requests:
    cpu: 1
    memory: 1Gi
disableTelemetry: true

The remaining values are the defaults for this version of the chart (1.15.0).
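
For completeness, the values the release actually picked up can be confirmed with a standard Helm command (the release name and namespace are the ones from the install command above):

helm get values mongodb-psmdb-operator -n mongodb-temp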

Version:

v1.15.0
Deployed into GKE 1.25

Logs:

Operator logs up to the crash:

2023-10-24T19:01:39.873Z        INFO    setup   Manager starting up     {"gitCommit": "ed2d8b4907c39beadfb020ce1cb555fee0ac682d", "gitBranch": "release-1-15-0", "goVersion": "go1.20.9", "os": "linux", "arch": "amd64"}
I1024 19:01:40.924853       1 request.go:697] Waited for 1.034285823s due to client-side throttling, not priority and fairness, request: GET:https://10.145.32.1:443/apis/constraints.gatekeeper.sh/v1alpha1?timeout=32s
2023-10-24T19:01:43.435Z        INFO    server version  {"platform": "kubernetes", "version": "v1.25.12-gke.500"}
2023-10-24T19:01:43.445Z        INFO    starting server {"kind": "health probe", "addr": "[::]:8081"}
2023-10-24T19:01:43.445Z        INFO    controller-runtime.metrics      Starting metrics server
I1024 19:01:43.445368       1 leaderelection.go:250] attempting to acquire leader lease mongodb-temp/08db0feb.percona.com...
2023-10-24T19:01:43.445Z        INFO    controller-runtime.metrics      Serving metrics server  {"bindAddress": ":8080", "secure": false}
I1024 19:02:01.263394       1 leaderelection.go:260] successfully acquired lease mongodb-temp/08db0feb.percona.com
2023-10-24T19:02:01.263Z        DEBUG   events  mongodb-psmdb-operator-84849858b8-mp2h9_ba6e5c4f-850a-4dac-a73b-4062eaf3861d became leader      {"type": "Normal", "object": {"kind":"Lease","namespace":"mongodb-temp","name":"08db0feb.percona.com","uid":"855ef146-fd90-41b9-8c86-cd0b82697425","apiVersion":"coordination.k8s.io/v1","resourceVersion":"2108411632"}, "reason": "LeaderElection"}
2023-10-24T19:02:01.263Z        INFO    Starting EventSource    {"controller": "psmdb-controller", "source": "kind source: *v1.PerconaServerMongoDB"}
2023-10-24T19:02:01.263Z        INFO    Starting Controller     {"controller": "psmdb-controller"}
2023-10-24T19:02:01.263Z        INFO    Starting EventSource    {"controller": "psmdbrestore-controller", "source": "kind source: *v1.PerconaServerMongoDBRestore"}
2023-10-24T19:02:01.263Z        INFO    Starting EventSource    {"controller": "psmdbbackup-controller", "source": "kind source: *v1.PerconaServerMongoDBBackup"}
2023-10-24T19:02:01.263Z        INFO    Starting EventSource    {"controller": "psmdbbackup-controller", "source": "kind source: *v1.Pod"}
2023-10-24T19:02:01.263Z        INFO    Starting EventSource    {"controller": "psmdbrestore-controller", "source": "kind source: *v1.Pod"}
2023-10-24T19:02:01.263Z        INFO    Starting Controller     {"controller": "psmdbbackup-controller"}
2023-10-24T19:02:01.263Z        INFO    Starting Controller     {"controller": "psmdbrestore-controller"}
2023-10-24T19:02:01.397Z        INFO    Starting workers        {"controller": "psmdb-controller", "worker count": 1}
2023-10-24T19:02:01.401Z        INFO    Starting workers        {"controller": "psmdbrestore-controller", "worker count": 1}
2023-10-24T19:02:01.401Z        INFO    Starting workers        {"controller": "psmdbbackup-controller", "worker count": 1}
E1024 19:03:27.804464       1 leaderelection.go:369] Failed to update lock: Put "https://10.145.32.1:443/apis/coordination.k8s.io/v1/namespaces/mongodb-temp/leases/08db0feb.percona.com": context deadline exceeded
I1024 19:03:27.804527       1 leaderelection.go:285] failed to renew lease mongodb-temp/08db0feb.percona.com: timed out waiting for the condition
2023-10-24T19:03:27.804Z        DEBUG   events  mongodb-psmdb-operator-84849858b8-mp2h9_ba6e5c4f-850a-4dac-a73b-4062eaf3861d stopped leading    {"type": "Normal", "object": {"kind":"Lease","namespace":"mongodb-temp","name":"08db0feb.percona.com","uid":"855ef146-fd90-41b9-8c86-cd0b82697425","apiVersion":"coordination.k8s.io/v1","resourceVersion":"2108413524"}, "reason": "LeaderElection"}
2023-10-24T19:03:27.804Z        INFO    Stopping and waiting for non leader election runnables
2023-10-24T19:03:27.806Z        INFO    Stopping and waiting for leader election runnables
2023-10-24T19:03:27.806Z        INFO    Stopping and waiting for caches
2023-10-24T19:03:27.804Z        ERROR   setup   problem running manager {"error": "leader election lost"}
main.main
        /go/src/github.com/percona/percona-server-mongodb-operator/cmd/manager/main.go:161
runtime.main
        /usr/local/go/src/runtime/proc.go:250

Expected Result:

The operator pod should not go into CrashLoopBackOff.

Do you have any idea what could be the cause of this, or how I can fix it?

Hi, the operator requires a minimum of 2 GB of RAM and 2 CPU threads per node. See this for more info: System requirements - Percona Operator for MongoDB
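
A quick way to check what each node actually has allocatable (any equivalent of kubectl describe node works too):

kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory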

Hi Ivan, thanks for the quick feedback.

Following your suggestion I increased the resources allocated to the operator from 1 to 4 CPUs and from 1Gi to 4Gi of memory and redeployed the psmdb-operator, but the same situation keeps happening:

E1026 06:49:40.838717       1 leaderelection.go:369] Failed to update lock: Put "https://10.145.32.1:443/apis/coordination.k8s.io/v1/namespaces/mongodb-temp/leases/08db0feb.percona.com": context deadline exceeded
I1026 06:49:40.838758       1 leaderelection.go:285] failed to renew lease mongodb-temp/08db0feb.percona.com: timed out waiting for the condition
2023-10-26T06:49:40.838Z        ERROR   setup   problem running manager {"error": "leader election lost"}
main.main
        /go/src/github.com/percona/percona-server-mongodb-operator/cmd/manager/main.go:161
runtime.main
        /usr/local/go/src/runtime/proc.go:250

I do not know the internals of the operator, but this looks like some kind of timeout when communicating with the GKE API. Is there any way to tweak these timeout values?
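
As a very rough check from my side, I can time the same lease read that the operator fails to update (the lease name is taken from the logs above; this only measures client-side latency from my workstation, not exactly what the operator sees in-cluster):

$ time kubectl get lease 08db0feb.percona.com -n mongodb-temp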

@psamagal I deployed GKE 1.25 and used the same values YAML that you provided. Operator started just fine.

What is the size of the nodes in your kubernetes cluster?
Were there any other parameters tuned in GKE?

At the moment it is running on nodes with the following characteristics:

Image type: Container-optimised OS with containerd (cos_containerd)
Machine type: n1-standard-8
Boot disk type: Standard persistent disk
Boot disk size: 100 GB
Provisioning model: Spot

No other parameters were tuned on the operator deployment, and the nodes are likewise pretty much in their default configuration.
Is there anything in particular that I should look for?

@psamagal please show the following:

  1. kubectl get pods -n mongodb-temp
  2. kubectl describe pod <OPERATOR_POD> -n mongodb-temp

Here is the output of get pods. Currently the operator is the only thing running in this namespace:

$ kubectl get pods -n mongodb-temp
NAME                                     READY   STATUS             RESTARTS      AGE
mongodb-psmdb-operator-75d5bbbd4-fcqss   0/1     CrashLoopBackOff   5 (28s ago)   16m

And for the describe pod:

$ kubectl describe pod mongodb-psmdb-operator-75d5bbbd4-fcqss -n mongodb-temp
Name:         mongodb-psmdb-operator-75d5bbbd4-fcqss
Namespace:    mongodb-temp
Priority:     0
Node:         gke-<redacted>--pool-a-n1-std-8-f8aaf4b2-njm7/10.145.8.3
Start Time:   Thu, 26 Oct 2023 14:32:31 +0200
Labels:       app.kubernetes.io/instance=mongodb-psmdb-operator
              app.kubernetes.io/name=psmdb-operator
              pod-template-hash=75d5bbbd4
Annotations:  <none>
Status:       Running
IP:           10.145.20.130
IPs:
  IP:           10.145.20.130
Controlled By:  ReplicaSet/mongodb-psmdb-operator-75d5bbbd4
Containers:
  psmdb-operator:
    Container ID:  containerd://2be8442c255ae11adc4f67583b3d04012f4b6da56fa1f839ccd301e267a03f14
    Image:         percona/percona-server-mongodb-operator:1.15.0
    Image ID:      docker.io/percona/percona-server-mongodb-operator@sha256:d8a5b33db1938d42769cb5a87d34a128332a2d0302eaa6d7c860e7c4667ea3b6
    Port:          60000/TCP
    Host Port:     0/TCP
    Command:
      percona-server-mongodb-operator
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 26 Oct 2023 14:46:58 +0200
      Finished:     Thu, 26 Oct 2023 14:47:43 +0200
    Ready:          False
    Restart Count:  5
    Limits:
      cpu:     4
      memory:  4Gi
    Requests:
      cpu:     4
      memory:  4Gi
    Environment:
      LOG_STRUCTURED:     false
      LOG_LEVEL:          DEBUG
      WATCH_NAMESPACE:    mongodb-temp
      POD_NAME:           mongodb-psmdb-operator-75d5bbbd4-fcqss (v1:metadata.name)
      OPERATOR_NAME:      percona-server-mongodb-operator
      RESYNC_PERIOD:      5s
      DISABLE_TELEMETRY:  true
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-w94fn (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-w94fn:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From                Message
  ----     ------            ----                 ----                -------
  Normal   Scheduled         15m                  default-scheduler   Successfully assigned mongodb-temp/mongodb-psmdb-operator-75d5bbbd4-fcqss to gke-<redacted>--pool-a-n1-std-8-f8aaf4b2-njm7
  Normal   Pulling           15m                  kubelet             Pulling image "percona/percona-server-mongodb-operator:1.15.0"
  Normal   Pulled            15m                  kubelet             Successfully pulled image "percona/percona-server-mongodb-operator:1.15.0" in 4.916799598s (13.610398918s including waiting)
  Normal   Created           3m51s (x5 over 15m)  kubelet             Created container psmdb-operator
  Normal   Started           3m51s (x5 over 15m)  kubelet             Started container psmdb-operator
  Normal   Pulled            3m51s (x4 over 14m)  kubelet             Container image "percona/percona-server-mongodb-operator:1.15.0" already present on machine
  Warning  BackOff           34s (x13 over 12m)   kubelet             Back-off restarting failed container
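
If it is useful, the log of the previously crashed container and the recent namespace events can also be pulled with:

$ kubectl logs mongodb-psmdb-operator-75d5bbbd4-fcqss -n mongodb-temp --previous
$ kubectl get events -n mongodb-temp --sort-by=.lastTimestamp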