Primary replica set pod constantly restarts

Hi everyone, I am having an issue with a Percona Server for MongoDB replica set deployed by the Percona Server for MongoDB Operator. The provisioning completed, but it looks like the primary pod constantly restarts.

kubectl get po                    
NAME                                               READY   STATUS             RESTARTS   AGE
my-cluster-name-rs0-0                              0/1     CrashLoopBackOff   13         57m
my-cluster-name-rs0-1                              1/1     Running            0          56m
my-cluster-name-rs0-2                              1/1     Running            0          56m
percona-server-mongodb-operator-7ff49bcf85-vvv7l   1/1     Running            0          59m

The pod’s logs are full of this:

{"t":{"$date":"2021-07-09T14:49:35.801+00:00"},"s":"I",  "c":"REPL_HB",  "id":23974,   "ctx":"ReplCoord-27","msg":"Heartbeat failed after max retries","attr":{"target":"x.x.x.x:27017","maxHeartbeatRetries":2,"error":{"code":93,"codeName":"InvalidReplicaSetConfig","errmsg":"replica set IDs do not match, ours: 60e85504878f18f4cc284693; remote node's: 60e855651d196b68b3d5a990"}}}

I also see some failed liveness probes on my-cluster-name-rs0-0:

Warning  Unhealthy  59m  kubelet  Liveness probe failed: {"level":"info","msg":"Running Kubernetes liveness check for mongod","time":"2021-07-09T13:55:05Z"}
{"level":"error","msg":"replSetGetStatus returned error command replSetGetStatus requires authentication","time":"2021-07-09T13:55:05Z"}

I followed the standard deployment described in Install Percona Server for MongoDB on Kubernetes and ran it on a GKE cluster. Hope someone can help me out, thanks.
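For completeness, the steps I ran were essentially the ones from that page, roughly this (paths are relative to the operator repository checkout, so treat this as an outline rather than my exact history):

# apply the operator bundle (CRDs, RBAC, operator deployment)
git clone https://github.com/percona/percona-server-mongodb-operator.git
cd percona-server-mongodb-operator
kubectl apply -f deploy/bundle.yaml

# then create the cluster from the custom resource
kubectl apply -f deploy/cr.yaml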


Hello @vhphan ,

I tried to reproduce the issue by setting up a GKE cluster and following this manual, but I could not reproduce it: all pods are healthy and no probes are failing.

Is there anything specific about your cluster or DB custom resource?


Hi, sorry, I forgot to mention that there are some custom changes compared to the original CR:

  • Sharding disabled
  • Replica set exposed via LoadBalancer
  • Node anti-affinity disabled
spec:
  allowUnsafeConfigurations: true
  replsets:
    - name: rs0
      affinity:
        antiAffinityTopologyKey: none
      expose:
        enabled: true
        exposeType: LoadBalancer
  sharding:
    enabled: false

Tried with these changes - it still works for me. What node instance type do you have in GKE?
I suspect it might be something to do with resources.
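If you get a chance, could you also check what the nodes actually report as allocatable and in use? Something along these lines:

# quick look at node usage and allocatable capacity
kubectl top nodes
kubectl describe nodes | grep -A 6 Allocatable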


It’s e2-standard-4 with COS. But I tried to redeploy from scratch several times, and the issue only appears once in a while. Quite rare, actually.

Now I’m thinking it might be a network problem related to my VPC settings. The cluster is in a private corporate network, but the LB is public. From what I see, if I expose the replica set, the members’ addresses will be the LBs’ addresses, right? The operator hasn’t implemented split-horizon DNS yet, so communication between replica set members will go outside of the cluster network?
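I’ll try to confirm that by looking at the per-pod services the operator created and at the hosts the members advertise, roughly like this (just a sketch; the user and secret names are from the default manifests):

# per-pod LoadBalancer services created when expose is enabled
kubectl get svc | grep my-cluster-name-rs0

# host:port each member advertises in the replica set config
kubectl exec my-cluster-name-rs0-1 -- mongo -u clusterAdmin -p "$CLUSTER_ADMIN_PASSWORD" \
  --authenticationDatabase admin --quiet --eval 'rs.conf().members.forEach(function(m){ print(m.host) })'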
