MongoDB cluster cannot fail over after all pods go down when using External mode (NodePort and LB)

Description:

The MongoDB cluster cannot fail over after all of its pods go down at once.

Steps to Reproduce:

Run kubectl delete pod (all pods in the replica set) --force -n namespace, or shut down all nodes in the K8s cluster.
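For anyone trying to reproduce this, the first step could look roughly like the sketch below. The namespace, cluster name, replica set name, and the three pod ordinals are taken from the operator log in this report. The `pod_names` helper and the `DRY_RUN` variable are hypothetical and exist only for illustration. By default the script only prints the command.

```shell
#!/bin/sh
# Values taken from the operator log in this report; adjust for your cluster.
NAMESPACE="tungdt"
CLUSTER="mongo-psmdb-db"
REPLSET="rs0"
REPLICAS=3

# Hypothetical helper: builds the pod names <cluster>-<replset>-<ordinal>,
# matching the host names that appear in the log.
pod_names() {
  i=0
  names=""
  while [ "$i" -lt "$REPLICAS" ]; do
    names="$names ${CLUSTER}-${REPLSET}-$i"
    i=$((i + 1))
  done
  echo "$names" | sed 's/^ //'
}

# Prints the command by default. Run with DRY_RUN= (empty) to execute it,
# which force-deletes every pod of the replica set at the same time.
DRY_RUN="${DRY_RUN-echo}"
$DRY_RUN kubectl delete pod $(pod_names) --force -n "$NAMESPACE"
```

Printing first makes it easy to confirm the pod list before taking the whole replica set down.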

Version:

Percona Operator for MongoDB 1.15.0

Logs:

Logs in Operator:
2024-04-25T08:48:42.241Z ERROR failed to reconcile cluster {“controller”: “psmdb-controller”, “object”: {“name”:“mongo-psmdb-db”,“namespace”:“tungdt”}, “namespace”: “tungdt”, “name”: “mongo-psmdb-db”, “reconcileID”: “7614bf82-1bf1-43e0-b4ab-f07b6c5a358c”, “replset”: “rs0”, “error”: “dial: ping mongo: server selection error: context deadline exceeded, current topology: { Type: ReplicaSetNoPrimary, Servers: [{ Addr: mongo-psmdb-db-rs0-0.mongo-psmdb-db-rs0.tungdt.svc.cluster.local:27017, Type: Unknown, Last error: dial tcp: lookup mongo-psmdb-db-rs0-0.mongo-psmdb-db-rs0.tungdt.svc.cluster.local on 10.43.0.10:53: no such host }, { Addr: mongo-psmdb-db-rs0-1.mongo-psmdb-db-rs0.tungdt.svc.cluster.local:27017, Type: Unknown, Last error: dial tcp: lookup mongo-psmdb-db-rs0-1.mongo-psmdb-db-rs0.tungdt.svc.cluster.local on 10.43.0.10:53: no such host }, { Addr: mongo-psmdb-db-rs0-2.mongo-psmdb-db-rs0.tungdt.svc.cluster.local:27017, Type: Unknown, Last error: dial tcp: lookup mongo-psmdb-db-rs0-2.mongo-psmdb-db-rs0.tungdt.svc.cluster.local on 10.43.0.10:53: no such host }, ] }”, “errorVerbose”: “server selection error: context deadline exceeded, current topology: { Type: ReplicaSetNoPrimary, Servers: [{ Addr: mongo-psmdb-db-rs0-0.mongo-psmdb-db-rs0.tungdt.svc.cluster.local:27017, Type: Unknown, Last error: dial tcp: lookup mongo-psmdb-db-rs0-0.mongo-psmdb-db-rs0.tungdt.svc.cluster.local on 10.43.0.10:53: no such host }, { Addr: mongo-psmdb-db-rs0-1.mongo-psmdb-db-rs0.tungdt.svc.cluster.local:27017, Type: Unknown, Last error: dial tcp: lookup mongo-psmdb-db-rs0-1.mongo-psmdb-db-rs0.tungdt.svc.cluster.local on 10.43.0.10:53: no such host }, { Addr: mongo-psmdb-db-rs0-2.mongo-psmdb-db-rs0.tungdt.svc.cluster.local:27017, Type: Unknown, Last error: dial tcp: lookup mongo-psmdb-db-rs0-2.mongo-psmdb-db-rs0.tungdt.svc.cluster.local on 10.43.0.10:53: no such host }, ] }\nping 
mongo\ngithub.com/percona/percona-server-mongodb-operator/pkg/psmdb/mongo.Dial\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/psmdb/mongo/mongo.go:112\ngithub.com/percona/percona-server-mongodb-operator/pkg/psmdb.MongoClient\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/psmdb/client.go:62\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*mongoClientProvider).Mongo\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/connections.go:38\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).mongoClientWithRole\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/connections.go:60\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).reconcileCluster\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/mgo.go:87\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).Reconcile\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/psmdb_controller.go:498\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.1/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.1/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.1/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.1
/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598\ndial\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).reconcileCluster\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/mgo.go:93\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).Reconcile\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/psmdb_controller.go:498\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.1/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.1/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.1/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.1/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598”} github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).Reconcile
/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/psmdb_controller.go:500

Expected Result:

The cluster returns to normal operation and rs.status() displays information about the ready state of the cluster

Actual Result:

The cluster enters the RS Ghost state and becomes inoperable

Additional Information:

Can anyone help me fix this issue?

Hello @Trinh.Duc.Chung,

I reproduced it. Will discuss with the team internally and get back to you.

UPD: I believe you also submitted this PR: "K8SPSMDB-1074 fixed MongoDB Cluster cannot failover when down time all pods using mode External (NodePort and LB)" by chungtd203338 (Pull Request #1535, percona/percona-server-mongodb-operator on GitHub).

We will have a look.

Also a quick question - is there any reason why you need to use External mode vs using splitHorizons?

Hi Mr. @Sergey_Pronin!
Thank you for your response regarding my issue!

I did create that pull request. I found out how to resolve this error by consulting the official MongoDB documentation. You can view it here: https://www.mongodb.com/docs/kubernetes-operator/master/tutorial/connect-from-outside-k8s/.

As for why I need External mode instead of splitHorizons: some of our infrastructure does not have a load balancer. In addition, with splitHorizons the operator fixes the port at 27017, and we have cases where we need to deploy in NodePort mode or use a LoadBalancer service on a port other than 27017.

I look forward to your response on my pull request. We need an operator release that fixes this error as soon as possible. Thanks!