Hello,
I have run into this issue a couple of times now.
Environment
- Kubernetes 1.21
- MongoDB 3.6
- Percona Server for MongoDB Operator 1.9.0
Problem
The MongoDB cluster / replica set runs fine as long as nothing happens to the pods after the initial start. But as soon as the mongod processes get restarted, for example when I scale my Kubernetes nodes, the cluster does not recover: all 3 pods keep logging "got signal 15 (Terminated)", restart, and end up hanging in CrashLoopBackOff.
I really can’t figure out why this is happening. Maybe someone has an idea?
If you need more information, logs, or configuration, let me know and I will share it.
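For reference, this is roughly how I am collecting the state and logs below. The pod and deployment names are just how things are called in my setup (the operator names the pods <cluster>-<replset>-<ordinal>, so mongodb-base-instance-0 and so on); adjust as needed:

# pod status in the cluster namespace
kubectl -n my-namespace get pods

# mongod log of the previous (terminated) container instance
kubectl -n my-namespace logs mongodb-base-instance-0 -c mongod --previous

# operator log
kubectl -n my-namespace logs deployment/percona-server-mongodb-operator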
Logs
MongoDB logs
Shutdown log message:
2021-08-20T08:43:33.274+0000 I CONTROL [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
2021-08-20T08:43:33.275+0000 I NETWORK [signalProcessingThread] shutdown: going to close listening sockets...
2021-08-20T08:43:33.275+0000 I NETWORK [signalProcessingThread] removing socket file: /tmp/mongodb-27017.sock
2021-08-20T08:43:33.275+0000 I REPL [signalProcessingThread] shutting down replication subsystems
2021-08-20T08:43:33.275+0000 I REPL [signalProcessingThread] Stopping replication reporter thread
2021-08-20T08:43:33.275+0000 I REPL [signalProcessingThread] Stopping replication fetcher thread
2021-08-20T08:43:33.275+0000 I REPL [signalProcessingThread] Stopping replication applier thread
2021-08-20T08:43:33.450+0000 I REPL [signalProcessingThread] Stopping replication storage threads
2021-08-20T08:43:33.452+0000 I FTDC [signalProcessingThread] Shutting down full-time diagnostic data capture
2021-08-20T08:43:33.456+0000 I STORAGE [WTOplogJournalThread] oplog journal thread loop shutting down
2021-08-20T08:43:33.456+0000 I STORAGE [signalProcessingThread] WiredTigerKVEngine shutting down
2021-08-20T08:43:34.004+0000 I STORAGE [signalProcessingThread] shutdown: removing fs lock...
2021-08-20T08:43:34.004+0000 I CONTROL [signalProcessingThread] now exiting
2021-08-20T08:43:34.004+0000 I CONTROL [signalProcessingThread] shutting down with code:0
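Note that mongod itself shuts down cleanly here: it receives signal 15, stops replication and WiredTiger in order, and exits with code 0. So as far as I can tell mongod is not crashing on its own; something keeps sending it SIGTERM (my guess would be the kubelet restarting the container, e.g. after failing probes, but that is only a guess).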
Operator logs
{"level":"error","ts":1629449899.757564,"logger":"controller_psmdb","msg":"failed to reconcile cluster","Request.Namespace":"my-namespace","Request.Name":"mongodb-base","replset":"instance","error":"dial:: failed to ping mongo: context deadline exceeded","errorVerbose":"failed to ping mongo: context deadline exceeded\ngithub.com/percona/percona-server-mongodb-operator/pkg/psmdb/mongo.Dial\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/psmdb/mongo/mongo.go:61\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).mongoClient\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/connections.go:59\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).mongoClientWithRole\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/connections.go:27\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).reconcileCluster\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/mgo.go:27\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).Reconcile\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/psmdb_controller.go:428\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371\ndial:\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).reconcileCluster\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/mgo.go:31\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).Reconcile\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/psmdb_controller.go:428\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/githu
b.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).Reconcile\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/psmdb_controller.go:430\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
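The operator apparently cannot even ping mongod ("failed to ping mongo: context deadline exceeded"). To rule out basic connectivity, a ping can be issued from inside one of the pods, roughly like this (only a sketch; pod/container names as above, and credentials have to be added if authentication is enabled):

kubectl -n my-namespace exec -it mongodb-base-instance-0 -c mongod -- \
  mongo --eval 'db.adminCommand({ ping: 1 })'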
Edit:
I dug around a bit more and managed to catch an instance before it crashed. This is the rs.status() output of the crashing member:
instance:OTHER> rs.status()
{
	"state" : 10,
	"stateStr" : "REMOVED",
	"uptime" : 132,
	"optime" : {
		"ts" : Timestamp(1629443538, 3),
		"t" : NumberLong(2)
	},
	"optimeDate" : ISODate("2021-08-20T07:12:18Z"),
	"lastHeartbeatMessage" : "",
	"syncingTo" : "",
	"syncSourceHost" : "",
	"syncSourceId" : -1,
	"infoMessage" : "",
	"ok" : 0,
	"errmsg" : "Our replica set config is invalid or we are not a member of it",
	"code" : 93,
	"codeName" : "InvalidReplicaSetConfig",
	"operationTime" : Timestamp(1629443538, 3),
	"$clusterTime" : {
		"clusterTime" : Timestamp(1629444174, 4),
		"signature" : {
			"hash" : BinData(0,"qDg4XbH1EG68xNyKNof6OKYyQP4="),
			"keyId" : NumberLong("6997429197601767425")
		}
	}
}
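If I read state 10 (REMOVED) and InvalidReplicaSetConfig correctly, the member no longer finds its own hostname in the replica set config it has, which would point at a hostname/DNS mismatch after the restart. A way to compare the two, again only a sketch from inside the mongod container of the affected pod:

// hostname as mongod reports it
db.serverStatus().host
// replica set config as stored locally (readable even while REMOVED)
db.getSiblingDB("local").system.replset.find().pretty()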