Kubernetes PSMDB shutdown signal 15

Hello,

I faced this issue a couple of times.

Environment

  • Kubernetes 1.21
  • MongoDB 3.6
  • MongoDB Operator 1.9.0

Problem

The MongoDB cluster / replica set runs fine as long as nothing happens to the pods after the first run. But if I scale my nodes and the mongod processes get restarted, the cluster won’t recover: all 3 pods keep restarting with "got signal 15" and hang in CrashLoopBackOff.

I really can’t figure out why this is happening. Maybe someone has an idea?

If you need more information, logs, or configuration, tell me and I will share it.

Logs

MongoDB logs

Shutdown log message:

2021-08-20T08:43:33.274+0000 I CONTROL  [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
2021-08-20T08:43:33.275+0000 I NETWORK  [signalProcessingThread] shutdown: going to close listening sockets...
2021-08-20T08:43:33.275+0000 I NETWORK  [signalProcessingThread] removing socket file: /tmp/mongodb-27017.sock
2021-08-20T08:43:33.275+0000 I REPL     [signalProcessingThread] shutting down replication subsystems
2021-08-20T08:43:33.275+0000 I REPL     [signalProcessingThread] Stopping replication reporter thread
2021-08-20T08:43:33.275+0000 I REPL     [signalProcessingThread] Stopping replication fetcher thread
2021-08-20T08:43:33.275+0000 I REPL     [signalProcessingThread] Stopping replication applier thread
2021-08-20T08:43:33.450+0000 I REPL     [signalProcessingThread] Stopping replication storage threads
2021-08-20T08:43:33.452+0000 I FTDC     [signalProcessingThread] Shutting down full-time diagnostic data capture
2021-08-20T08:43:33.456+0000 I STORAGE  [WTOplogJournalThread] oplog journal thread loop shutting down
2021-08-20T08:43:33.456+0000 I STORAGE  [signalProcessingThread] WiredTigerKVEngine shutting down
2021-08-20T08:43:34.004+0000 I STORAGE  [signalProcessingThread] shutdown: removing fs lock...
2021-08-20T08:43:34.004+0000 I CONTROL  [signalProcessingThread] now exiting
2021-08-20T08:43:34.004+0000 I CONTROL  [signalProcessingThread] shutting down with code:0

Operator logs

{"level":"error","ts":1629449899.757564,"logger":"controller_psmdb","msg":"failed to reconcile cluster","Request.Namespace":"my-namespace","Request.Name":"mongodb-base","replset":"instance","error":"dial:: failed to ping mongo: context deadline exceeded","errorVerbose":"failed to ping mongo: context deadline exceeded\ngithub.com/percona/percona-server-mongodb-operator/pkg/psmdb/mongo.Dial\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/psmdb/mongo/mongo.go:61\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).mongoClient\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/connections.go:59\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).mongoClientWithRole\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/connections.go:27\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).reconcileCluster\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/mgo.go:27\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).Reconcile\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/psmdb_controller.go:428\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371\ndial:\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).reconcileCluster\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/mgo.go:31\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).Reconcile\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/psmdb_controller.go:428\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/githu
b.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).Reconcile\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/psmdb_controller.go:430\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}

Edit:
I dug around a bit more and managed to catch an instance before it crashed.
This is the rs.status() output of the crashing member:

instance:OTHER> rs.status()
{
	"state" : 10,
	"stateStr" : "REMOVED",
	"uptime" : 132,
	"optime" : {
		"ts" : Timestamp(1629443538, 3),
		"t" : NumberLong(2)
	},
	"optimeDate" : ISODate("2021-08-20T07:12:18Z"),
	"lastHeartbeatMessage" : "",
	"syncingTo" : "",
	"syncSourceHost" : "",
	"syncSourceId" : -1,
	"infoMessage" : "",
	"ok" : 0,
	"errmsg" : "Our replica set config is invalid or we are not a member of it",
	"code" : 93,
	"codeName" : "InvalidReplicaSetConfig",
	"operationTime" : Timestamp(1629443538, 3),
	"$clusterTime" : {
		"clusterTime" : Timestamp(1629444174, 4),
		"signature" : {
			"hash" : BinData(0,"qDg4XbH1EG68xNyKNof6OKYyQP4="),
			"keyId" : NumberLong("6997429197601767425")
		}
	}
}

Hi @Johannes_Petz,

Can you share all of the operator and mongod logs? It may also be caused by liveness probe failures, so can you share the output of kubectl describe pod <mongod-pod-name> for each pod in the replset as well?
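For example, something like this should collect everything (pod and deployment names are guesses based on your cluster name, so adjust them to your environment):

NS=my-namespace
for pod in mongodb-base-instance-0 mongodb-base-instance-1 mongodb-base-instance-2; do
  kubectl -n "$NS" logs "$pod" -c mongod > "$pod.log.txt"
  kubectl -n "$NS" describe pod "$pod" > "$pod.describe.txt"
done
# operator logs (the deployment name may differ in your installation)
kubectl -n "$NS" logs deployment/percona-server-mongodb-operator > mongodb-operator.logs.txt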


Hi @Ege_Gunes ,
thanks for the fast reply.

I will upload the outputs as attachments. Copy&Paste didn’t work :stuck_out_tongue:

I waited until instance 1 crashed again and collected all the logs and describe outputs.
The operator log was too big, so I cut some of the duplicate messages in the middle of the file.
mongodb-operator.logs.txt (233.0 KB)

mongodb-base-instance-0.describe.txt (12.0 KB)
mongodb-base-instance-0.log.txt (116.8 KB)
mongodb-base-instance-1.describe.txt (12.7 KB)
mongodb-base-instance-1.log.txt (35.7 KB)
mongodb-base-instance-2.describe.txt (12.4 KB)
mongodb-base-instance-2.log.txt (31.7 KB)

Cheers,
Johannes


Can you also share the output of the command below (assuming your cluster name is mongodb-base):

kubectl get psmdb mongodb-base -o yaml

Yes, of course.
psmdb-get-mongodb-base.txt (8.5 KB)

I had to change the init image to a custom one. It changes the permissions of /data/db.
Description: Percona MongoDB Operator with PersistentVolumeClaim (on CephFS)


Hello @Ege_Gunes ,

after the weekend I described the psmdb resource and it had stabilized. I don’t know how, but here are the status conditions:

Status:
  Conditions:
    ....
    Last Transition Time:  2021-08-20T15:07:30Z
    Message:               instance: ready
    Reason:                RSReady
    Status:                True
    Type:                  ready
    Last Transition Time:  2021-08-21T01:00:37Z
    Status:                True
    Type:                  initializing
    Last Transition Time:  2021-08-21T01:00:55Z
    Status:                True
    Type:                  ready

Do you have any idea why this happens? I have to make sure this will not happen in production :frowning:
It would not be a good situation if our customers had to wait until the next day to be able to work again ^^


Hi @Johannes_Petz, I’m sorry I couldn’t investigate this issue further over the weekend. From the earlier logs it seems the replset is not initialized, but the operator sees it as initialized.

It’s strange that the issue fixed itself. Could you share the full operator logs without removing any lines?


Don’t be sorry for your weekend :slight_smile: I was off as well.

Here are the logs.
psmdb-opartor.logs.txt.zip.txt (206.1 KB)

I just took a look at the psmdb resource and it is in the initializing status again, so it did not recover by itself after all.


So I extracted the psmdb component from our Helm chart into its own Helm chart to reproduce the issue.

After migrating one of the pods to another node (for maintenance, etc.), the pod won’t start again.
Same result as above.

Events:
  Type     Reason                  Age    From                     Message
  ----     ------                  ----   ----                     -------
  Normal   Scheduled               5m32s  default-scheduler        Successfully assigned mongodb-ssl-error/mongodb-base-instance-1 to node-pool0-5
  Warning  FailedMount             3m30s  kubelet                  Unable to attach or mount volumes: unmounted volumes=[mongod-data], unattached volumes=[kube-api-access-8fmf4 mongodb-base-mongodb-keyfile ssl ssl-internal mongodb-mongodb-base-encryption-key users-secret-file mongod-data]: timed out waiting for the condition
  Normal   SuccessfulAttachVolume  3m26s  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-122de78d-1ccd-4949-bf3c-b9ff2a3f1b63"
  Normal   Pulled                  3m14s  kubelet                  Successfully pulled image "registry.redicals-ext.de/test/percona-server-mongodb-operator:1.9.2" in 165.668291ms
  Normal   Created                 3m14s  kubelet                  Created container mongo-init
  Normal   Pulling                 3m14s  kubelet                  Pulling image "registry.redicals-ext.de/test/percona-server-mongodb-operator:1.9.2"
  Normal   Started                 3m13s  kubelet                  Started container mongo-init
  Normal   Pulled                  3m10s  kubelet                  Successfully pulled image "percona/percona-server-mongodb:3.6.17" in 1.523458954s
  Warning  Unhealthy               2m2s   kubelet                  Liveness probe failed: {"level":"info","msg":"Running Kubernetes liveness check for mongod","time":"2021-08-23T13:14:09Z"}
{"level":"error","msg":"replSetGetStatus returned error Our replica set config is invalid or we are not a member of it","time":"2021-08-23T13:14:09Z"}
  Warning  Unhealthy  92s  kubelet  Liveness probe failed: {"level":"info","msg":"Running Kubernetes liveness check for mongod","time":"2021-08-23T13:14:39Z"}
{"level":"error","msg":"replSetGetStatus returned error Our replica set config is invalid or we are not a member of it","time":"2021-08-23T13:14:39Z"}
  Warning  Unhealthy  62s  kubelet  Liveness probe failed: {"level":"info","msg":"Running Kubernetes liveness check for mongod","time":"2021-08-23T13:15:09Z"}
{"level":"error","msg":"replSetGetStatus returned error Our replica set config is invalid or we are not a member of it","time":"2021-08-23T13:15:09Z"}
  Warning  Unhealthy  32s  kubelet  Liveness probe failed: {"level":"info","msg":"Running Kubernetes liveness check for mongod","time":"2021-08-23T13:15:39Z"}
{"level":"error","msg":"replSetGetStatus returned error Our replica set config is invalid or we are not a member of it","time":"2021-08-23T13:15:39Z"}
  Normal   Killing    32s                  kubelet  Container mongod failed liveness probe, will be restarted
  Normal   Pulling    31s (x2 over 3m12s)  kubelet  Pulling image "percona/percona-server-mongodb:3.6.17"
  Normal   Created    30s (x2 over 3m10s)  kubelet  Created container mongod
  Normal   Pulled     30s                  kubelet  Successfully pulled image "percona/percona-server-mongodb:3.6.17" in 1.147801233s
  Warning  Unhealthy  30s                  kubelet  Readiness probe failed: dial tcp 10.244.139.162:27017: connect: connection refused
  Normal   Started    29s (x2 over 3m10s)  kubelet  Started container mongod

Do you want that Helm chart?

Steps to reproduce:

  1. Start the MongoDB replica set cluster.
  2. Wait for everything to become ready.
  3. Cordon and drain one of the nodes a pod is running on (see the commands sketched below).
  4. Watch how the pod tries to recover on another node: it becomes ready and then crashes after a couple of minutes.
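The drain itself is roughly the following (the node name is a placeholder; the namespace is the one from the events above; --delete-emptydir-data is the flag name on kubectl 1.20+, older releases call it --delete-local-data):

kubectl cordon <node-running-a-mongod-pod>
kubectl drain <node-running-a-mongod-pod> --ignore-daemonsets --delete-emptydir-data
# watch the evicted pod get rescheduled on another node and eventually fail its liveness probe
kubectl -n mongodb-ssl-error get pods -o wide -w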

I tried a few things and removed the liveness probe. I think the SSL errors are not the problem.

I initiated a new cluster migration for every node, so the MongoDB pods would be migrated to other nodes. After that I am facing this error:

{"level":"error","ts":1629733831.9139616,"logger":"controller_psmdb","msg":"failed to reconcile cluster","Request.Namespace":"my-namespace-beta","Request.Name":"mongodb-base","replset":"instance","error":"dial:: failed to ping mongo: context deadline exceeded","errorVerbose":"failed to ping mongo: context deadline exceeded\ngithub.com/percona/percona-server-mongodb-operator/pkg/psmdb/mongo.Dial\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/psmdb/mongo/mongo.go:61\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).mongoClient\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/connections.go:59\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).mongoClientWithRole\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/connections.go:27\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).reconcileCluster\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/mgo.go:27\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).Reconcile\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/psmdb_controller.go:428\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371\ndial:\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).reconcileCluster\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/mgo.go:31\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).Reconcile\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/psmdb_controller.go:428\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src
/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).Reconcile\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/psmdb_controller.go:430\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}

Extracted error message:

"msg":"failed to reconcile cluster",
"Request.Namespace":"my-namespace-beta",
"Request.Name":"mongodb-base",
"replset":"instance",
"error":"dial:: failed to ping mongo: context deadline exceeded",
"errorVerbose":"failed to ping mongo: context deadline exceeded

But I can ping the mongod process myself, and I can run rs.status() as well. And there, every node tells me it is REMOVED:

Percona Server for MongoDB shell version v3.6.17-4.0
connecting to: mongodb://127.0.0.1:27017/?gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("d1c3d961-60e2-4fa9-ae15-f853fc350754") }
Percona Server for MongoDB server version: v3.6.17-4.0
{
	"state" : 10,
	"stateStr" : "REMOVED",
	"uptime" : 92,
	"optime" : {
		"ts" : Timestamp(1629729028, 6),
		"t" : NumberLong(1)
	},
	"optimeDate" : ISODate("2021-08-23T14:30:28Z"),
	"lastHeartbeatMessage" : "",
	"syncingTo" : "",
	"syncSourceHost" : "",
	"syncSourceId" : -1,
	"infoMessage" : "",
	"ok" : 0,
	"errmsg" : "Our replica set config is invalid or we are not a member of it",
	"code" : 93,
	"codeName" : "InvalidReplicaSetConfig",
	"operationTime" : Timestamp(1629729028, 6),
	"$clusterTime" : {
		"clusterTime" : Timestamp(1629731984, 2),
		"signature" : {
			"hash" : BinData(0,"HOvu4Jz8WwPp7VldCyi2yFat6wg="),
			"keyId" : NumberLong("6999615580768567297")
		}
	}
}

I really don’t get it. Why can’t the operator add the pods to the cluster after migration?
Any ideas?
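For anyone debugging the same state: "REMOVED" (code 93) means the mongod no longer finds its own address in the replica set config it holds, so comparing the stored member hosts with the address the pod is actually reachable under should show the mismatch. A rough sketch (pod/namespace names taken from my test above, authentication and TLS flags omitted):

kubectl -n my-namespace-beta exec mongodb-base-instance-1 -c mongod -- \
  mongo local --quiet --eval 'db.system.replset.findOne().members.forEach(function(m){ print(m.host) })'
# compare those hosts with the pod's headless-service DNS name
# (or the per-pod service address if expose is enabled)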

Edit: I found the default liveness probe and it is:

    livenessProbe:
      exec:
        command:
        - /data/db/mongodb-healthcheck
        - k8s
        - liveness
        - --ssl
        - --sslInsecure
        - --sslCAFile
        - /etc/mongodb-ssl/ca.crt
        - --sslPEMKeyFile
        - /tmp/tls.pem
        - --startupDelaySeconds
        - "7200"
      failureThreshold: 4
      initialDelaySeconds: 120
      periodSeconds: 30
      successThreshold: 1
      timeoutSeconds: 5
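For reference, the same check can be run by hand inside the pod to see the failure directly (pod and namespace names are the ones from my test; adjust as needed):

kubectl -n my-namespace-beta exec mongodb-base-instance-1 -c mongod -- \
  /data/db/mongodb-healthcheck k8s liveness \
    --ssl --sslInsecure \
    --sslCAFile /etc/mongodb-ssl/ca.crt \
    --sslPEMKeyFile /tmp/tls.pem \
    --startupDelaySeconds 7200
# a non-zero exit status here is what makes the kubelet restart the container
echo $?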

A describe tells me it gets killed by the liveness probe…

  Warning  Unhealthy               43m                kubelet                  Liveness probe failed: {"level":"info","msg":"Running Kubernetes liveness check for mongod","time":"2021-08-23T15:17:35Z"}
{"level":"error","msg":"replSetGetStatus returned error Our replica set config is invalid or we are not a member of it","time":"2021-08-23T15:17:35Z"}
  Warning  Unhealthy  43m  kubelet  Liveness probe failed: {"level":"info","msg":"Running Kubernetes liveness check for mongod","time":"2021-08-23T15:18:05Z"}
{"level":"error","msg":"replSetGetStatus returned error Our replica set config is invalid or we are not a member of it","time":"2021-08-23T15:18:05Z"}
  Warning  Unhealthy  42m  kubelet  Liveness probe failed: {"level":"info","msg":"Running Kubernetes liveness check for mongod","time":"2021-08-23T15:18:35Z"}
{"level":"error","msg":"replSetGetStatus returned error Our replica set config is invalid or we are not a member of it","time":"2021-08-23T15:18:35Z"}
  Warning  Unhealthy  42m  kubelet  Liveness probe failed: {"level":"info","msg":"Running Kubernetes liveness check for mongod","time":"2021-08-23T15:19:05Z"}
{"level":"error","msg":"replSetGetStatus returned error Our replica set config is invalid or we are not a member of it","time":"2021-08-23T15:19:05Z"}

I expanded the failureThreshold to 10; let’s see if this helps.
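Assuming the operator version exposes probe settings in the custom resource (worth double-checking against the cr.yaml of your release before relying on it), the override would sit under the replset spec, roughly:

    replsets:
    - name: instance
      ...
      livenessProbe:
        failureThreshold: 10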


After a lot of research, and with the help of Percona Support, I was able to solve the problem.

    replsets:
    - name: replset-name
      ...
      expose:
        enabled: false
        exposeType: ClusterIP

That was one of the changes. After this I had to delete the PSMDB resource together with the persistent volumes (!) and recreate the resources.
I think otherwise something had already gone wrong and the operator was not able to reconnect to the replset nodes and join them back into the cluster.
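For completeness, a sketch of that delete/recreate sequence (label selector and manifest name are assumptions; deleting the PVCs destroys the data, so only do this if you can restore or re-seed it):

kubectl -n my-namespace delete psmdb mongodb-base
# check the PVC labels first: kubectl -n my-namespace get pvc --show-labels
kubectl -n my-namespace delete pvc -l app.kubernetes.io/instance=mongodb-base
# re-apply the updated custom resource (file name depends on your chart/manifest)
kubectl -n my-namespace apply -f mongodb-base-cr.yaml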

@Ege_Gunes thank you for your help!

Cheers Johannes


Thanks for raising the issue.
