Mongo replset fails to restart if backup is switched on/off

Hi, I have recently encountered this issue:

  • Create a Mongo replset with backup disabled → works fine.
  • Modify the CR to enable backup → PSMDB gets stuck in the initializing state.

Error logs from the operator after modifying the CR:

{
   "level":"error",
   "ts":1630657894.349765,
   "logger":"controller-runtime.controller",
   "msg":"Reconciler error",
   "controller":"psmdb-controller",
   "request":"percona-mongodb/loci-dev",
   "error":"reconcile StatefulSet for rs0: failed to run smartUpdate: failed to check active jobs: getting pbm object: create PBM connection to 10.171.130.44:30969,10.171.130.45:32443,10.171.130.17:30392: create mongo connection: mongo ping: server selection error: server selection timeout, current topology: { Type: ReplicaSetNoPrimary, Servers: [{ Addr: 10.171.130.44:30969, Type: RSGhost, Average RTT: 677173 }, { Addr: 10.171.130.45:32443, Type: RSSecondary, Average RTT: 569239 }, { Addr: 10.171.130.17:30392, Type: RSGhost, Average RTT: 723416 }, ] }",
   "errorVerbose":"reconcile StatefulSet for rs0: failed to run smartUpdate: failed to check active jobs: getting pbm object: create PBM connection to 10.171.130.44:30969,10.171.130.45:32443,10.171.130.17:30392: create mongo connection: mongo ping: server selection error: server selection timeout, current topology: { Type: ReplicaSetNoPrimary, Servers: [{ Addr: 10.171.130.44:30969, Type: RSGhost, Average RTT: 677173 }, { Addr: 10.171.130.45:32443, Type: RSSecondary, Average RTT: 569239 }, { Addr: 10.171.130.17:30392, Type: RSGhost, Average RTT: 723416 }, ] }\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).Reconcile\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/psmdb_controller.go:365\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371",
   "stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:258\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"
}

Error logs from the backup-agent container:

2021-09-03T08:40:11.612+0000 W  NETWORK  [ReplicaSetMonitor-TaskExecutor] Unable to reach primary for set rs0
2021-09-03T08:40:11.612+0000 I  NETWORK  [ReplicaSetMonitor-TaskExecutor] Cannot reach any nodes for set rs0. Please check network connectivity and the status of the set. This has happened for 2 checks in a row.

Hello @vhphan ,

I cannot reproduce it.
I took the 1.9.0 Operator and deployed the default cr.yaml with backups disabled,
then enabled backups by changing spec.backup.enabled to true.
PSMDB is ready and there are no errors.
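
For reference, the only change was flipping this one field in the default cr.yaml (a fragment only; every other backup field was left at the 1.9.0 defaults):

spec:
  backup:
    enabled: true   # changed from false; nothing else touched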

Is there anything non-default about your deployment?


No… it's all default. I don't know if it's, once again, a problem specific to my k8s provider. I will update if I find anything else.


Having the same issue.
Operator 1.13
MongoDB 4.4
@vhphan did you ever find the cause?


Hi @reab!
For me it seems to work, but I didn’t use the SSL certificates from the “deploy” directory. I just let the operator generate the certificates (i.e. I didn’t apply the example ones from that directory), because the SSL certificates in the deploy directory have some hardcoded altNames for the “psmdb” namespace, and it seems they won’t work properly in other namespaces.
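
In other words, if you do provide your own certificates, the altNames have to match your actual cluster name and namespace. As a rough sketch only (all names here are hypothetical, and the operator can also generate this for you via cert-manager), a cert-manager Certificate for the replset service would look something like:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-cluster-name-ssl              # hypothetical name
  namespace: my-namespace
spec:
  secretName: my-cluster-name-ssl        # should match spec.secrets.ssl in the PSMDB CR
  dnsNames:
    - localhost
    - "*.my-cluster-name-rs0.my-namespace.svc.cluster.local"
    - "*.my-cluster-name-rs0.my-namespace"
  issuerRef:
    name: my-issuer                      # hypothetical issuer
    kind: Issuer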
Could you please re-check and follow up with whether that works for you?


Hi @Tomislav_Plavcic,
the certificates were generated by the operator; I am deploying with Helm and only enabled backup there.

The issue appeared after upgrading to Operator 1.13. I had the cluster running with allowUnsafeConfigurations: false, and when I changed it to allowUnsafeConfigurations: true the issue disappeared. It seems to be related to the change made to fix [K8SPSMDB-515] Allow setting requireTLS mode for MongoDB through the Operator - Percona JIRA.
See here
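
To be explicit, the toggle that makes the error go away is just this CR fragment (my understanding, which may be wrong, is that with false the 1.13 operator now enforces requireTLS, per the ticket above):

spec:
  allowUnsafeConfigurations: true   # error appears with false, disappears with true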

My question now is how to keep allowUnsafeConfigurations: false without getting the SSL error.

For your reference, the error is:
check for concurrent jobs: getting pbm object: create PBM connection to Cluster-Y create mongo connection: mongo ping: server selection error: server selection timeout, current topology: { Type: ReplicaSetNoPrimary, Servers: [{ Addr: IP:27017, Type: Unknown, Last error: connection() error occured during connection handshake: x509: cannot validate certificate for IP because it doesn't contain any IP SANs }, { Addr: IP:27017, Type: Unknown, Last error: connection() error occured during connection handshake: x509: cannot validate certificate for IP because it doesn't contain any IP SANs }, { Addr: IP:27017, Type: Unknown, Last error: connection() error occured during connection handshake: x509: cannot validate certificate for IP because it doesn't contain any IP SANs }, { Addr: IP:27017, Type: Unknown, Last error: connection() error occured during connection handshake: x509: cannot validate certificate for IP because it doesn't contain any IP SANs }, { Addr: IP:27017, Type: Unknown, Last error: connection() error occured during connection handshake: x509: cannot validate certificate for IP because it doesn't contain any IP SANs }, ] }
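
One option I am considering (not yet verified) is to provide my own TLS secrets whose certificates include the exposed IPs as IP SANs, and point the CR at them. The secret names below are hypothetical; spec.secrets.ssl and spec.secrets.sslInternal are the fields in cr.yaml:

spec:
  secrets:
    ssl: my-custom-ssl                   # cert must list the node/pod IPs as IP SANs
    sslInternal: my-custom-ssl-internal  # hypothetical secret names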


Hi @reab!
How are you running operator 1.13.0 when the (official) helm chart has not been released yet?
I have tried running 1.13.0 (not with helm) with allowUnsafeConfigurations: false and I don’t see your error:

NAME                                               READY   STATUS    RESTARTS   AGE
my-cluster-name-rs0-0                              2/2     Running   0          6m7s
percona-server-mongodb-operator-6677c8cbf7-ckjcg   1/1     Running   0          6m44s

Also note that we have a similar issue reported related to migration from/to a safe configuration: [K8SPSMDB-780] Failed to downscale/upscale cluster to unsafe configuration - Percona JIRA.
Maybe it’s somehow related.


I am also running the 1.13 operator and have run into this. I can stand up a small cluster with backups disabled from the onset (in my case I’m testing some other features, so I don’t need backups enabled). When I then make a change, such as to the mongod configuration, the operator fails with the same error. It is trying to check PBM, but (a) I don’t have backups enabled, and (b) it is connecting by IP, which is never going to be part of the certificate (the cert is generated by the operator; I am not providing one).

2022-12-09T17:03:04.908Z	INFO	controller_psmdb	StatefulSet is changed, starting smart update	{"name": "main-psmdb-db-cfg"}
2022-12-09T17:03:34.928Z	ERROR	controller.psmdb-controller	Reconciler error	{"name": "main-psmdb-db", "namespace": "ns-team-ads-test", "error": "reconcile StatefulSet for cfg: failed to run smartUpdate: failed to check active jobs: getting pbm object: create PBM connection to main-psmdb-db-rs0-0.main-psmdb-db-rs0.ns-team-ads-test.svc.cluster.local:27017,main-psmdb-db-rs0-2.main-psmdb-db-rs0.ns-team-ads-test.svc.cluster.local:27017,main-psmdb-db-rs0-1.main-psmdb-db-rs0.ns-team-ads-test.svc.cluster.local:27017: create mongo connection: mongo ping: server selection error: server selection timeout, current topology: { Type: ReplicaSetNoPrimary, Servers: [{ Addr: 192.168.87.140:27017, Type: Unknown, Last error: connection() error occured during connection handshake: x509: cannot validate certificate for 192.168.87.140 because it doesn't contain any IP SANs }, { Addr: 192.168.228.98:27017, Type: Unknown, Last error: connection() error occured during connection handshake: x509: cannot validate certificate for 192.168.228.98 because it doesn't contain any IP SANs }, { Addr: 192.168.204.96:27017, Type: Unknown, Last error: connection() error occured during connection handshake: x509: cannot validate certificate for 192.168.204.96 because it doesn't contain any IP SANs }, ] }", "errorVerbose": "reconcile StatefulSet for cfg: failed to run smartUpdate: failed to check active jobs: getting pbm object: create PBM connection to main-psmdb-db-rs0-0.main-psmdb-db-rs0.ns-team-ads-test.svc.cluster.local:27017,main-psmdb-db-rs0-2.main-psmdb-db-rs0.ns-team-ads-test.svc.cluster.local:27017,main-psmdb-db-rs0-1.main-psmdb-db-rs0.ns-team-ads-test.svc.cluster.local:27017: create mongo connection: mongo ping: server selection error: server selection timeout, current topology: { Type: ReplicaSetNoPrimary, Servers: [{ Addr: 192.168.87.140:27017, Type: Unknown, Last error: connection() error occured during connection handshake: x509: cannot validate certificate for 192.168.87.140 because it doesn't contain any IP SANs }, { Addr: 192.168.228.98:27017, Type: Unknown, Last error: connection() error occured during connection handshake: x509: cannot validate certificate for 192.168.228.98 because it doesn't contain any IP SANs }, { Addr: 192.168.204.96:27017, Type: Unknown, Last error: connection() error occured during connection handshake: x509: cannot validate certificate for 192.168.204.96 because it doesn't contain any IP SANs }, ] 
}\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).Reconcile\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/psmdb_controller.go:415\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1571"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
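
One workaround I’m considering, though I haven’t verified it: the failure is inside the smartUpdate path, so changing the update strategy in the CR might sidestep the PBM check entirely, at the cost of losing smart updates:

spec:
  updateStrategy: RollingUpdate   # default is SmartUpdate; RollingUpdate should skip
                                  # the operator's smartUpdate and its PBM job check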