Mongo replset fails to restart if backup is switched on/off

Hi, I have recently encountered this issue:

  • Create the Mongo replset with backup disabled → works fine.
  • Modify the CR to enable backup (see the snippet below) → PSMDB gets stuck in the initializing state.
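
For reference, the only change I made was in the backup section of the CR, roughly like this (a minimal sketch assuming the standard cr.yaml layout; all other backup settings were left at their defaults):

spec:
  backup:
    enabled: true    # was: false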

Error logs from the operator after modifying the CR:

{
   "level":"error",
   "ts":1630657894.349765,
   "logger":"controller-runtime.controller",
   "msg":"Reconciler error",
   "controller":"psmdb-controller",
   "request":"percona-mongodb/loci-dev",
   "error":"reconcile StatefulSet for rs0: failed to run smartUpdate: failed to check active jobs: getting pbm object: create PBM connection to 10.171.130.44:30969,10.171.130.45:32443,10.171.130.17:30392: create mongo connection: mongo ping: server selection error: server selection timeout, current topology: { Type: ReplicaSetNoPrimary, Servers: [{ Addr: 10.171.130.44:30969, Type: RSGhost, Average RTT: 677173 }, { Addr: 10.171.130.45:32443, Type: RSSecondary, Average RTT: 569239 }, { Addr: 10.171.130.17:30392, Type: RSGhost, Average RTT: 723416 }, ] }",
   "errorVerbose":"reconcile StatefulSet for rs0: failed to run smartUpdate: failed to check active jobs: getting pbm object: create PBM connection to 10.171.130.44:30969,10.171.130.45:32443,10.171.130.17:30392: create mongo connection: mongo ping: server selection error: server selection timeout, current topology: { Type: ReplicaSetNoPrimary, Servers: [{ Addr: 10.171.130.44:30969, Type: RSGhost, Average RTT: 677173 }, { Addr: 10.171.130.45:32443, Type: RSSecondary, Average RTT: 569239 }, { Addr: 10.171.130.17:30392, Type: RSGhost, Average RTT: 723416 }, ] }\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).Reconcile\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/psmdb_controller.go:365\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371",
   "stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:258\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"
}

Error logs from the backup-agent container:

2021-09-03T08:40:11.612+0000 W  NETWORK  [ReplicaSetMonitor-TaskExecutor] Unable to reach primary for set rs0
2021-09-03T08:40:11.612+0000 I  NETWORK  [ReplicaSetMonitor-TaskExecutor] Cannot reach any nodes for set rs0. Please check network connectivity and the status of the set. This has happened for 2 checks in a row.

Hello @vhphan ,

I cannot reproduce it.
I took the 1.9.0 Operator and deployed the default cr.yaml with backups disabled.
Then I enabled backups by changing spec.backup.enabled to true.
PSMDB is ready and there are no errors.
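
For the record, the toggle amounts to something like this (or an equivalent kubectl edit / kubectl apply; cluster name and namespace are placeholders):

kubectl patch psmdb my-cluster-name -n my-namespace --type=merge \
  -p '{"spec":{"backup":{"enabled":true}}}'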

Is there anything specific about your deployment that might be relevant?
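
If the cluster is still stuck, could you also share what rs.status() reports from inside a mongod container? A check along these lines should work (pod name, user, and password are placeholders, assuming the default pod naming for your loci-dev cluster):

kubectl exec loci-dev-rs0-0 -n percona-mongodb -c mongod -- \
  mongo -u clusterAdmin -p '<password>' --authenticationDatabase admin \
  --eval 'printjson(rs.status())'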
