Cluster-wide PSMDB Operator 1.14.0 on OpenShift: Watched namespace pods fail to transition into Ready state due to 'Could not find address' error

Dear Percona Team,

I am facing an issue with the Percona Server for MongoDB Operator 1.14.0. I installed the operator in cluster-wide mode on OpenShift following the cluster-wide and OpenShift documentation: I updated the namespace and WATCH_NAMESPACE values and applied the bundle with oc create -f cw-bundle.yaml.

            - name: WATCH_NAMESPACE
              value: 'test-psmdb-ns1,test-psmdb-ns2'
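
For completeness, the apply step looked roughly like this (psmdb-operator below is just a placeholder for the operator namespace set in cw-bundle.yaml):

❯ oc create -f cw-bundle.yaml -n psmdb-operator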

However, the pods that the operator brings up in the watched namespaces are unable to reach the Ready state:

NAME                       READY   STATUS             RESTARTS          AGE
minimal-cluster-cfg-0      1/1     Running            122 (7m21s ago)   10h
minimal-cluster-mongos-0   0/1     Running            2 (10h ago)       10h
minimal-cluster-rs0-0      1/1     Running            122 (7m21s ago)   10h
mongo-dev-cfg-0            0/1     CrashLoopBackOff   111 (108s ago)    9h
mongo-dev-mongos-0         0/1     Running            0                 9h
mongo-dev-rs0-0            0/1     CrashLoopBackOff   111 (108s ago)    9h

and they fail with the following error:

{"t":{"$date":"2023-04-06T23:25:37.100+00:00"},"s":"I",  "c":"-",        "id":4333222, "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"RSM received error response","attr":{"host":"minimal-cluster-cfg-0.minimal-cluster-cfg.test-psmdb-ns1.svc.cluster.local:27017","error":"HostUnreachable: Error connecting to minimal-cluster-cfg-0.minimal-cluster-cfg.test-psmdb-ns1.svc.cluster.local:27017 :: caused by :: Could not find address for minimal-cluster-cfg-0.minimal-cluster-cfg.test-psmdb-ns1.svc.cluster.local:27017: SocketException: Host not found (authoritative)","replicaSet":"cfg","response":{}}}
{"t":{"$date":"2023-04-06T23:25:37.100+00:00"},"s":"I",  "c":"NETWORK",  "id":4712102, "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"Host failed in replica set","attr":{"replicaSet":"cfg","host":"minimal-cluster-cfg-0.minimal-cluster-cfg.test-psmdb-ns1.svc.cluster.local:27017","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Error connecting to minimal-cluster-cfg-0.minimal-cluster-cfg.test-psmdb-ns1.svc.cluster.local:27017 :: caused by :: Could not find address for minimal-cluster-cfg-0.minimal-cluster-cfg.test-psmdb-ns1.svc.cluster.local:27017: SocketException: Host not found (authoritative)"},"action":{"dropConnections":true,"requestImmediateCheck":false,"outcome":{"host":"minimal-cluster-cfg-0.minimal-cluster-cfg.test-psmdb-ns1.svc.cluster.local:27017","success":false,"errorMessage":"HostUnreachable: Error connecting to minimal-cluster-cfg-0.minimal-cluster-cfg.test-psmdb-ns1.svc.cluster.local:27017 :: caused by :: Could not find address for minimal-cluster-cfg-0.minimal-cluster-cfg.test-psmdb-ns1.svc.cluster.local:27017: SocketException: Host not found (authoritative)"}}}}

In the operator’s own namespace, MongoDB launches successfully and reaches the Ready state:

NAME                                               READY   STATUS    RESTARTS   AGE
mongo-dev-cfg-0                                    1/1     Running   0          10h
mongo-dev-mongos-0                                 1/1     Running   0          10h
mongo-dev-rs0-0                                    1/1     Running   0          10h
percona-server-mongodb-operator-5445fd995f-5ldc8   1/1     Running   0          11h

Additional log files (--tail=20):
operator-ns.txt (27.0 KB)
user-ns.txt (15.3 KB)

Could you please help me to investigate this?

Thanks in advance!

Hello @aporrinali,

A couple of questions:

  1. Which OpenShift version is it (in case I want to reproduce it)?
  2. Do they never reach the Ready state, or do they fail after some time?

Hello @Sergey_Pronin,

OpenShift is 4.12.10, and the pods never reach the Ready state; they keep restarting constantly.

@Ivan_Pylypenko have you seen anything like this before?

Hi guys

Nope, never seen it before. @aporrinali could you please share your CR configuration? If there is any security-sensitive info, please omit it.

Hi @Ivan_Pylypenko,
Sorry for the late response…

Please see the attachments, but it looks the same…
(the ‘txt’ attachments are actually ‘yaml’ files)
cw-bundle.txt (829.6 KB)
percona-server-mongodb-operator-5445fd995f-qwccb.log (683.0 KB)

minimal-cluster.txt (3.8 KB)

minimal-cluster-cfg-0.txt (9.2 KB)
minimal-cluster-cfg-0.log (77.8 KB)

minimal-cluster-mongos-0.txt (9.1 KB)
minimal-cluster-mongos-0.log (2.5 MB)

minimal-cluster-rs0-0.txt (9.2 KB)
minimal-cluster-rs0-0.log (36.0 KB)

A small update: I tried running without sharding.
Operator log:

2023-05-16T22:29:56.459Z	ERROR	failed to reconcile cluster	{"controller": "psmdb-controller", "object": {"name":"mongo-minimal","namespace":"test-psmdb-1"}, "namespace": "test-psmdb-1", "name": "mongo-minimal", "reconcileID": "ce40ab38-9237-46bd-b5c6-bf47214d0098", "replset": "rs0", "error": "handleReplsetInit: exec add admin user: command terminated with exit code 1 / Warning: Could not access file: EACCES: permission denied, mkdir '/.mongodb'\nCurrent Mongosh Log ID:\t646403e4d9945bcf6c2227b8\nConnecting to:\t\tmongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+1.6.2\nUsing MongoDB:\t\t6.0.4-3\nUsing Mongosh:\t\t1.6.2\n\nFor mongosh info see: https://docs.mongodb.com/mongodb-shell/\n\n\nTo help improve our products, anonymous usage data is collected and sent to MongoDB periodically (https://www.mongodb.com/legal/privacy-policy).\nYou can opt-out by running the disableTelemetry() command.\n\n\nError: Could not open history file.\nREPL session history will not be persisted.\n\u001b[1G\u001b[0J \u001b[1G / MongoServerError: command createUser requires authentication\n", "errorVerbose": "exec add admin user: command terminated with exit code 1 / Warning: Could not access file: EACCES: permission denied, mkdir '/.mongodb'\nCurrent Mongosh Log ID:\t646403e4d9945bcf6c2227b8\nConnecting to:\t\tmongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+1.6.2\nUsing MongoDB:\t\t6.0.4-3\nUsing Mongosh:\t\t1.6.2\n\nFor mongosh info see: https://docs.mongodb.com/mongodb-shell/\n\n\nTo help improve our products, anonymous usage data is collected and sent to MongoDB periodically (https://www.mongodb.com/legal/privacy-policy).\nYou can opt-out by running the disableTelemetry() command.\n\n\nError: Could not open history file.\nREPL session history will not be persisted.\n\u001b[1G\u001b[0J \u001b[1G / MongoServerError: command createUser requires authentication\n\nhandleReplsetInit\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).reconcileCluster\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/mgo.go:99\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).Reconcile\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/psmdb_controller.go:487\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:235\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594"}
{"t":{"$date":"2023-05-16T22:29:18.096+00:00"},"s":"I",  "c":"ACCESS",   "id":20249,   "ctx":"conn117","msg":"Authentication failed","attr":{"mechanism":"SCRAM-SHA-256","speculative":true,"principalName":"clusterMonitor","authenticationDatabase":"admin","remote":"172.20.13.112:48034","extraInfo":{},"error":"UserNotFound: Could not find user \"clusterMonitor\" for db \"admin\""}}

{"t":{"$date":"2023-05-16T22:29:18.097+00:00"},"s":"I",  "c":"ACCESS",   "id":20249,   "ctx":"conn117","msg":"Authentication failed","attr":{"mechanism":"SCRAM-SHA-1","speculative":false,"principalName":"clusterMonitor","authenticationDatabase":"admin","remote":"172.20.13.112:48034","extraInfo":{},"error":"UserNotFound: Could not find user \"clusterMonitor\" for db \"admin\""}}

ns-o-operator.log (104.1 KB)
mongo-minimal-ns-1.log (150.6 KB)

Hello everyone,
I guess the issue may be closed; the root cause was found, and it was mostly down to our OpenShift setup…

From the beginning: the PSMDB operator also monitors the namespace it resides in. For testing purposes, a MongoDB cluster was launched in that namespace, and it started successfully and was accessible.

The problem arose when attempting to start MongoDB in the separate watched namespaces. The operator could not complete the initialization of MongoDB in those namespaces, and the MongoDB clusters kept restarting. Here is a snippet from the operator’s log, which is essentially the only error that could be worked with:

"error": "handleReplsetInit: exec add admin user: command terminated with exit code 1 / Warning: Could not access file: EACCES: permission denied, mkdir '/.mongodb'\nCurrent Mongosh Log ID:\t646403e4d9945bcf6c2227b8\nConnecting to:\t\tmongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+1.6.2

The whole mystery was in the NetworkPolicies. I was creating projects either through the UI or with the oc new-project <name> command, and each new project came with some additional NetworkPolicies:

❯ oc get netpol
NAME                                POD-SELECTOR   AGE
allow-from-openshift-ingress        <none>         54m
allow-from-openshift-monitoring     <none>         54m
allow-from-openshift-web-terminal   <none>         54m
allow-same-namespace                <none>         54m

❯ oc get netpol allow-from-openshift-ingress -o yaml

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-openshift-ingress
  namespace: test-psmdb-ns1
spec:
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          network.openshift.io/policy-group: ingress
  podSelector: {}
  policyTypes:
  - Ingress
status: {}

The netpol/allow-from-openshift-ingress policy only allows ingress from namespaces that carry a specific label, so in our case it was blocking traffic coming from the operator’s namespace.
Adding the label network.openshift.io/policy-group: ingress to the operator’s namespace resolved the issue, and the MongoDB cluster started successfully in the adjacent namespace.
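
For reference, that is a single command (<operator-namespace> is a placeholder, substitute your own):

❯ oc label namespace <operator-namespace> network.openshift.io/policy-group=ingress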

Deleting the netpols was not an option, because we do need them in our OpenShift setup.
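
If you would rather not relabel the operator namespace, I believe a more narrowly scoped policy along these lines achieves the same effect (just a sketch: the policy name and the psmdb-operator namespace name are placeholders, and kubernetes.io/metadata.name is the label Kubernetes sets on every namespace automatically):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-psmdb-operator
  namespace: test-psmdb-ns1
spec:
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: psmdb-operator
  podSelector: {}
  policyTypes:
  - Ingress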

It was also a bit helpful to learn the difference between oc new-project <name> and oc create namespace <name>:
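
The short version, at least on our cluster (demo-a and demo-b are just example names): oc new-project goes through the project request template, which is what creates those NetworkPolicies, while oc create namespace makes a bare namespace that the template never touches.

❯ oc new-project demo-a       # project request template runs; the four netpols appear
❯ oc get netpol -n demo-a     # lists allow-from-openshift-ingress, allow-same-namespace, …

❯ oc create namespace demo-b  # bare namespace, the template is bypassed
❯ oc get netpol -n demo-b     # nothing there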

In addition,
@Sergey_Pronin, are there any plans to support the cluster-wide mode setup through the OpenShift OperatorHub?

@aporrinali the thing is how OperatorHub is structured on the backend: either we create a separate product under OperatorHub and maintain both, or we don’t do it at all. For now it is in our backlog, but we are a bit hesitant to put more resources into it.

Is there any specific reason why you need OperatorHub? Is it a strong requirement for you?

Hello,

It’s a very good question, and I agree that there are many pros and cons here…
I would say it is mainly because of the additional layer of verification and certification that operators go through for the Certified channel of OperatorHub.

Perhaps… OperatorHub is not such a good option… too many cons…

Anyway, thanks for your answer.