Cluster-wide PSMDB Operator 1.14.0 on OpenShift: Watched namespace pods fail to transition into Ready state due to 'Could not find address' error

Dear Percona Team,

I am facing an issue with the Percona Server for MongoDB Operator 1.14.0. I installed the operator in cluster-wide mode on OpenShift following the cluster-wide and OpenShift documentation: I updated the namespace and WATCH_NAMESPACE values and applied the bundle with oc create -f cw-bundle.yaml.

            - name: WATCH_NAMESPACE
              value: 'test-psmdb-ns1,test-psmdb-ns2'
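
For completeness, the apply step looked roughly like this (psmdb-operator below is just a placeholder for the operator namespace set in cw-bundle.yaml):

❯ oc create -f cw-bundle.yaml -n psmdb-operator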

However, the pods that the operator brings up in the watched namespaces are unable to reach the Ready state:

NAME                       READY   STATUS             RESTARTS          AGE
minimal-cluster-cfg-0      1/1     Running            122 (7m21s ago)   10h
minimal-cluster-mongos-0   0/1     Running            2 (10h ago)       10h
minimal-cluster-rs0-0      1/1     Running            122 (7m21s ago)   10h
mongo-dev-cfg-0            0/1     CrashLoopBackOff   111 (108s ago)    9h
mongo-dev-mongos-0         0/1     Running            0                 9h
mongo-dev-rs0-0            0/1     CrashLoopBackOff   111 (108s ago)    9h

and they fail with the following error:

{"t":{"$date":"2023-04-06T23:25:37.100+00:00"},"s":"I",  "c":"-",        "id":4333222, "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"RSM received error response","attr":{"host":"minimal-cluster-cfg-0.minimal-cluster-cfg.test-psmdb-ns1.svc.cluster.local:27017","error":"HostUnreachable: Error connecting to minimal-cluster-cfg-0.minimal-cluster-cfg.test-psmdb-ns1.svc.cluster.local:27017 :: caused by :: Could not find address for minimal-cluster-cfg-0.minimal-cluster-cfg.test-psmdb-ns1.svc.cluster.local:27017: SocketException: Host not found (authoritative)","replicaSet":"cfg","response":{}}}
{"t":{"$date":"2023-04-06T23:25:37.100+00:00"},"s":"I",  "c":"NETWORK",  "id":4712102, "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"Host failed in replica set","attr":{"replicaSet":"cfg","host":"minimal-cluster-cfg-0.minimal-cluster-cfg.test-psmdb-ns1.svc.cluster.local:27017","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Error connecting to minimal-cluster-cfg-0.minimal-cluster-cfg.test-psmdb-ns1.svc.cluster.local:27017 :: caused by :: Could not find address for minimal-cluster-cfg-0.minimal-cluster-cfg.test-psmdb-ns1.svc.cluster.local:27017: SocketException: Host not found (authoritative)"},"action":{"dropConnections":true,"requestImmediateCheck":false,"outcome":{"host":"minimal-cluster-cfg-0.minimal-cluster-cfg.test-psmdb-ns1.svc.cluster.local:27017","success":false,"errorMessage":"HostUnreachable: Error connecting to minimal-cluster-cfg-0.minimal-cluster-cfg.test-psmdb-ns1.svc.cluster.local:27017 :: caused by :: Could not find address for minimal-cluster-cfg-0.minimal-cluster-cfg.test-psmdb-ns1.svc.cluster.local:27017: SocketException: Host not found (authoritative)"}}}}

In the operator’s own namespace, MongoDB launches successfully and reaches the Ready state:

NAME                                               READY   STATUS    RESTARTS   AGE
mongo-dev-cfg-0                                    1/1     Running   0          10h
mongo-dev-mongos-0                                 1/1     Running   0          10h
mongo-dev-rs0-0                                    1/1     Running   0          10h
percona-server-mongodb-operator-5445fd995f-5ldc8   1/1     Running   0          11h

Additional log files (--tail=20):
operator-ns.txt (27.0 KB)
user-ns.txt (15.3 KB)

Could you please help me to investigate this?

Thanks in advance!

Hello @aporrinali,

A couple of questions:

  1. Which OpenShift version is it (in case I want to reproduce it)?
  2. Do they never reach the Ready state, or do they fail after some time?

Hello @Sergey_Pronin,

OpenShift is 4.12.10, and the pods never reach the Ready state; they keep restarting constantly.

@Ivan_Pylypenko have you seen anything like this before?

Hi guys

Nope, never seen it before. @aporrinali could you please share your CR configuration? If there is any security-sensitive info, please omit it.

Hi @Ivan_Pylypenko,
Sorry for the late response…

Please see the attachments, but it looks the same…
(the ‘txt’ attachments are actually ‘yaml’ files)
cw-bundle.txt (829.6 KB)
percona-server-mongodb-operator-5445fd995f-qwccb.log (683.0 KB)

minimal-cluster.txt (3.8 KB)

minimal-cluster-cfg-0.txt (9.2 KB)
minimal-cluster-cfg-0.log (77.8 KB)

minimal-cluster-mongos-0.txt (9.1 KB)
minimal-cluster-mongos-0.log (2.5 MB)

minimal-cluster-rs0-0.txt (9.2 KB)
minimal-cluster-rs0-0.log (36.0 KB)

A small update: I tried running without sharding.
Operator log:

2023-05-16T22:29:56.459Z	ERROR	failed to reconcile cluster	{"controller": "psmdb-controller", "object": {"name":"mongo-minimal","namespace":"test-psmdb-1"}, "namespace": "test-psmdb-1", "name": "mongo-minimal", "reconcileID": "ce40ab38-9237-46bd-b5c6-bf47214d0098", "replset": "rs0", "error": "handleReplsetInit: exec add admin user: command terminated with exit code 1 / Warning: Could not access file: EACCES: permission denied, mkdir '/.mongodb'\nCurrent Mongosh Log ID:\t646403e4d9945bcf6c2227b8\nConnecting to:\t\tmongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+1.6.2\nUsing MongoDB:\t\t6.0.4-3\nUsing Mongosh:\t\t1.6.2\n\nFor mongosh info see: https://docs.mongodb.com/mongodb-shell/\n\n\nTo help improve our products, anonymous usage data is collected and sent to MongoDB periodically (https://www.mongodb.com/legal/privacy-policy).\nYou can opt-out by running the disableTelemetry() command.\n\n\nError: Could not open history file.\nREPL session history will not be persisted.\n\u001b[1G\u001b[0J \u001b[1G / MongoServerError: command createUser requires authentication\n", "errorVerbose": "exec add admin user: command terminated with exit code 1 / Warning: Could not access file: EACCES: permission denied, mkdir '/.mongodb'\nCurrent Mongosh Log ID:\t646403e4d9945bcf6c2227b8\nConnecting to:\t\tmongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+1.6.2\nUsing MongoDB:\t\t6.0.4-3\nUsing Mongosh:\t\t1.6.2\n\nFor mongosh info see: https://docs.mongodb.com/mongodb-shell/\n\n\nTo help improve our products, anonymous usage data is collected and sent to MongoDB periodically (https://www.mongodb.com/legal/privacy-policy).\nYou can opt-out by running the disableTelemetry() command.\n\n\nError: Could not open history file.\nREPL session history will not be persisted.\n\u001b[1G\u001b[0J \u001b[1G / MongoServerError: command createUser requires authentication\n\nhandleReplsetInit\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).reconcileCluster\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/mgo.go:99\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).Reconcile\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/psmdb_controller.go:487\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:235\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594"}
{"t":{"$date":"2023-05-16T22:29:18.096+00:00"},"s":"I",  "c":"ACCESS",   "id":20249,   "ctx":"conn117","msg":"Authentication failed","attr":{"mechanism":"SCRAM-SHA-256","speculative":true,"principalName":"clusterMonitor","authenticationDatabase":"admin","remote":"172.20.13.112:48034","extraInfo":{},"error":"UserNotFound: Could not find user \"clusterMonitor\" for db \"admin\""}}

{"t":{"$date":"2023-05-16T22:29:18.097+00:00"},"s":"I",  "c":"ACCESS",   "id":20249,   "ctx":"conn117","msg":"Authentication failed","attr":{"mechanism":"SCRAM-SHA-1","speculative":false,"principalName":"clusterMonitor","authenticationDatabase":"admin","remote":"172.20.13.112:48034","extraInfo":{},"error":"UserNotFound: Could not find user \"clusterMonitor\" for db \"admin\""}}

ns-o-operator.log (104.1 KB)
mongo-minimal-ns-1.log (150.6 KB)

Hello everyone,
I guess the issue may be closed; the root cause was found, and it was mostly down to our OpenShift setup…

From the beginning: the PSMDB operator also monitors the namespace it resides in. For testing purposes, a MongoDB cluster was launched in that namespace, and it started successfully and was accessible.

The problem arose when attempting to start MongoDB in the separate watched namespaces. The operator could not complete the initialization of MongoDB in those namespaces, and the MongoDB clusters kept restarting. Here is a snippet from the operator’s log, which is essentially the only error that could be worked with:

"error": "handleReplsetInit: exec add admin user: command terminated with exit code 1 / Warning: Could not access file: EACCES: permission denied, mkdir '/.mongodb'\nCurrent Mongosh Log ID:\t646403e4d9945bcf6c2227b8\nConnecting to:\t\tmongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+1.6.2

The whole mystery was in the NetworkPolicies. I was creating projects either through the UI or with the oc new-project <name> command, and each new project came with some additional NetworkPolicies:

❯ oc get netpol
NAME                                POD-SELECTOR   AGE
allow-from-openshift-ingress        <none>         54m
allow-from-openshift-monitoring     <none>         54m
allow-from-openshift-web-terminal   <none>         54m
allow-same-namespace                <none>         54m

❯ oc get netpol allow-from-openshift-ingress -o yaml

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-openshift-ingress
  namespace: test-psmdb-ns1
spec:
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          network.openshift.io/policy-group: ingress
  podSelector: {}
  policyTypes:
  - Ingress
status: {}

The netpol/allow-from-openshift-ingress policy only allows ingress from namespaces that carry a specific label, so in our case it was blocking traffic coming from the operator’s namespace.
Adding the label network.openshift.io/policy-group: ingress to the operator’s namespace resolved the issue, and the MongoDB cluster started successfully in the adjacent namespace.
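
For reference, that is a single command (<operator-namespace> is a placeholder, substitute your own):

❯ oc label namespace <operator-namespace> network.openshift.io/policy-group=ingress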

Deleting the netpols was not an option, because we do need them in our OpenShift setup.
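
If you would rather not relabel the operator namespace, I believe a more narrowly scoped policy along these lines achieves the same effect (just a sketch: the policy name and the psmdb-operator namespace name are placeholders, and kubernetes.io/metadata.name is the label Kubernetes sets on every namespace automatically):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-psmdb-operator
  namespace: test-psmdb-ns1
spec:
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: psmdb-operator
  podSelector: {}
  policyTypes:
  - Ingress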

It was also a bit helpful to learn the difference between oc new-project <name> and oc create namespace <name>:
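
The short version, at least on our cluster (demo-a and demo-b are just example names): oc new-project goes through the project request template, which is what creates those NetworkPolicies, while oc create namespace makes a bare namespace that the template never touches.

❯ oc new-project demo-a       # project request template runs; the four netpols appear
❯ oc get netpol -n demo-a     # lists allow-from-openshift-ingress, allow-same-namespace, …

❯ oc create namespace demo-b  # bare namespace, the template is bypassed
❯ oc get netpol -n demo-b     # nothing there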

In addition,
@Sergey_Pronin, are there any plans to support the cluster-wide mode setup through the OpenShift OperatorHub?

@aporrinali the thing is how OperatorHub is structured on the backend: either we create a separate product under OperatorHub and maintain both, or we don’t do it at all. For now it is in our backlog, but we are a bit hesitant to put more resources into it.

Is there any specific reason why you need OperatorHub? Is it a strong requirement for you?

Hello,

It’s a very good question, and I agree that there are many pros and cons here…
I would say it is mainly because of the additional layer of verification and certification that operators go through for the Certified channel of OperatorHub.

Perhaps… OperatorHub is not such a good option… too many cons…

Anyway, thanks for your answer.