Hi, I’m using operator percona-server-mongodb-operator.v1.13.1, installed via OCP marketplace and I’m experiencing some pods goingo to CrashLoopBackOff and restarting during the initialize phase:
# oc -n psmdb get po
NAME READY STATUS RESTARTS AGE
percona-server-mongodb-operator-676fbd9bbc-gbmws 1/1 Running 4 (19h ago) 25h
test-cluster-cfg-0 2/2 Running 342 (6m15s ago) 23h
test-cluster-cfg-1 1/2 CrashLoopBackOff 257 (3m53s ago) 23h
test-cluster-cfg-2 1/2 CrashLoopBackOff 317 (66s ago) 23h
test-cluster-mongos-0 0/1 Running 68 (50m ago) 24h
test-cluster-mongos-1 0/1 Running 58 (80s ago) 24h
test-cluster-mongos-2 0/1 Running 63 (49m ago) 23h
test-cluster-rs0-0 1/2 CrashLoopBackOff 274 (37s ago) 23h
test-cluster-rs0-1 1/2 CrashLoopBackOff 339 (3m49s ago) 24h
test-cluster-rs0-2 1/2 CrashLoopBackOff 289 (2m45s ago) 23h
Pod logs show some strange error:
# oc -n psmdb logs test-cluster-rs0-1
...
{"t":{"$date":"2023-02-17T16:28:01.856+00:00"},"s":"I", "c":"-", "id":4333222, "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"RSM received error response","attr":{"host":"test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017","error":"HostUnreachable: Error connecting to test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017 :: caused by :: Could not find address for test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017: SocketException: Host not found (authoritative)","replicaSet":"cfg","response":"{}"}}
{"t":{"$date":"2023-02-17T16:28:01.856+00:00"},"s":"I", "c":"NETWORK", "id":4712102, "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"Host failed in replica set","attr":{"replicaSet":"cfg","host":"test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Error connecting to test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017 :: caused by :: Could not find address for test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017: SocketException: Host not found (authoritative)"},"action":{"dropConnections":true,"requestImmediateCheck":true}}}
{"t":{"$date":"2023-02-17T16:28:01.858+00:00"},"s":"I", "c":"-", "id":4333222, "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"RSM received error response","attr":{"host":"test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017","error":"HostUnreachable: Error connecting to test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017 :: caused by :: Could not find address for test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017: SocketException: Host not found (authoritative)","replicaSet":"cfg","response":"{}"}}
{"t":{"$date":"2023-02-17T16:28:01.858+00:00"},"s":"I", "c":"NETWORK", "id":4712102, "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"Host failed in replica set","attr":{"replicaSet":"cfg","host":"test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Error connecting to test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017 :: caused by :: Could not find address for test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017: SocketException: Host not found (authoritative)"},"action":{"dropConnections":true,"requestImmediateCheck":false,"outcome":{"host":"test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017","success":false,"errorMessage":"HostUnreachable: Error connecting to test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017 :: caused by :: Could not find address for test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017: SocketException: Host not found (authoritative)"}}}}
{"t":{"$date":"2023-02-17T16:28:01.897+00:00"},"s":"I", "c":"CONNPOOL", "id":22576, "ctx":"ShardRegistry","msg":"Connecting","attr":{"hostAndPort":"test-cluster-rs0-0.test-cluster-rs0.psmdb.svc.cluster.local:27017"}}
{"t":{"$date":"2023-02-17T16:28:02.358+00:00"},"s":"I", "c":"-", "id":4333222, "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"RSM received error response","attr":{"host":"test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017","error":"HostUnreachable: Error connecting to test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017 :: caused by :: Could not find address for test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017: SocketException: Host not found (authoritative)","replicaSet":"cfg","response":"{}"}}
{"t":{"$date":"2023-02-17T16:28:02.358+00:00"},"s":"I", "c":"NETWORK", "id":4712102, "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"Host failed in replica set","attr":{"replicaSet":"cfg","host":"test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Error connecting to test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017 :: caused by :: Could not find address for test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017: SocketException: Host not found (authoritative)"},"action":{"dropConnections":true,"requestImmediateCheck":false,"outcome":{"host":"test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017","success":false,"errorMessage":"HostUnreachable: Error connecting to test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017 :: caused by :: Could not find address for test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017: SocketException: Host not found (authoritative)"}}}}
{"t":{"$date":"2023-02-17T16:28:02.359+00:00"},"s":"I", "c":"-", "id":4333222, "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"RSM received error response","attr":{"host":"test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017","error":"HostUnreachable: Error connecting to test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017 :: caused by :: Could not find address for test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017: SocketException: Host not found (authoritative)","replicaSet":"cfg","response":"{}"}}
{"t":{"$date":"2023-02-17T16:28:02.360+00:00"},"s":"I", "c":"NETWORK", "id":4712102, "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"Host failed in replica set","attr":{"replicaSet":"cfg","host":"test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Error connecting to test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017 :: caused by :: Could not find address for test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017: SocketException: Host not found (authoritative)"},"action":{"dropConnections":true,"requestImmediateCheck":true}}}
...
# oc -n psmdb logs pod/test-cluster-cfg-0
...
{"t":{"$date":"2023-02-17T16:29:41.317+00:00"},"s":"I", "c":"CONNPOOL", "id":22576, "ctx":"ReplNetwork","msg":"Connecting","attr":{"hostAndPort":"test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017"}}
{"t":{"$date":"2023-02-17T16:29:41.425+00:00"},"s":"I", "c":"CONNPOOL", "id":22576, "ctx":"MirrorMaestro","msg":"Connecting","attr":{"hostAndPort":"test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017"}}
{"t":{"$date":"2023-02-17T16:29:42.199+00:00"},"s":"I", "c":"NETWORK", "id":22943, "ctx":"listener","msg":"Connection accepted","attr":{"remote":"10.129.2.1:41070","uuid":"f167d474-21d5-4d88-ac08-c4042c9f26b7","connectionId":1438,"connectionCount":95}}
{"t":{"$date":"2023-02-17T16:29:42.200+00:00"},"s":"I", "c":"NETWORK", "id":22944, "ctx":"conn1438","msg":"Connection ended","attr":{"remote":"10.129.2.1:41070","uuid":"f167d474-21d5-4d88-ac08-c4042c9f26b7","connectionId":1438,"connectionCount":94}}
{"t":{"$date":"2023-02-17T16:29:42.317+00:00"},"s":"I", "c":"CONNPOOL", "id":22576, "ctx":"ReplNetwork","msg":"Connecting","attr":{"hostAndPort":"test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017"}}
{"t":{"$date":"2023-02-17T16:29:42.425+00:00"},"s":"I", "c":"CONNPOOL", "id":22576, "ctx":"MirrorMaestro","msg":"Connecting","attr":{"hostAndPort":"test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017"}}
{"t":{"$date":"2023-02-17T16:29:42.625+00:00"},"s":"I", "c":"REPL_HB", "id":23974, "ctx":"ReplCoord-16","msg":"Heartbeat failed after max retries","attr":{"target":"test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017","maxHeartbeatRetries":2,"error":{"code":6,"codeName":"HostUnreachable","errmsg":"Error connecting to test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017 :: caused by :: Could not find address for test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017: SocketException: Host not found (authoritative)"}}}
{"t":{"$date":"2023-02-17T16:29:43.317+00:00"},"s":"I", "c":"CONNPOOL", "id":22576, "ctx":"ReplNetwork","msg":"Connecting","attr":{"hostAndPort":"test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017"}}
{"t":{"$date":"2023-02-17T16:29:43.425+00:00"},"s":"I", "c":"CONNPOOL", "id":22576, "ctx":"MirrorMaestro","msg":"Connecting","attr":{"hostAndPort":"test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017"}}
{"t":{"$date":"2023-02-17T16:29:44.317+00:00"},"s":"I", "c":"CONNPOOL", "id":22576, "ctx":"ReplNetwork","msg":"Connecting","attr":{"hostAndPort":"test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017"}}
{"t":{"$date":"2023-02-17T16:29:44.425+00:00"},"s":"I", "c":"CONNPOOL", "id":22576, "ctx":"MirrorMaestro","msg":"Connecting","attr":{"hostAndPort":"test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017"}}
{"t":{"$date":"2023-02-17T16:29:44.630+00:00"},"s":"I", "c":"REPL_HB", "id":23974, "ctx":"ReplCoord-18","msg":"Heartbeat failed after max retries","attr":{"target":"test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017","maxHeartbeatRetries":2,"error":{"code":6,"codeName":"HostUnreachable","errmsg":"Error connecting to test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017 :: caused by :: Could not find address for test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017: SocketException: Host not found (authoritative)"}}}
...
While the operator logs report about the consequently problem while reconciling cluster:
# oc -n psmdb logs pod/percona-server-mongodb-operator-676fbd9bbc-gbmws
...
2023-02-17T15:25:31.601Z ERROR controller_psmdb failed to reconcile cluster {"Request.Namespace": "psmdb", "Request.Name": "test-cluster", "replset": "rs0", "error": "dial: ping mongo: server selection error: context deadline exceeded, current topology: { Type: ReplicaSetNoPrimary, Servers: [{ Addr: test-cluster-rs0-2.test-cluster-rs0.psmdb.svc.cluster.local:27017, Type: Unknown, Last error: connection() error occured during connection handshake: dial tcp: lookup test-cluster-rs0-2.test-cluster-rs0.psmdb.svc.cluster.local on 172.30.0.10:53: no such host }, { Addr: test-cluster-rs0-1.test-cluster-rs0.psmdb.svc.cluster.local:27017, Type: RSSecondary, Tag sets: serviceName=test-cluster,podName=test-cluster-rs0-1, Average RTT: 1154740 }, { Addr: test-cluster-rs0-0.test-cluster-rs0.psmdb.svc.cluster.local:27017, Type: Unknown, Last error: connection() error occured during connection handshake: dial tcp: lookup test-cluster-rs0-0.test-cluster-rs0.psmdb.svc.cluster.local on 172.30.0.10:53: no such host }, ] }", "errorVerbose": "server selection error: context deadline exceeded, current topology: { Type: ReplicaSetNoPrimary, Servers: [{ Addr: test-cluster-rs0-2.test-cluster-rs0.psmdb.svc.cluster.local:27017, Type: Unknown, Last error: connection() error occured during connection handshake: dial tcp: lookup test-cluster-rs0-2.test-cluster-rs0.psmdb.svc.cluster.local on 172.30.0.10:53: no such host }, { Addr: test-cluster-rs0-1.test-cluster-rs0.psmdb.svc.cluster.local:27017, Type: RSSecondary, Tag sets: serviceName=test-cluster,podName=test-cluster-rs0-1, Average RTT: 1154740 }, { Addr: test-cluster-rs0-0.test-cluster-rs0.psmdb.svc.cluster.local:27017, Type: Unknown, Last error: connection() error occured during connection handshake: dial tcp: lookup test-cluster-rs0-0.test-cluster-rs0.psmdb.svc.cluster.local on 172.30.0.10:53: no such host }, ] ...
...
2023-02-17T15:25:57.598Z INFO controller_psmdb Cluster state changed {"previous": "initializing", "current": "ready"}
2023-02-17T15:27:16.488Z ERROR controller_psmdb failed to reconcile cluster {"Request.Namespace": "psmdb", "Request.Name": "test-cluster", "replset": "cfg", "error": "dial: ping mongo: server selection error: context deadline exceeded, current topology: { Type: ReplicaSetNoPrimary, Servers: [{ Addr: test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017, Type: RSSecondary, Tag sets: serviceName=test-cluster,podName=test-cluster-cfg-2, Average RTT: 882945 }, { Addr: test-cluster-cfg-0.test-cluster-cfg.psmdb.svc.cluster.local:27017, Type: Unknown, Last error: connection() error occured during connection handshake: dial tcp: lookup test-cluster-cfg-0.test-cluster-cfg.psmdb.svc.cluster.local on 172.30.0.10:53: no such host }, { Addr: test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017, Type: RSSecondary, Tag sets: podName=test-cluster-cfg-1,serviceName=test-cluster, Average RTT: 835747 }, ] }", "errorVerbose": "server selection error: context deadline exceeded, current topology: { Type: ReplicaSetNoPrimary, Servers: [{ Addr: test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017, Type: RSSecondary, Tag sets: serviceName=test-cluster,podName=test-cluster-cfg-2, Average RTT: 882945 }, { Addr: test-cluster-cfg-0.test-cluster-cfg.psmdb.svc.cluster.local:27017, Type: Unknown, Last error: connection() error occured during connection handshake: dial tcp: lookup test-cluster-cfg-0.test-cluster-cfg.psmdb.svc.cluster.local on 172.30.0.10:53: no such host }, { Addr: test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017, Type: RSSecondary, Tag sets: podName=test-cluster-cfg-1,serviceName=test-cluster, Average RTT: 835747 } ...
I searched for a similar problem and I found this : Percona-mongo-psmdb-d is stuck in intializing status, unfortunaltely the post had non follow-up…
Any idea about a possible cause?
Thanks in advance