Percona Server MongoDB stuck in initializing

Hi, I’m using the operator percona-server-mongodb-operator.v1.13.1, installed via the OCP marketplace, and some pods keep going to CrashLoopBackOff and restarting during the initializing phase:

# oc -n psmdb get po
NAME                                               READY   STATUS             RESTARTS          AGE
percona-server-mongodb-operator-676fbd9bbc-gbmws   1/1     Running            4 (19h ago)       25h
test-cluster-cfg-0                                 2/2     Running            342 (6m15s ago)   23h
test-cluster-cfg-1                                 1/2     CrashLoopBackOff   257 (3m53s ago)   23h
test-cluster-cfg-2                                 1/2     CrashLoopBackOff   317 (66s ago)     23h
test-cluster-mongos-0                              0/1     Running            68 (50m ago)      24h
test-cluster-mongos-1                              0/1     Running            58 (80s ago)      24h
test-cluster-mongos-2                              0/1     Running            63 (49m ago)      23h
test-cluster-rs0-0                                 1/2     CrashLoopBackOff   274 (37s ago)     23h
test-cluster-rs0-1                                 1/2     CrashLoopBackOff   339 (3m49s ago)   24h
test-cluster-rs0-2                                 1/2     CrashLoopBackOff   289 (2m45s ago)   23h

The pod logs show some strange errors:

# oc -n psmdb logs test-cluster-rs0-1
...

{"t":{"$date":"2023-02-17T16:28:01.856+00:00"},"s":"I",  "c":"-",        "id":4333222, "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"RSM received error response","attr":{"host":"test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017","error":"HostUnreachable: Error connecting to test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017 :: caused by :: Could not find address for test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017: SocketException: Host not found (authoritative)","replicaSet":"cfg","response":"{}"}}
{"t":{"$date":"2023-02-17T16:28:01.856+00:00"},"s":"I",  "c":"NETWORK",  "id":4712102, "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"Host failed in replica set","attr":{"replicaSet":"cfg","host":"test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Error connecting to test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017 :: caused by :: Could not find address for test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017: SocketException: Host not found (authoritative)"},"action":{"dropConnections":true,"requestImmediateCheck":true}}}
{"t":{"$date":"2023-02-17T16:28:01.858+00:00"},"s":"I",  "c":"-",        "id":4333222, "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"RSM received error response","attr":{"host":"test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017","error":"HostUnreachable: Error connecting to test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017 :: caused by :: Could not find address for test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017: SocketException: Host not found (authoritative)","replicaSet":"cfg","response":"{}"}}
{"t":{"$date":"2023-02-17T16:28:01.858+00:00"},"s":"I",  "c":"NETWORK",  "id":4712102, "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"Host failed in replica set","attr":{"replicaSet":"cfg","host":"test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Error connecting to test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017 :: caused by :: Could not find address for test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017: SocketException: Host not found (authoritative)"},"action":{"dropConnections":true,"requestImmediateCheck":false,"outcome":{"host":"test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017","success":false,"errorMessage":"HostUnreachable: Error connecting to test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017 :: caused by :: Could not find address for test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017: SocketException: Host not found (authoritative)"}}}}
{"t":{"$date":"2023-02-17T16:28:01.897+00:00"},"s":"I",  "c":"CONNPOOL", "id":22576,   "ctx":"ShardRegistry","msg":"Connecting","attr":{"hostAndPort":"test-cluster-rs0-0.test-cluster-rs0.psmdb.svc.cluster.local:27017"}}
{"t":{"$date":"2023-02-17T16:28:02.358+00:00"},"s":"I",  "c":"-",        "id":4333222, "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"RSM received error response","attr":{"host":"test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017","error":"HostUnreachable: Error connecting to test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017 :: caused by :: Could not find address for test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017: SocketException: Host not found (authoritative)","replicaSet":"cfg","response":"{}"}}
{"t":{"$date":"2023-02-17T16:28:02.358+00:00"},"s":"I",  "c":"NETWORK",  "id":4712102, "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"Host failed in replica set","attr":{"replicaSet":"cfg","host":"test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Error connecting to test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017 :: caused by :: Could not find address for test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017: SocketException: Host not found (authoritative)"},"action":{"dropConnections":true,"requestImmediateCheck":false,"outcome":{"host":"test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017","success":false,"errorMessage":"HostUnreachable: Error connecting to test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017 :: caused by :: Could not find address for test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017: SocketException: Host not found (authoritative)"}}}}
{"t":{"$date":"2023-02-17T16:28:02.359+00:00"},"s":"I",  "c":"-",        "id":4333222, "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"RSM received error response","attr":{"host":"test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017","error":"HostUnreachable: Error connecting to test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017 :: caused by :: Could not find address for test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017: SocketException: Host not found (authoritative)","replicaSet":"cfg","response":"{}"}}
{"t":{"$date":"2023-02-17T16:28:02.360+00:00"},"s":"I",  "c":"NETWORK",  "id":4712102, "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"Host failed in replica set","attr":{"replicaSet":"cfg","host":"test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Error connecting to test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017 :: caused by :: Could not find address for test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017: SocketException: Host not found (authoritative)"},"action":{"dropConnections":true,"requestImmediateCheck":true}}}

...
#  oc -n psmdb logs pod/test-cluster-cfg-0
...
{"t":{"$date":"2023-02-17T16:29:41.317+00:00"},"s":"I",  "c":"CONNPOOL", "id":22576,   "ctx":"ReplNetwork","msg":"Connecting","attr":{"hostAndPort":"test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017"}}
{"t":{"$date":"2023-02-17T16:29:41.425+00:00"},"s":"I",  "c":"CONNPOOL", "id":22576,   "ctx":"MirrorMaestro","msg":"Connecting","attr":{"hostAndPort":"test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017"}}
{"t":{"$date":"2023-02-17T16:29:42.199+00:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"Connection accepted","attr":{"remote":"10.129.2.1:41070","uuid":"f167d474-21d5-4d88-ac08-c4042c9f26b7","connectionId":1438,"connectionCount":95}}
{"t":{"$date":"2023-02-17T16:29:42.200+00:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn1438","msg":"Connection ended","attr":{"remote":"10.129.2.1:41070","uuid":"f167d474-21d5-4d88-ac08-c4042c9f26b7","connectionId":1438,"connectionCount":94}}
{"t":{"$date":"2023-02-17T16:29:42.317+00:00"},"s":"I",  "c":"CONNPOOL", "id":22576,   "ctx":"ReplNetwork","msg":"Connecting","attr":{"hostAndPort":"test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017"}}
{"t":{"$date":"2023-02-17T16:29:42.425+00:00"},"s":"I",  "c":"CONNPOOL", "id":22576,   "ctx":"MirrorMaestro","msg":"Connecting","attr":{"hostAndPort":"test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017"}}
{"t":{"$date":"2023-02-17T16:29:42.625+00:00"},"s":"I",  "c":"REPL_HB",  "id":23974,   "ctx":"ReplCoord-16","msg":"Heartbeat failed after max retries","attr":{"target":"test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017","maxHeartbeatRetries":2,"error":{"code":6,"codeName":"HostUnreachable","errmsg":"Error connecting to test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017 :: caused by :: Could not find address for test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017: SocketException: Host not found (authoritative)"}}}
{"t":{"$date":"2023-02-17T16:29:43.317+00:00"},"s":"I",  "c":"CONNPOOL", "id":22576,   "ctx":"ReplNetwork","msg":"Connecting","attr":{"hostAndPort":"test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017"}}
{"t":{"$date":"2023-02-17T16:29:43.425+00:00"},"s":"I",  "c":"CONNPOOL", "id":22576,   "ctx":"MirrorMaestro","msg":"Connecting","attr":{"hostAndPort":"test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017"}}
{"t":{"$date":"2023-02-17T16:29:44.317+00:00"},"s":"I",  "c":"CONNPOOL", "id":22576,   "ctx":"ReplNetwork","msg":"Connecting","attr":{"hostAndPort":"test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017"}}
{"t":{"$date":"2023-02-17T16:29:44.425+00:00"},"s":"I",  "c":"CONNPOOL", "id":22576,   "ctx":"MirrorMaestro","msg":"Connecting","attr":{"hostAndPort":"test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017"}}
{"t":{"$date":"2023-02-17T16:29:44.630+00:00"},"s":"I",  "c":"REPL_HB",  "id":23974,   "ctx":"ReplCoord-18","msg":"Heartbeat failed after max retries","attr":{"target":"test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017","maxHeartbeatRetries":2,"error":{"code":6,"codeName":"HostUnreachable","errmsg":"Error connecting to test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017 :: caused by :: Could not find address for test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017: SocketException: Host not found (authoritative)"}}}
...

Meanwhile, the operator logs report the resulting problem while reconciling the cluster:

# oc -n psmdb logs pod/percona-server-mongodb-operator-676fbd9bbc-gbmws
...
2023-02-17T15:25:31.601Z        ERROR   controller_psmdb        failed to reconcile cluster     {"Request.Namespace": "psmdb", "Request.Name": "test-cluster", "replset": "rs0", "error": "dial: ping mongo: server selection error: context deadline exceeded, current topology: { Type: ReplicaSetNoPrimary, Servers: [{ Addr: test-cluster-rs0-2.test-cluster-rs0.psmdb.svc.cluster.local:27017, Type: Unknown, Last error: connection() error occured during connection handshake: dial tcp: lookup test-cluster-rs0-2.test-cluster-rs0.psmdb.svc.cluster.local on 172.30.0.10:53: no such host }, { Addr: test-cluster-rs0-1.test-cluster-rs0.psmdb.svc.cluster.local:27017, Type: RSSecondary, Tag sets: serviceName=test-cluster,podName=test-cluster-rs0-1, Average RTT: 1154740 }, { Addr: test-cluster-rs0-0.test-cluster-rs0.psmdb.svc.cluster.local:27017, Type: Unknown, Last error: connection() error occured during connection handshake: dial tcp: lookup test-cluster-rs0-0.test-cluster-rs0.psmdb.svc.cluster.local on 172.30.0.10:53: no such host }, ] }", "errorVerbose": "server selection error: context deadline exceeded, current topology: { Type: ReplicaSetNoPrimary, Servers: [{ Addr: test-cluster-rs0-2.test-cluster-rs0.psmdb.svc.cluster.local:27017, Type: Unknown, Last error: connection() error occured during connection handshake: dial tcp: lookup test-cluster-rs0-2.test-cluster-rs0.psmdb.svc.cluster.local on 172.30.0.10:53: no such host }, { Addr: test-cluster-rs0-1.test-cluster-rs0.psmdb.svc.cluster.local:27017, Type: RSSecondary, Tag sets: serviceName=test-cluster,podName=test-cluster-rs0-1, Average RTT: 1154740 }, { Addr: test-cluster-rs0-0.test-cluster-rs0.psmdb.svc.cluster.local:27017, Type: Unknown, Last error: connection() error occured during connection handshake: dial tcp: lookup test-cluster-rs0-0.test-cluster-rs0.psmdb.svc.cluster.local on 172.30.0.10:53: no such host }, ] ...
...
2023-02-17T15:25:57.598Z        INFO    controller_psmdb        Cluster state changed   {"previous": "initializing", "current": "ready"}
2023-02-17T15:27:16.488Z        ERROR   controller_psmdb        failed to reconcile cluster     {"Request.Namespace": "psmdb", "Request.Name": "test-cluster", "replset": "cfg", "error": "dial: ping mongo: server selection error: context deadline exceeded, current topology: { Type: ReplicaSetNoPrimary, Servers: [{ Addr: test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017, Type: RSSecondary, Tag sets: serviceName=test-cluster,podName=test-cluster-cfg-2, Average RTT: 882945 }, { Addr: test-cluster-cfg-0.test-cluster-cfg.psmdb.svc.cluster.local:27017, Type: Unknown, Last error: connection() error occured during connection handshake: dial tcp: lookup test-cluster-cfg-0.test-cluster-cfg.psmdb.svc.cluster.local on 172.30.0.10:53: no such host }, { Addr: test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017, Type: RSSecondary, Tag sets: podName=test-cluster-cfg-1,serviceName=test-cluster, Average RTT: 835747 }, ] }", "errorVerbose": "server selection error: context deadline exceeded, current topology: { Type: ReplicaSetNoPrimary, Servers: [{ Addr: test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local:27017, Type: RSSecondary, Tag sets: serviceName=test-cluster,podName=test-cluster-cfg-2, Average RTT: 882945 }, { Addr: test-cluster-cfg-0.test-cluster-cfg.psmdb.svc.cluster.local:27017, Type: Unknown, Last error: connection() error occured during connection handshake: dial tcp: lookup test-cluster-cfg-0.test-cluster-cfg.psmdb.svc.cluster.local on 172.30.0.10:53: no such host }, { Addr: test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017, Type: RSSecondary, Tag sets: podName=test-cluster-cfg-1,serviceName=test-cluster, Average RTT: 835747 } ...

I searched for a similar problem and found this: Percona-mongo-psmdb-d is stuck in intializing status. Unfortunately, that post had no follow-up…

Any idea about a possible cause?

Thanks in advance

Hey @loudgefly ,

Thanks for sharing. Let’s start at the infra level: is DNS healthy?
For example this error:

Could not find address for test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local:27017: SocketException: Host not found (authoritative)

indicates that the DNS entry for that pod was not created.
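To see the same failure outside mongod, a minimal sketch like the following (hostnames copied from the logs above; run it from a pod in the psmdb namespace) performs the same kind of name lookup the driver does:

```python
import socket

# Per-pod DNS names of the cfg replica set, as they appear in the logs above.
hosts = [
    "test-cluster-cfg-0.test-cluster-cfg.psmdb.svc.cluster.local",
    "test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local",
    "test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local",
]

def resolves(host, port=27017):
    """Return True if the name resolves, False on an NXDOMAIN/resolver error."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror:
        # This is the condition mongod reports as "Host not found (authoritative)".
        return False

for h in hosts:
    print(h, "resolves" if resolves(h) else "NOT FOUND")
```

If any member prints NOT FOUND, the headless-service record for that pod is missing at that moment.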

Can you please also list all the Service objects? oc get services

Hi @Sergey_Pronin,
here’s the services list:

oc -n psmdb get services
NAME                  TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)     AGE
test-cluster-cfg      ClusterIP   None             <none>        27017/TCP   3d17h
test-cluster-mongos   ClusterIP   172.30.121.191   <none>        27017/TCP   3d17h
test-cluster-rs0      ClusterIP   None             <none>        27017/TCP   3d17h

Regarding the DNS, everything seems to be working properly:

oc get clusteroperator/dns
NAME   VERSION                          AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
dns    4.10.0-0.okd-2022-07-09-073606   True        False         False      74d

And, if I log in to an rs or cfg pod, I can resolve the pod hostnames:

oc -n psmdb exec -it test-cluster-rs0-0 -- /bin/bash
Defaulted container "mongod" out of: mongod, backup-agent, mongo-init (init)
bash-4.4$ ping test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local
PING test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local (10.128.2.14) 56(84) bytes of data.
64 bytes from test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local (10.128.2.14): icmp_seq=1 ttl=64 time=0.524 ms
64 bytes from test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local (10.128.2.14): icmp_seq=2 ttl=64 time=0.175 ms
64 bytes from test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local (10.128.2.14): icmp_seq=3 ttl=64 time=0.191 ms
^C
--- test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2012ms
rtt min/avg/max/mdev = 0.175/0.296/0.524/0.162 ms
bash-4.4$ ping test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local
PING test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local (10.131.0.14) 56(84) bytes of data.
64 bytes from test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local (10.131.0.14): icmp_seq=1 ttl=64 time=0.396 ms
64 bytes from test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local (10.131.0.14): icmp_seq=2 ttl=64 time=0.164 ms
^C
--- test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1025ms
rtt min/avg/max/mdev = 0.164/0.280/0.396/0.116 ms
bash-4.4$ ping test-cluster-cfg-0.test-cluster-cfg.psmdb.svc.cluster.local
PING test-cluster-cfg-0.test-cluster-cfg.psmdb.svc.cluster.local (10.129.2.8) 56(84) bytes of data.
64 bytes from test-cluster-cfg-0.test-cluster-cfg.psmdb.svc.cluster.local (10.129.2.8): icmp_seq=1 ttl=64 time=0.208 ms
64 bytes from test-cluster-cfg-0.test-cluster-cfg.psmdb.svc.cluster.local (10.129.2.8): icmp_seq=2 ttl=64 time=0.045 ms
64 bytes from test-cluster-cfg-0.test-cluster-cfg.psmdb.svc.cluster.local (10.129.2.8): icmp_seq=3 ttl=64 time=0.044 ms
64 bytes from test-cluster-cfg-0.test-cluster-cfg.psmdb.svc.cluster.local (10.129.2.8): icmp_seq=4 ttl=64 time=0.044 ms
^C
--- test-cluster-cfg-0.test-cluster-cfg.psmdb.svc.cluster.local ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3098ms
rtt min/avg/max/mdev = 0.044/0.085/0.208/0.071 ms
bash-4.4$ ping test-cluster-rs0-0.test-cluster-rs0.psmdb.svc.cluster.local
PING test-cluster-rs0-0.test-cluster-rs0.psmdb.svc.cluster.local (10.129.2.9) 56(84) bytes of data.
64 bytes from test-cluster-rs0-0.test-cluster-rs0.psmdb.svc.cluster.local (10.129.2.9): icmp_seq=1 ttl=64 time=0.020 ms
64 bytes from test-cluster-rs0-0.test-cluster-rs0.psmdb.svc.cluster.local (10.129.2.9): icmp_seq=2 ttl=64 time=0.021 ms
64 bytes from test-cluster-rs0-0.test-cluster-rs0.psmdb.svc.cluster.local (10.129.2.9): icmp_seq=3 ttl=64 time=0.020 ms
^C
--- test-cluster-rs0-0.test-cluster-rs0.psmdb.svc.cluster.local ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2057ms
rtt min/avg/max/mdev = 0.020/0.020/0.021/0.003 ms

And the same from a cfg pod:

oc -n psmdb exec -it test-cluster-cfg-1 /bin/bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulted container "mongod" out of: mongod, backup-agent, mongo-init (init)
bash-4.4$ ping test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local
PING test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local (10.131.0.14) 56(84) bytes of data.
64 bytes from test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local (10.131.0.14): icmp_seq=1 ttl=64 time=0.024 ms
64 bytes from test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local (10.131.0.14): icmp_seq=2 ttl=64 time=0.023 ms
64 bytes from test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local (10.131.0.14): icmp_seq=3 ttl=64 time=0.038 ms
^C
--- test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2088ms
rtt min/avg/max/mdev = 0.023/0.028/0.038/0.008 ms
bash-4.4$ ping test-cluster-cfg-0.test-cluster-cfg.psmdb.svc.cluster.local
PING test-cluster-cfg-0.test-cluster-cfg.psmdb.svc.cluster.local (10.129.2.8) 56(84) bytes of data.
64 bytes from test-cluster-cfg-0.test-cluster-cfg.psmdb.svc.cluster.local (10.129.2.8): icmp_seq=1 ttl=64 time=0.684 ms
64 bytes from test-cluster-cfg-0.test-cluster-cfg.psmdb.svc.cluster.local (10.129.2.8): icmp_seq=2 ttl=64 time=0.198 ms
64 bytes from test-cluster-cfg-0.test-cluster-cfg.psmdb.svc.cluster.local (10.129.2.8): icmp_seq=3 ttl=64 time=0.218 ms
^C
--- test-cluster-cfg-0.test-cluster-cfg.psmdb.svc.cluster.local ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2066ms
rtt min/avg/max/mdev = 0.198/0.366/0.684/0.225 ms
bash-4.4$ ping test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local
PING test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local (10.128.2.14) 56(84) bytes of data.
64 bytes from test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local (10.128.2.14): icmp_seq=1 ttl=64 time=0.389 ms
64 bytes from test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local (10.128.2.14): icmp_seq=2 ttl=64 time=0.147 ms
64 bytes from test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local (10.128.2.14): icmp_seq=3 ttl=64 time=0.205 ms
64 bytes from test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local (10.128.2.14): icmp_seq=4 ttl=64 time=0.141 ms
64 bytes from test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local (10.128.2.14): icmp_seq=5 ttl=64 time=0.180 ms
64 bytes from test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local (10.128.2.14): icmp_seq=6 ttl=64 time=0.135 ms
64 bytes from test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local (10.128.2.14): icmp_seq=7 ttl=64 time=0.123 ms
^C
--- test-cluster-cfg-2.test-cluster-cfg.psmdb.svc.cluster.local ping statistics ---
7 packets transmitted, 7 received, 0% packet loss, time 6121ms
rtt min/avg/max/mdev = 0.123/0.188/0.389/0.087 ms

All pod hostnames resolve.
The pods are flapping up and down, though, so obviously some of them are not resolvable while in the CrashLoopBackOff state.
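As a rough sanity check on how often that happens, a small probe like this (hostname copied from the logs above, purely illustrative; run it inside the psmdb namespace) counts intermittent resolution failures over time:

```python
import socket
import time

def probe(host, attempts=10, interval=1.0):
    """Resolve `host` repeatedly; return (successes, failures)."""
    ok = bad = 0
    for _ in range(attempts):
        try:
            socket.getaddrinfo(host, 27017)
            ok += 1
        except socket.gaierror:
            bad += 1
        time.sleep(interval)
    return ok, bad

# Watch one of the flapping cfg members for ~10 seconds.
print(probe("test-cluster-cfg-1.test-cluster-cfg.psmdb.svc.cluster.local"))
```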

Thanks a lot for the fast response; let me know if I can do anything to investigate further…

After struggling for some hours investigating, I tried destroying and re-creating the cluster, and it worked flawlessly…

Sorry to bother you, and thanks a lot @Sergey_Pronin

Sorry to bother you twice, @Sergey_Pronin,

after about an hour, the pods are flapping up and down again…

The logs report the same errors…

I double-checked all the system pods, node connectivity, etc. I don’t know what else to check at this point.