The server is not reachable from other namespaces, pmm-agent getting i/o timeout

Description:

I have installed the latest PMM Helm chart v1.3.11, and I am having trouble reaching the PMM service from other namespaces. The pod seems to be running fine, but there is a readiness error in the pod event log, and I don’t know if this is related to the reachability issue.

Steps to Reproduce:

Install the PMM Helm chart on an EKS cluster v1.29 using these values:

service:
  type: ClusterIP
nodeSelector:
  USAGE: MONITORING
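
The install itself was along these lines (the release name is an assumption; the namespace pmm matches the pod events below, and the repo URL is Percona’s standard helm-charts repo):

helm repo add percona https://percona.github.io/percona-helm-charts/
helm repo update
helm install pmm percona/pmm --version 1.3.11 -n pmm --create-namespace -f values.yaml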

Version:

Helm chart: 1.3.11
PMM image: 2.41.0

Logs:

pmm-agent log:
Failed to register pmm-agent on PMM Server: Post "https://monitoring-service.pmm.svc.cluster.local:443/v1/management/Node/Register": dial tcp 172.20.44.58:443: i/o timeout
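
Since the Service is a ClusterIP, the same timeout can be reproduced from any other namespace with a throwaway curl pod (pod name, image, namespace, and --max-time are illustrative):

kubectl run curl-test -n default --rm -it --restart=Never \
  --image=curlimages/curl --command -- \
  curl -kv --max-time 5 https://monitoring-service.pmm.svc.cluster.local:443/v1/readyz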

pod event log:

Events:
  Type     Reason     Age   From               Message
  ----     ------     ----  ----               -------
  Normal   Scheduled  78s   default-scheduler  Successfully assigned pmm/pmm-0 to ip-10-0-5-134.eu-central-1.compute.internal
  Normal   Pulled     68s   kubelet            Container image "percona/pmm-server:2.41.1" already present on machine
  Normal   Created    68s   kubelet            Created container pmm
  Normal   Started    68s   kubelet            Started container pmm
  Warning  Unhealthy  67s   kubelet            Readiness probe failed: HTTP probe failed with statuscode: 500
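
The probe definition can be dumped to confirm exactly what the kubelet calls (pod and namespace taken from the events above):

kubectl get pod pmm-0 -n pmm -o jsonpath='{.spec.containers[*].readinessProbe}'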

curl inside the PMM Server pod against the readiness endpoint:

bash-5.1# curl -Iv http://127.0.0.1/v1/readyz
*   Trying 127.0.0.1:80...
* Connected to 127.0.0.1 (127.0.0.1) port 80 (#0)
> HEAD /v1/readyz HTTP/1.1
> Host: 127.0.0.1
> User-Agent: curl/7.76.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 501 Not Implemented
HTTP/1.1 501 Not Implemented
< Server: nginx
Server: nginx
< Date: Wed, 28 Feb 2024 19:45:46 GMT
Date: Wed, 28 Feb 2024 19:45:46 GMT
< Content-Type: application/json
Content-Type: application/json
< Content-Length: 87
Content-Length: 87
< Connection: keep-alive
Connection: keep-alive
< Strict-Transport-Security: max-age=63072000; includeSubdomains;
Strict-Transport-Security: max-age=63072000; includeSubdomains;

<
* Connection #0 to host 127.0.0.1 left intact

Expected Result:

PMM Server should be reachable from other namespaces.

Actual Result:

pmm-agent can’t be registered.

Hi @Abdelnasser_FRIED

The HEAD HTTP method is not supported by PMM; please send a GET request instead.
Could you share logs from the /srv/logs/ directory on the PMM Server pod?
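
For example, inside the PMM Server pod:

curl -s http://127.0.0.1/v1/readyz
tail -n 100 /srv/logs/pmm-managed.log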

Hello @nurlan and thank you for the reply,

Here is the pmm-managed.log:

bash-5.1# tail -f /srv/logs/pmm-managed.log
time="2024-02-29T06:33:53.205+00:00" level=info msg="Starting RPC /server.Server/Readiness ..." request=81a8c8f0-d6cc-11ee-9478-922a5a178739
time="2024-02-29T06:33:53.207+00:00" level=info msg="RPC /server.Server/Readiness done in 1.702188ms." request=81a8c8f0-d6cc-11ee-9478-922a5a178739
time="2024-02-29T06:34:03.206+00:00" level=info msg="Starting RPC /server.Server/Readiness ..." request=879eb8e4-d6cc-11ee-9478-922a5a178739
time="2024-02-29T06:34:03.208+00:00" level=info msg="RPC /server.Server/Readiness done in 1.826162ms." request=879eb8e4-d6cc-11ee-9478-922a5a178739
time="2024-02-29T06:34:13.205+00:00" level=info msg="Starting RPC /server.Server/Readiness ..." request=8d948af2-d6cc-11ee-9478-922a5a178739
time="2024-02-29T06:34:13.207+00:00" level=info msg="RPC /server.Server/Readiness done in 2.054546ms." request=8d948af2-d6cc-11ee-9478-922a5a178739
time="2024-02-29T06:34:23.205+00:00" level=info msg="Starting RPC /server.Server/Readiness ..." request=938a633a-d6cc-11ee-9478-922a5a178739
time="2024-02-29T06:34:23.207+00:00" level=info msg="RPC /server.Server/Readiness done in 1.932654ms." request=938a633a-d6cc-11ee-9478-922a5a178739
time="2024-02-29T06:34:33.205+00:00" level=info msg="Starting RPC /server.Server/Readiness ..." request=99804568-d6cc-11ee-9478-922a5a178739
time="2024-02-29T06:34:33.207+00:00" level=info msg="RPC /server.Server/Readiness done in 1.887562ms." request=99804568-d6cc-11ee-9478-922a5a178739
time="2024-02-29T06:34:43.206+00:00" level=info msg="Starting RPC /server.Server/Readiness ..." request=9f763a29-d6cc-11ee-9478-922a5a178739
time="2024-02-29T06:34:43.208+00:00" level=info msg="RPC /server.Server/Readiness done in 2.27617ms." request=9f763a29-d6cc-11ee-9478-922a5a178739

Nginx logs:

10.0.5.134 - - [29/Feb/2024:06:32:23 +0000] "GET /v1/readyz HTTP/1.1" 200 2 "-" "kube-probe/1.29+" "-"
10.0.48.46 - - [29/Feb/2024:06:32:23 +0000] "GET / HTTP/1.1" 302 138 "-" "ELB-HealthChecker/2.0" "-"
10.0.50.71 - - [29/Feb/2024:06:32:23 +0000] "GET / HTTP/1.1" 302 138 "-" "ELB-HealthChecker/2.0" "-"
10.0.5.134 - - [29/Feb/2024:06:32:33 +0000] "GET /v1/readyz HTTP/1.1" 200 2 "-" "kube-probe/1.29+" "-"
10.0.49.102 - - [29/Feb/2024:06:32:37 +0000] "GET / HTTP/1.1" 302 138 "-" "ELB-HealthChecker/2.0" "-"
10.0.48.46 - - [29/Feb/2024:06:32:38 +0000] "GET / HTTP/1.1" 302 138 "-" "ELB-HealthChecker/2.0" "-"
10.0.50.71 - - [29/Feb/2024:06:32:38 +0000] "GET / HTTP/1.1" 302 138 "-" "ELB-HealthChecker/2.0" "-"
10.0.5.134 - - [29/Feb/2024:06:32:43 +0000] "GET /v1/readyz HTTP/1.1" 200 2 "-" "kube-probe/1.29+" "-"
10.0.49.102 - - [29/Feb/2024:06:32:52 +0000] "GET / HTTP/1.1" 302 138 "-" "ELB-HealthChecker/2.0" "-"
10.0.5.134 - - [29/Feb/2024:06:32:53 +0000] "GET /v1/readyz HTTP/1.1" 200 2 "-" "kube-probe/1.29+" "-"
10.0.48.46 - - [29/Feb/2024:06:32:53 +0000] "GET / HTTP/1.1" 302 138 "-" "ELB-HealthChecker/2.0" "-"
10.0.50.71 - - [29/Feb/2024:06:32:53 +0000] "GET / HTTP/1.1" 302 138 "-" "ELB-HealthChecker/2.0" "-"
10.0.5.134 - - [29/Feb/2024:06:33:03 +0000] "GET /v1/readyz HTTP/1.1" 200 2 "-" "kube-probe/1.29+" "-"
10.0.49.102 - - [29/Feb/2024:06:33:07 +0000] "GET / HTTP/1.1" 302 138 "-" "ELB-HealthChecker/2.0" "-"
10.0.48.46 - - [29/Feb/2024:06:33:08 +0000] "GET / HTTP/1.1" 302 138 "-" "ELB-HealthChecker/2.0" "-"
10.0.50.71 - - [29/Feb/2024:06:33:08 +0000] "GET / HTTP/1.1" 302 138 "-" "ELB-HealthChecker/2.0" "-"
10.0.5.134 - - [29/Feb/2024:06:33:13 +0000] "GET /v1/readyz HTTP/1.1" 200 2 "-" "kube-probe/1.29+" "-"
10.0.49.102 - - [29/Feb/2024:06:33:22 +0000] "GET / HTTP/1.1" 302 138 "-" "ELB-HealthChecker/2.0" "-"
10.0.5.134 - - [29/Feb/2024:06:33:23 +0000] "GET /v1/readyz HTTP/1.1" 200 2 "-" "kube-probe/1.29+" "-"
10.0.48.46 - - [29/Feb/2024:06:33:23 +0000] "GET / HTTP/1.1" 302 138 "-" "ELB-HealthChecker/2.0" "-"
10.0.50.71 - - [29/Feb/2024:06:33:23 +0000] "GET / HTTP/1.1" 302 138 "-" "ELB-HealthChecker/2.0" "-"
10.0.5.134 - - [29/Feb/2024:06:33:33 +0000] "GET /v1/readyz HTTP/1.1" 200 2 "-" "kube-probe/1.29+" "-"
10.0.49.102 - - [29/Feb/2024:06:33:37 +0000] "GET / HTTP/1.1" 302 138 "-" "ELB-HealthChecker/2.0" "-"
10.0.48.46 - - [29/Feb/2024:06:33:38 +0000] "GET / HTTP/1.1" 302 138 "-" "ELB-HealthChecker/2.0" "-"
10.0.50.71 - - [29/Feb/2024:06:33:39 +0000] "GET / HTTP/1.1" 302 138 "-" "ELB-HealthChecker/2.0" "-"
10.0.5.134 - - [29/Feb/2024:06:33:43 +0000] "GET /v1/readyz HTTP/1.1" 200 2 "-" "kube-probe/1.29+" "-"
10.0.49.102 - - [29/Feb/2024:06:33:52 +0000] "GET / HTTP/1.1" 302 138 "-" "ELB-HealthChecker/2.0" "-"
10.0.5.134 - - [29/Feb/2024:06:33:53 +0000] "GET /v1/readyz HTTP/1.1" 200 2 "-" "kube-probe/1.29+" "-"
10.0.48.46 - - [29/Feb/2024:06:33:53 +0000] "GET / HTTP/1.1" 302 138 "-" "ELB-HealthChecker/2.0" "-"
10.0.50.71 - - [29/Feb/2024:06:33:54 +0000] "GET / HTTP/1.1" 302 138 "-" "ELB-HealthChecker/2.0" "-"
10.0.5.134 - - [29/Feb/2024:06:34:03 +0000] "GET /v1/readyz HTTP/1.1" 200 2 "-" "kube-probe/1.29+" "-"
10.0.49.102 - - [29/Feb/2024:06:34:07 +0000] "GET / HTTP/1.1" 302 138 "-" "ELB-HealthChecker/2.0" "-"
10.0.48.46 - - [29/Feb/2024:06:34:08 +0000] "GET / HTTP/1.1" 302 138 "-" "ELB-HealthChecker/2.0" "-"
10.0.50.71 - - [29/Feb/2024:06:34:09 +0000] "GET / HTTP/1.1" 302 138 "-" "ELB-HealthChecker/2.0" "-"
10.0.5.134 - - [29/Feb/2024:06:34:13 +0000] "GET /v1/readyz HTTP/1.1" 200 2 "-" "kube-probe/1.29+" "-"
10.0.49.102 - - [29/Feb/2024:06:34:22 +0000] "GET / HTTP/1.1" 302 138 "-" "ELB-HealthChecker/2.0" "-"
10.0.5.134 - - [29/Feb/2024:06:34:23 +0000] "GET /v1/readyz HTTP/1.1" 200 2 "-" "kube-probe/1.29+" "-"
10.0.48.46 - - [29/Feb/2024:06:34:23 +0000] "GET / HTTP/1.1" 302 138 "-" "ELB-HealthChecker/2.0" "-"
10.0.50.71 - - [29/Feb/2024:06:34:24 +0000] "GET / HTTP/1.1" 302 138 "-" "ELB-HealthChecker/2.0" "-"
10.0.5.134 - - [29/Feb/2024:06:34:33 +0000] "GET /v1/readyz HTTP/1.1" 200 2 "-" "kube-probe/1.29+" "-"

Are there specific service logs I can share? I think this is a networking issue.

As far as I can see, everything is fine from the PMM Server side; you are probably right and it’s a networking issue in K8s. Can you curl the PMM Server from the PMM Client?
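
Something like this from the client pod should show whether the Service is reachable at all (pod and namespace names are placeholders, and it assumes curl is available in the client image):

kubectl exec -n <client-namespace> <pmm-client-pod> -- \
  curl -kv --max-time 5 https://monitoring-service.pmm.svc.cluster.local:443/v1/readyz

It is also worth checking that the Service has endpoints and that no NetworkPolicy blocks cross-namespace traffic:

kubectl get endpoints -n pmm monitoring-service
kubectl get networkpolicy --all-namespaces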