PMM 2.22.0 in k8s

Hello folks.

I have been running PMM in Docker on an AWS EC2 instance. A few weeks ago I wrote a Helm chart to deploy PMM in k8s. The chart is pretty simple and works fine in our dev environment with ~15 MySQL/ProxySQL instances.

I deployed PMM in PROD and switched all clients over to the new deployment. It worked OK, but not for long, only about an hour, and then the PMM server lost all clients. Every client shows:

~# pmm-admin list
Failed to get PMM Server parameters from local pmm-agent: pmm-agent is not connected to PMM Server.

And when I try to configure a client, it shows:

~# pmm-admin config 10.0.0.1 generic master.mysql.db --server-insecure-tls --server-url=http://user:pwd@pmm.domain.com/ --force
Warning: PMM Server requires TLS communications with client.
Checking local pmm-agent status...
pmm-agent is running.
Registering pmm-agent on PMM Server...
Failed to register pmm-agent on PMM Server: response from nginx: <html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx</center>
</body>
</html>
.
Please check pmm-managed logs..

Grafana on the PMM server keeps working fine, but the Inventory/Settings/Add Instance pages do not respond.
The logs I downloaded from pmm-server didn't show any errors I could make sense of, but I probably missed something.

At first I thought PMM did not have enough resources (CPU/memory), so I gave PMM a dedicated node, similar to the EC2 instance it had before, with no competition for resources. It didn't help; it actually had the opposite effect and failed even faster.

Any idea what is going on, and how can I figure out what I am doing wrong?

Hi @Stateros, welcome back to the Percona forums!

One thing I noticed is that in your --server-url you are specifying http; however, PMM Server ONLY accepts data from clients using TLS, so you MUST use --server-url=https://...
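For example, taking the command from your post and changing only the scheme (the node address, credentials, and hostname are yours, not values I am suggesting):

~# pmm-admin config 10.0.0.1 generic master.mysql.db --server-insecure-tls --server-url=https://user:pwd@pmm.domain.com/ --force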

In order for us to assist you, would you please post your logs.zip so we can review?


Hello Michael, I tried both http and https, and the output was the same. nginx inside the PMM pod was unavailable. If I restart the pod, everything works fine for a short time, 40-60 minutes, and then it fails again.


Understood @Stateros

Could you attach logs.zip to this post please? We’ll take a look at what’s going on inside the container. Thanks,


I'm afraid to chime in here because my K8s-fu is weak, BUT how are you handling storage in K8s for the /srv directory inside PMM Server? Since pods are ephemeral, that storage needs to be persistently mounted; otherwise, when K8s does its orchestration thing and spins up a new pod to replace a defunct one, all the registration data (and other persisted metrics) will be wiped, because the new pod is created from the image rather than cloned from the running pod.

That would only potentially explain why the server is no longer aware of the clients, though.
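In case it helps, here is a minimal sketch of what persisting /srv usually looks like (the claim name, size, and image tag are placeholders/assumptions on my part, and your chart may structure this quite differently):

    # hypothetical PersistentVolumeClaim for PMM's /srv data
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: pmm-data                  # placeholder name
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 50Gi               # example size only

    # matching fragment of the PMM server pod spec
    containers:
    - name: pmm-server
      image: percona/pmm-server:2.22.0
      volumeMounts:
      - name: pmm-data
        mountPath: /srv               # all of PMM's state lives here
    volumes:
    - name: pmm-data
      persistentVolumeClaim:
        claimName: pmm-data

If /srv is not backed by something like that, every pod replacement starts PMM from an empty database.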

As for the failure to register… I'm curious where the 504 error is coming from. The server logs @Michael_Coburn requested would include PMM's own nginx logs and might shed some light on it, but I am half wondering if this is coming from your K8s ingress (assuming another nginx instance). That leads me to believe your ingress/nginx config may be incomplete. There's another layer to consider: the PMM client talks to the PMM server both over https AND gRPC (encapsulated in https), which means your load balancers (all of them) need to be able to forward the gRPC traffic to the right place. If you look inside the PMM server's config (/etc/nginx/conf.d/pmm.conf) you'll see how we're proxying gRPC calls to the right ports.
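To make that concrete, here is the shape of an ingress definition that sidesteps re-proxying gRPC by passing TLS straight through to PMM's own nginx. It is only a sketch: the names are placeholders, it assumes an ingress-nginx controller running with --enable-ssl-passthrough, and I have not validated it against your chart:

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: pmm-server                          # placeholder
      annotations:
        # let PMM's internal nginx terminate TLS and route gRPC itself
        nginx.ingress.kubernetes.io/ssl-passthrough: "true"
        nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
    spec:
      rules:
      - host: pmm.domain.com                    # hostname from your --server-url
        http:
          paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: pmm-server                # placeholder service name
                port:
                  number: 443

If the ingress terminates TLS itself instead, then it (and anything in front of it) has to be able to carry the gRPC/HTTP2 traffic to the backend, or the gRPC half of the client connection can fail even while plain HTTPS still works.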

Again, I'm not savvy enough in K8s land to do much more than explain what we're using and how that might be clashing with the K8s infrastructure… but hopefully this gives you a few places to look.


@Michael_Coburn I cannot upload a zip file, so here are all the *.log files from the server, captured right when the issue happened.

@steve.hoffman I am using persistent storage for my pod, plus ingress + traefik, so restarting the pod, updating the server version, etc. does not cause any problems. I can share my Helm chart with the Percona team and the community if that would help.
Also, in /etc/nginx/conf.d/pmm.conf I made a small change to make the Prometheus port 9090 available to external clients such as our own Grafana, as described in External access to PMM prometheus in pmm 2.10.1.
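For completeness, the change is roughly an extra nginx proxy rule so external clients can reach the internal listener on 9090 (a sketch only, not my exact diff; it assumes the metrics service still listens on 127.0.0.1:9090 inside the container, and the real details are in the linked topic):

    # forward external requests to the internal Prometheus/VictoriaMetrics listener
    location /prometheus/ {
        proxy_pass http://127.0.0.1:9090/;
    }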

alertmanager.log (1.1 KB)
clickhouse-server.log (6.7 KB)
dashboard-upgrade.log (137 Bytes)
grafana.log (180.6 KB)
nginx.log (146.3 KB)
pmm-agent.log (195.6 KB)
pmm-managed.log (188.9 KB)
pmm-version.txt (150 Bytes)
postgresql.log (1.4 KB)
qan-api2.log (145.2 KB)
supervisorctl_status.log (1.0 KB)
supervisord.log (4.7 KB)
systemctl_status.log (71 Bytes)
victoriametrics.log (1 MB)
vmalert.log (13.5 KB)
