Frequent pmm-agent disconnections from pmm-server

Description:

Hi everyone,
I have a pmm-server running as a Kubernetes pod on a GKE cluster, and I have configured monitoring of a 3-node MongoDB cluster through pmm-agent. The MongoDB databases run on GCE instances within the same VPC network.
I notice frequent disconnections between pmm-agent and pmm-server every 2-3 days. I have enabled logging for pmm-agent but couldn't find any error messages there. I have also checked network connectivity between pmm-agent and pmm-server; there seems to be no issue there.
This has been a blocker for us in using Percona in our production environment.
I need help troubleshooting why these frequent disconnections are happening.

Version:

Both my pmm-server and pmm-agent are running the same version: 2.40.1

Any help would be appreciated.

Hi @Wali_Hasan.
Do you use Helm - Percona Monitoring and Management to deploy PMM in k8s, and Percona Operator for MongoDB or other tools?

Also, what is the user-facing and visual impact when the disconnect happens? Does monitoring stop entirely, or are there just gaps in the data/graphs?

Hi @Roma_Novikov,
Thanks for the response.
I do not use Helm or the Percona Operator for the installation in k8s.
I dockerised the PMM Server and then created the service using a simple k8s deployment YAML.
I installed the pmm-client on the MongoDB instances following the official Percona docs, and registered the client and the MongoDB service using the admin user.
The disconnect usually happens after 2-3 days, from one of the Mongo instances. Once it is disconnected, the monitoring stops, and I also get an alert for it: "Failed to establish two-way communication channel: No Agent with ID".
I'm stuck at this point as to why the connection breaks after some time.
However, after I restart the pmm-client service on the MongoDB instances, it connects back again.
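For reference, the recovery sequence on an affected instance is roughly the following (a sketch, assuming a package-based pmm-client install where the agent runs as the `pmm-agent` systemd unit):

```shell
# Check the agent's state before restarting
sudo systemctl status pmm-agent
pmm-admin status                      # shows Agent ID, Node ID and server connection

# Restart the agent to re-establish its channel to the PMM Server
sudo systemctl restart pmm-agent

# Watch the agent log to confirm it reconnects
sudo journalctl -u pmm-agent -f
```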

@Wali_Hasan, I would encourage you to use Percona Operator for MongoDB together with Helm - Percona Monitoring and Management

Our Operator runs optimally with PMM and ensures both of them work correctly.

Based on your error symptoms, I recommend checking the storage used for the pmm-client configuration. Maybe the files were destroyed when a new pod was created, or some other activity occurred, and the client lost its configuration.

@Roma_Novikov, why do I need the Percona Operator for MongoDB when my MongoDB is running on instances and not on Kubernetes as pods?
I installed the PMM Server on Kubernetes as a pod using Helm: percona-helm-charts/charts/pmm at main · percona/percona-helm-charts · GitHub.
I still get the pmm-agent down error when pmm-server is configured using Helm.
I looked into the logs at /srv/log/pmm-managed.log and I find this particular error very often: "Failed to get SSO details: PMM Server is not connected to Portal". I'm using admin credentials to register my pmm-client with pmm-server.
I also updated to the latest version, v2.41.0.
I have no option but to restart the pmm-agent on my MongoDB instances after I get the pmm-agent down error.
Please help, we really need this in production.

If you're not running MongoDB in a K8s environment then you do not need the Operator, so disregard that part. What I'm most suspicious of is something breaking the gRPC-over-HTTPS channel, which is how agents maintain connectivity with the PMM Server. But I'm always naturally suspicious of the network first (I can hear my old network buddies screaming at me, "it's not the network!"). Maybe SELinux, firewalld, or a software firewall between K8s and the VPC? I'll post this thread internally in case one of the devs has ideas, because even a short network disconnect should still recover pretty transparently.
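A quick way to sanity-check those suspects from an agent host might look like this (a sketch; `pmm.example.com` is a placeholder for your actual PMM Server address):

```shell
# Verify the agent host can reach the PMM Server over HTTPS
# (agents talk gRPC over the same HTTPS port, typically 443)
curl -k -sS -o /dev/null -w '%{http_code}\n' https://pmm.example.com/

# Check whether SELinux is enforcing on the client host
getenforce 2>/dev/null || echo "SELinux tools not installed"

# Check whether firewalld is active and what it currently allows
sudo firewall-cmd --state 2>/dev/null && sudo firewall-cmd --list-all
```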

There was an issue someone else had a while back where a job that cleaned up /tmp was nuking the /tmp/node_exporter directory. But if I recall correctly, that directory only holds credentials for the exporter to get data from the DB(s) being monitored, not for how the client connects to the server. So, are there any periodic jobs that would tamper with /tmp or possibly /usr/local/percona (the latter is where client-to-server credentials are stored)?
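To rule that out, one could grep the usual cleanup locations on the client hosts (a sketch for systemd-based distros; the agent config path shown is the default for PMM 2 package installs):

```shell
# Look for cron jobs that touch /tmp or the PMM credentials directory
sudo grep -r -l -e '/tmp' -e '/usr/local/percona' /etc/cron* /var/spool/cron 2>/dev/null

# systemd-tmpfiles can also age out files under /tmp; check its config
grep -r '/tmp' /usr/lib/tmpfiles.d/ /etc/tmpfiles.d/ 2>/dev/null

# Confirm the agent's config file is still present and non-empty
ls -l /usr/local/percona/pmm2/config/pmm-agent.yaml
```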

Hi Wali, do you mount the /srv directory in PMM Server as a persistent volume? It seems like K8s is restarting your PMM Server pod and the PG data are being lost. Could you check PMM Inventory to see whether your services still exist there?
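Both of those can be verified with something like the following (a sketch, assuming kubectl access; `pmm-server-0` is a placeholder for the actual pod name):

```shell
# Confirm /srv inside the PMM Server pod is backed by a persistent volume
kubectl describe pod pmm-server-0 | grep -A3 -i volumes
kubectl exec pmm-server-0 -- df -h /srv

# List registered services from any host with pmm-admin configured
pmm-admin inventory list services
```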

Failed to get SSO details: PMM Server is not connected to Portal

This error message isn't related to this problem.

Are the PMM Clients installed as a package or as a Docker container?

@steve.hoffman Hi, while setting up PMM monitoring I read about network-related issues such as reverse proxies, SELinux, firewalld, and VPC problems. I'm not using any kind of reverse proxy, and my MongoDB instances and k8s nodes are within the same internal VPC, which should rule out that class of network errors.
Also, there are no periodic jobs running as such.

@nurlan Hi,
Yes, I'm mounting the /srv path in my pmm-server pod as a persistent volume, so no data loss occurs.
K8s is not restarting the pmm-server pod; it has been running since I configured it, with no restarts since then.
I checked my PMM Inventory, and all the services exist there.
The PMM Clients are installed as packages on the MongoDB instances.
I do not suspect this to be a pmm-client issue but rather an issue on the pmm-server end.
I can share the Helm values file which I'm using to install pmm-server, just in case.