Agent - Server connectivity

Hello all,
In PMM v2, I know that pmm-admin check-network was removed and the list and status commands replaced it. But I have a connectivity problem: I can successfully add an agent to the server but cannot get any statistics about my agent, such as CPU, RAM, etc. I'm sure there are network problems, and I can see that in the pmm-admin list output. I have to pin down the problem before I can solve it. How can I find more details about the network communication between agent and server? I have 2 firewalls between the server and the agent, and I need clues to work with my network engineer.

Here are the logs of pmm-admin list and pmm-admin status --debug
[root@linuxmachine log]# pmm-admin list
Post https://pmmserver:443/v1/inventory/Agents/List: read tcp pmmagent:49816->pmmserver:443: read: connection reset by peer

[root@linuxmachine log]# pmm-admin status --debug
DEBUG 2020-04-09 07:58:57.212962979Z: POST /local/Status HTTP/1.1
Host: 127.0.0.1:7777
User-Agent: Go-http-client/1.1
Content-Length: 3
Accept: application/json
Content-Type: application/json
Accept-Encoding: gzip
{}
DEBUG 2020-04-09 07:58:57.216275012Z: HTTP/1.1 200 OK
Content-Length: 316
Content-Type: application/json
Date: Thu, 09 Apr 2020 07:58:57 GMT
Grpc-Metadata-Content-Type: application/grpc
{"agent_id":"/agent_id/049f902e-8bf4-44c7-9774-7968f31dc8da","runs_on_node_id":"/node_id/fa42848f-6b6d-453d-95a4-9b087c08e1b5","server_info":{"url":"https://admin:admin@pmmserver:443/","insecure_tls":true,"connected":true,"version":"2.3.0"},"config_filepath":"/usr/local/percona/pmm2/config/pmm-agent.yaml"}
DEBUG 2020-04-09 07:58:57.217122422Z: POST /local/Status HTTP/1.1
Host: 127.0.0.1:7777
User-Agent: Go-http-client/1.1
Content-Length: 26
Accept: application/json
Content-Type: application/json
Accept-Encoding: gzip
{"get_network_info":true}
DEBUG 2020-04-09 08:04:57.722574234Z: HTTP/1.1 503 Service Unavailable
Connection: close
Content-Length: 75
Content-Type: application/json
Date: Thu, 09 Apr 2020 08:04:57 GMT
{"error":"transport is closing","code":14,"message":"transport is closing"}
DEBUG 2020-04-09 08:04:57.72281479Z: Result: <nil>
DEBUG 2020-04-09 08:04:57.72289094Z: Error: &agent_local.StatusDefault{_statusCode:503, Payload:(*agent_local.StatusDefaultBody)(0xc0000f8000)}
transport is closing

Here is the message log:
Apr  9 11:04:57 linuxmachine pmm-agent: ERRO[2020-04-09T11:04:57.720+03:00] Can't get network info: failed to receive message: rpc error: code = Unavailable desc = transport is closing  component=local-server
Apr  9 11:04:57 linuxmachine pmm-agent: INFO[2020-04-09T11:04:57.720+03:00] Done.  component=actions-runner
Apr  9 11:04:57 linuxmachine pmm-agent: INFO[2020-04-09T11:04:57.720+03:00] Stopped.  component=local-server/JSON
Apr  9 11:04:57 linuxmachine pmm-agent: INFO[2020-04-09T11:04:57.720+03:00] Done.  component=supervisor
Apr  9 11:04:57 linuxmachine pmm-agent: INFO[2020-04-09T11:04:57.721+03:00] Done.  component=client
Apr  9 11:04:58 linuxmachine pmm-agent: INFO[2020-04-09T11:04:58.221+03:00] Done.  component=local-server
Apr  9 11:04:58 linuxmachine pmm-agent: INFO[2020-04-09T11:04:58.221+03:00] Starting…  component=client
Apr  9 11:04:58 linuxmachine pmm-agent: INFO[2020-04-09T11:04:58.221+03:00] Connecting to https://admin:***@pmmserver:443/ …  component=client
Apr  9 11:04:58 linuxmachine pmm-agent: INFO[2020-04-09T11:04:58.221+03:00] Starting local API server on http://127.0.0.1:7777/ …  component=local-server/JSON
Apr  9 11:04:58 linuxmachine pmm-agent: INFO[2020-04-09T11:04:58.224+03:00] Started.  component=local-server/JSON
Apr  9 11:04:58 linuxmachine pmm-agent: INFO[2020-04-09T11:04:58.235+03:00] Connected to pmmserver:443.  component=client
Apr  9 11:04:58 linuxmachine pmm-agent: INFO[2020-04-09T11:04:58.235+03:00] Establishing two-way communication channel …  component=client
Apr  9 11:04:58 linuxmachine pmm-agent: INFO[2020-04-09T11:04:58.241+03:00] Two-way communication channel established in 6.14482ms. Estimated clock drift: -1.494974ms.  component=client

OK… there are actually several communication paths you need to be aware of to get it all working. I'll do my best to list them and then give a few things to look at to get it resolved.
First is the local API on the client, which binds to localhost on port 7777. This is what pmm-admin uses to talk to the local pmm-agent (you can see those POSTs to 127.0.0.1:7777 in your debug output above).
Second is the normal client -> server communication. It defaults to HTTPS and runs over whatever port you set your pmm-server container up on (typically 443). This is also the same communication channel that QAN works over.
Third is the exporters, which are server -> client. They run locally on your client side and typically bind to ports 4200x (where x can be 2, 4, 6, 8… depending on the number of exporters you run on a single machine). In the case of Linux with MySQL running, you'd likely get the Linux node exporter bound to 42002 and the MySQL exporter on 42004. Quick checks for each of these paths are sketched right below.
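Not an official procedure, just a minimal sketch of how you could exercise each of those three paths from the command line. It assumes the default ports above (7777, 443, 42002) and the pmmserver/pmmagent host names used in this thread; the exporter will likely answer with an authentication error, but even a 401 proves the port is reachable.

# 1. local API: pmm-admin -> pmm-agent, run on the client itself
curl -s -X POST -H 'Content-Type: application/json' -d '{}' http://127.0.0.1:7777/local/Status

# 2. client -> server over HTTPS (the port from your --server-url)
curl -k -I https://pmmserver:443/

# 3. server -> client exporter port (run this from the PMM server host)
curl -s -o /dev/null -w '%{http_code}\n' http://pmmagent:42002/metrics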

Here are a few diagrams that will help illustrate what I talked through above.

To get it all up and running, you'll need to make sure all of your firewalls allow communication initiated in the right direction (so 443 from client to server, and 42002 and 42004 from server to client). A huge help here is the Prometheus targets page (https://pmmserver/prometheus/targets), which will show timeout messages and the like (I expect you'll see many errors there). Most of the time it's simply a matter of opening up the firewall and enjoying the stream of data. But there are also cases where, during the initial registration of the client with the server, we attempt to detect the right interface and get it wrong: more complex setups involve multiple network adapters, and we register an unroutable IP with the PMM server. Then, when the PMM server attempts to retrieve information from the exporter, it believes it should contact 10.0.0.3:42002, but that's a private, unroutable interface the PMM server couldn't talk to even if firewall rules were in place. In that instance you'll need to unregister and re-register the client and pass the node-address parameter.
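For reference, re-registering with an explicit, routable address might look roughly like this. Treat it as a sketch: 192.0.2.10, the node type, and the node name are placeholders, and the exact pmm-admin config flags are worth confirming with pmm-admin config --help on your version.

pmm-admin config --server-insecure-tls --server-url=https://admin:admin@pmmserver:443 --force 192.0.2.10 generic linuxmachine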

From what you described above, I'm going to guess that QAN works (though if you didn't add the MySQL or PostgreSQL exporter there may be nothing to see there). I think the issue will turn out to be opening up ports 4200x from the server to the client in your firewall(s). Once you do, you'll see the errors on the Prometheus targets page magically disappear, and within a few minutes the scrapes will populate the graphs!
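If the client host runs firewalld, opening the exporter port range there could look like the lines below; the 42000-42010 range is just an illustrative guess at how many exporters you'll end up with, and any firewalls in between still need matching rules of their own.

firewall-cmd --permanent --add-port=42000-42010/tcp
firewall-cmd --reload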

Hi Steve!
Thank you for this detailed reply. I actually found the problem. Even though every port was open between the PMM server and the agent, I still had this problem. It was because of the MTU packet size! In my environment the MTU is 9000 (higher than usual because of the project plan). The agent could query the server and register itself without any problem, but when the server tried to query the agent for the exporters, the packets got bigger and the network layer wouldn't let them through. So the network team changed that setting, and now everything is working :slight_smile:
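For anyone who hits something similar: a quick, generic way to check whether jumbo frames survive the whole path is a don't-fragment ping sized just under the MTU (8972 = 9000 minus 28 bytes of IP/ICMP headers). This is plain Linux tooling, nothing PMM-specific, and pmmserver is just the host name used in this thread.

# should succeed end to end if the 9000-byte MTU really holds everywhere
ping -M do -s 8972 -c 3 pmmserver
# a standard 1500-byte path for comparison (1472 payload + 28 bytes of headers)
ping -M do -s 1472 -c 3 pmmserver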

Hi,
Are these client-server communication principles still applicable in PMM 2.15.1?
I observe that the PMM server does NOT try to send anything towards the 4200x ports (on the pmm server: tcpdump -i any -nn port 42000 or port 42001 or port 42002 and not host 127.0.0.1).
I only observe client-server traffic on port 443, but this is supposed to be client-originated only, I guess.

No, as of a few versions ago (2.14.0, I think) we default all communication to push over the secure port used in your pmm-admin config command. This should dramatically reduce communication issues: if you can register with the PMM server, all enabled metrics should flow over the same network path (client to server).
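If you want to check or force that behaviour per service, recent pmm-admin versions expose a metrics-mode option (push, pull, or auto) on the add commands. The exact flag spelling and the credentials below are worth verifying against pmm-admin add mysql --help on your version; this is only a sketch:

pmm-admin add mysql --metrics-mode=push --username=pmm --password=secret --host=127.0.0.1 --port=3306 mysql-prod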

Are you seeing an issue or just wondering why there is no server to client communication?

No issues, I was just preparing firewall rules. So I only need to open port 443 inbound on the PMM server side.
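For completeness, on a firewalld-based PMM server host that single inbound rule would be roughly the following, assuming you kept the default HTTPS port:

firewall-cmd --permanent --add-port=443/tcp
firewall-cmd --reload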

BTW, if data is only pushed by the client towards the server (i.e. the server is not actively querying the agents, right?), would the server detect any issue on the client side if the client machine crashes (and the agent is not running)?

I think I faced a situation where a client machine was shut down (gracefully) but no 'db down' alert was raised on the PMM server for the MySQL service hosted on that client.

There are no “automatic” alerts for lack of check-in from the client side…

We'd put them in a "down" status on the Inventory page (under PMM Settings), or they'd show as down under https:///victoriametrics/targets, but I believe those "healthchecks" still rely on the PMM server being able to see the exporter on TCP port 4200x (I asked to be sure and will let you know what I hear). I think this feeds the "up" metric, so before I tell you to key in on that I need to verify…

As for an actual alert, it would be somewhat trivial to use our built-in Integrated Alerting to alert on a given node or exporter if a metric was not received in the last X seconds/minutes, and that could be sent to email/pager/Slack.

Actually, I am using Integrated Alerting and its built-in alert rule 'MySQL down', which did not produce an alert when the entire host went down.

Hmmm, I see what you mean… but you're measuring two different things there. mysql_up is the exporter's assessment of "is the mysql process running and available", which is then reported to VictoriaMetrics. But if the exporter itself is down, the value is neither 1 nor 0; it's null (or absent), which is not a trigger for the alert (it looks for == 0). So the question is: do you want the "mysql is down" alert to fire when really "the exporter is down"? I would say that's not a great proxy for "mysql is down".

So maybe a better approach is to have an exporter-specific alert like (up{agent_type="node_exporter"} or on() vector(0)) == 0, which converts empty results to 0 and alerts whenever a node_exporter agent is down. That could be indicative of "server down", but more definitively it tells you that pmm-agent is down/stopped.
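If you want to sanity-check an expression like that before putting it into an alert rule, you can run it against the Prometheus-compatible query API that PMM exposes. The /prometheus/api/v1/query path and the admin:admin credentials below are assumptions based on this thread's setup, so adjust them for your server:

curl -k -s -u admin:admin https://pmmserver/prometheus/api/v1/query --data-urlencode 'query=(up{agent_type="node_exporter"} or on() vector(0)) == 0'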

This topic is getting away from the original subject, but it's still interesting :wink:

Thanks for all the advice and explanations. I need to develop the alerts for sure. I am not familiar with the Prometheus stuff, so I used the already built-in alert for mysql_up. Since there isn't one for Galera out of the box, I'll eventually need to get familiar with it.

I have been testing that trick to monitor hosts being down, without success.

I am testing it by executing the query in the PMM2 Grafana Explore section and completely shutting down a test server.

Has anybody successfully monitored a host going down, when the node_exporter metrics don't exist at that time?
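One caveat with the or on() vector(0) trick: when the series disappears, the vector(0) side carries no node labels, so a per-host alert condition may never match the host you care about. A per-node alternative (just a sketch; "linuxmachine" is a placeholder for the node_name label value PMM attaches to metrics, and you'd need one rule per host or a templated rule) is to key on absent():

absent(up{agent_type="node_exporter", node_name="linuxmachine"})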

Best regards.
