We have set up a PMM server (latest Docker image, v2.26) and configured our clients with pmm2-client v2.9.0. Around a day after all monitored hosts (~300) had been added to the server, we observe targets becoming unhealthy. This frequently happens after a host reboots, but not always. On the client side we get "Failed to establish two-way communication channel: context canceled." and on the server "dial tcp4 X.X.X.X:42000: connect: connection refused; try -enableTCP6 command-line flag if you scrape ipv6 addresses". This is odd, as everything works as expected until that point and there are no network changes.
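For what it is worth, a quick way to tell whether that "connection refused" is a dead listener or a network problem is to probe the scrape port directly (42000 is just the port from the error above; any HTTP response, even a 401, means something is still listening, while "connection refused" means the exporter is gone):
# on the affected client: is anything still listening on the exporter port?
ss -ltnp | grep 42000
# from the PMM server host: can the port be reached at all? (use https and -k if your exporters are served over TLS)
curl -s -o /dev/null -w '%{http_code}\n' http://X.X.X.X:42000/metrics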
At times this can be resolved by restarting the pmm-agent on the client, but we have observed that this is not always enough and resort to restarting the pmm-server Docker container as well. We have also observed that this issue coincides with the "pmm-managed" service "hanging" and not updating its configuration (updateConfiguration took X.Ys.). We should also note that we have recompiled pmm-managed, increasing the "configurationUpdateTimeout" to 120s.
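For reference, the restarts mentioned above are just the standard ones (the systemd unit name assumes the stock pmm2-client package install, and pmm-server is the default container name):
# on the client
systemctl restart pmm-agent
# on the server host
docker restart pmm-server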
What we are looking for is to identify why this issue emerges and to pinpoint the time of the event. Are there automated service restarts/checks that we should be aware of? Any tips on what we should be looking for in our logs?
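For the server side, the checks we can think of are along these lines (this assumes the stock pmm-server container, where the internal services run under supervisord and logs live under /srv/logs; both details may vary by version):
# did any of the server-side services (pmm-managed, victoriametrics, ...) restart on their own?
docker exec pmm-server supervisorctl status
# how long are configuration updates taking, and when did the "connection refused" errors start?
docker exec pmm-server grep -RE 'updateConfiguration took|connection refused' /srv/logs/ | tail -n 50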
Please watch for our 2.27 release, which should fix your issues. Also note that there is a workaround for the null pmm-agent versions; see those JIRAs to check whether they help you.
Thank you for your reply!
The 9671 issue does not apply in this case, as there are no pmm-agents with a null version and pmm-managed does not appear to be restarting.
pmm-managed=> select agent_id from agents where agent_type = 'pmm-agent' and version is null;
agent_id
----------
(0 rows)
There is also no panic log (as in 9614) or any other Go stack trace that would indicate a problem.
Our logs consist mostly of:
Updating versions... versions retrieving is not supported on pmm-agent Done. Next check in 5s
Round-trip time: Xms. Estimated clock drift: Yms.
Starting RPC /server.Server/Readiness ... RPC /server.Server/Readiness done in Xms.
Logs like those below indicate a recovery on the monitored host (usually after restarting the pmm-agent):
SetState response: . agent_id=/agent_id/<ID> request=<REQUEST>
sendSetStateRequest took 1.321725174s. agent_id=/agent_id/<ID> request=<REQUEST>
Configuration reloaded. component=victoriametrics
updateConfiguration took 2.956468448s. component=victoriametrics
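To try to pinpoint when an agent drops off, we are also thinking of a query along these lines in the same psql session (this assumes the agents table has status and updated_at columns; the schema may differ between PMM versions):
pmm-managed=> select agent_id, status, updated_at from agents where agent_type = 'pmm-agent' order by updated_at asc limit 10;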
Hi @george1, thanks for your note. At this time I don't have any fresh information to provide in order to move your case forward; however, I've asked internally for an engineer to take a look at this thread and see whether we can suggest further troubleshooting. It'll probably be Monday before we get eyes on it. Have a great weekend,
Hard to say; could you please attach your server and client/agent logs for review?
Also, did you update the clients as well?
Maybe also open a JIRA ticket with those logs.
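If it helps, the usual way to gather those is roughly the following (pmm-server is assumed as the container name; adjust for your setup):
# on a client: bundles pmm-agent/exporter status and logs into a zip archive
pmm-admin summary
# on the server: download https://<pmm-server-address>/logs.zip, or pull the log directory out of the container
docker exec pmm-server tar czf - /srv/logs > pmm-server-logs.tgz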
Looks like I have the same issue with MySQL/ProxySQL monitoring. From time to time the pmm-server loses connections to its clients, with the same errors @george1 has.
It started when I upgraded the PMM infrastructure to 2.26, and after downgrading the server to 2.23 I had 6 days without problems. Now it has started again. I am using a 2.23 server and 2.26 clients.
Also, I can confirm that the PMM server tries to query agents which are not in the list of active agents at all.
OK, well, that in itself could cause you issues. The server version should match the client version, with server upgrades performed ahead of client upgrades.
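A quick way to compare the two sides is something like this (nothing PMM-specific beyond pmm-admin itself):
# on each client: prints the pmm-admin and pmm-agent versions
pmm-admin status
# on the server host: shows which image tag the pmm-server container is running
docker ps --filter name=pmm-server --format '{{.Image}} {{.Status}}'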
Debian Jessie went EOL nearly 2 years ago and I would recommend that you upgrade to a supported distro release.
I rolled back pmm-server and all clients to the previous stable version, 2.23. Everything is back to normal. Clients with 2.26 periodically lost connections with the server.