PMM targets becoming unhealthy after some time

We have set up PMM Server (latest Docker image, v2.26) and configured our clients with pmm2-client v2.9.0. Around a day after all monitored hosts (~300) were added to the server, we observe targets becoming unhealthy. This frequently occurs when a host reboots, but not only then. On the client side we get “Failed to establish two-way communication channel: context canceled.” and on the server “dial tcp4 X.X.X.X:42000: connect: connection refused; try -enableTCP6 command-line flag if you scrape ipv6 addresses”. This is odd, as everything works as expected until this point and there are no network changes.
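To separate a plain connectivity problem from a pmm-managed problem when the server reports “connection refused”, the port from the error can be probed from a machine that shares the server's network path. This is a minimal sketch, assuming port 42000 as shown in the error above; the function name `exporter_port_reachable` and the hosts passed on the command line are hypothetical.

```python
import socket
import sys


def exporter_port_reachable(host: str, port: int = 42000, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Port 42000 is taken from the error message in this thread; adjust it if
    your exporters listen elsewhere.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    # Usage (hypothetical hosts): python check_port.py db01 db02
    for host in sys.argv[1:]:
        state = "open" if exporter_port_reachable(host) else "refused/unreachable"
        print(f"{host}:42000 {state}")
```

If the port is closed while pmm-agent itself is running on the client, that would suggest the exporter was never (re)started on that host, rather than a network change.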

At times this can be resolved by restarting pmm-agent on the client, but we have observed that this is not always enough, and we resort to restarting the pmm-server Docker container as well. We have observed that this issue coincides with the ‘pmm-managed’ service “hanging” and not updating its configuration (updateConfiguration took X.Ys.). We should also note that we have recompiled pmm-managed, increasing ‘configurationUpdateTimeout’ to 120s.

What we are looking for is to identify why this issue emerges and to pinpoint the time of the event. Are there automated service restarts/checks that we should be aware of? Any tips on what we should be looking for in our logs?

Hi @george1 , welcome to the Percona forums!
We seem to have a couple of bugs identified with the PMM 2.26 release that are causing your unhealthy client situation.
[PMM-9671] pmm-managed is restarting continuously because of null version of pmm-agent (Percona JIRA): a null pmm-agent version crashes pmm-managed.

[PMM-9614] Upgrading PMM Server from 2.25 to 2.26 while monitoring a mongo with SSL enabled causes the agents to break (Percona JIRA): migrations are not run for some data in the PG database (this ticket discusses MongoDB, but I expect it affects any service_type, such as MySQL).

If you would like to verify this, get a shell in the PMM Server docker container and check the logs for pmm-managed:

docker exec -it pmm-server bash
tail -f /srv/logs/pmm-managed.log

Please watch for our 2.27 release, which should fix your issues. Also note that there is a workaround for the null pmm-agent versions; see those JIRAs to check whether they help you.

Thank you for your reply!
Issue 9671 does not apply to this case, as there are no pmm-agents with a null version and pmm-managed does not appear to be restarting.

pmm-managed=> select agent_id from agents where agent_type = 'pmm-agent' and version is null;
 agent_id
----------
(0 rows)

There is also no panic log (as in 9614) or any other Go stack trace that would indicate a problem.

Our logs consist mostly of:

  • Updating versions... versions retrieving is not supported on pmm-agent Done. Next check in 5s
  • Round-trip time: Xms. Estimated clock drift: Yms.
  • Starting RPC /server.Server/Readiness ... RPC /server.Server/Readiness done in Xms.

Log lines such as those below indicate a recovery on the monitored host (usually after restarting pmm-agent):

SetState response: .                          agent_id=/agent_id/<ID> request=<REQUEST>
sendSetStateRequest took 1.321725174s.        agent_id=/agent_id/<ID> request=<REQUEST>
Configuration reloaded.                       component=victoriametrics
updateConfiguration took 2.956468448s.        component=victoriametrics
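One way to pinpoint when pmm-managed starts to hang is to scan its log for unusually slow `updateConfiguration` entries and print the full lines, which carry whatever timestamp the log has. This is a minimal sketch, assuming the durations are formatted as in the lines above (Go-style, such as `2.956468448s` or `1m2.5s`); the 10-second default threshold and the helper names are arbitrary, hypothetical choices.

```python
import re
import sys

# Seconds per Go duration unit (Go prints microseconds as "µs").
UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0,
                "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "ns": 1e-9}
UNIT = r"(?:ns|µs|us|ms|h|m|s)"
PART_RE = re.compile(rf"(\d+(?:\.\d+)?)({UNIT})")
# Captures the duration in e.g. "updateConfiguration took 2.956468448s."
TOOK_RE = re.compile(rf"updateConfiguration took ((?:\d+(?:\.\d+)?{UNIT})+)")


def go_duration_seconds(text: str) -> float:
    """Convert a Go-style duration such as '1m2.5s' or '512ms' to seconds."""
    return sum(float(value) * UNIT_SECONDS[unit]
               for value, unit in PART_RE.findall(text))


def slow_updates(lines, threshold_s: float = 10.0):
    """Yield (seconds, line) for updateConfiguration entries at or above threshold_s."""
    for line in lines:
        match = TOOK_RE.search(line)
        if match:
            seconds = go_duration_seconds(match.group(1))
            if seconds >= threshold_s:
                yield seconds, line.rstrip("\n")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python slow_updates.py /srv/logs/pmm-managed.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for seconds, line in slow_updates(log):
            print(f"{seconds:8.1f}s  {line}")
```

The first slow entry before targets turn unhealthy would be a reasonable place to start reading the surrounding log lines.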

Any hint would be greatly appreciated.

Hi @george1 , thanks for your note. At this time I don’t have any fresh information to move your case forward; however, I’ve asked internally for an Engineer to take a look at this thread and see whether we can suggest further troubleshooting. It’ll probably be Monday before we get eyes on it. Have a great weekend!

Hard to say. Could you please attach your server and client/agent logs for review?
Also, did you update the clients as well?
Maybe also open a JIRA ticket with those logs.

It looks like I have the same issue with MySQL/ProxySQL monitoring. From time to time pmm-server loses connections to its clients, with the same errors that @george1 has.
It started when I upgraded the PMM infrastructure to 2.26; after downgrading the server to 2.23 I had 6 days without problems. Now it has started again. I am using a 2.23 server and 2.26 clients.
I can also confirm that PMM server tries to query agents which are not in the list of active agents at all.

pmm-managed.log file: Dropbox - pmm-managed.log
I cannot upload it here.

Please let me know if I can provide any other useful information.

@george1 please could you clarify something?

We have setup PMM server (latest docker image v2.26) and configured our clients with pmm2-client v2.9.0.

You have shown a completely different version of the client in relation to the server. Is that correct?

This is correct. The pmm2-client running on targets is on version 2.9.0 (2.9.0-6.jessie). PMM server is the latest (2.26) docker image.

OK, well that in itself could cause you issues. The version of the server should match the client, with server upgrades being performed ahead of client upgrades.
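The rule above (versions should match, and the client should never be ahead of the server) can be checked mechanically across many hosts. This is a minimal sketch, not an official Percona compatibility check; the helper names are hypothetical, and it assumes version strings start with dotted numbers, so a package suffix such as `-6.jessie` is ignored.

```python
import re


def parse_version(version: str) -> tuple:
    """Turn '2.26' or '2.9.0-6.jessie' into a three-field tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric version in {version!r}")
    parts = [int(p) for p in match.group(0).split(".")]
    parts += [0] * (3 - len(parts))  # pad '2.26' to (2, 26, 0)
    return tuple(parts[:3])


def check_versions(server: str, client: str) -> str:
    """Compare client and server versions numerically, field by field."""
    s, c = parse_version(server), parse_version(client)
    if c == s:
        return "ok"
    if c < s:
        return "client older than server"
    return "client newer than server"


if __name__ == "__main__":
    # Version pairs mentioned in this thread.
    print(check_versions("2.26", "2.9.0-6.jessie"))
    print(check_versions("2.23", "2.26"))
```

Comparing numeric tuples matters here: as plain strings, "2.9.0" sorts after "2.26", which would wrongly make the 2.9.0 client look newer than the 2.26 server.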

Debian Jessie went EOL nearly 2 years ago and I would recommend that you upgrade to a supported distro release.

I rolled back pmm-server and all clients to the previous stable version, 2.23. Everything is back to normal. Clients on 2.26 periodically lost their connection to the server.
