PMM2 Client fails to send Query Analytics to Server

Hello All,

We are currently testing PMM2, it is working quiet well so far, however we experiencing some problems with one of our Nodes. We currently monitor a Galera Cluster of 5 Nodes, 4 are already in PMM2, 3 of this 4 nodes are working well, system metric, mysql metrics and slow query logs are collected/sent to the server well.

One node only sends its metrics but not the slow query log, after enabling debug mode, I’ve seen the following messages:


Jan 24 08:38:59 node1 pmm-agent[2]: #033[37mDEBU#033[0m[date] No need to reset: same file, old position 609104731, new file size 609104731. #033[37magentID#033[0m=/agent_id/641159FD-8C33-4778-9CCF-013EBE92D1A4 #033[37mcomponent#033[0m=slowlog/reader #033[37mfile#033[0m=/var/log/mysql/mysql-slow.log #033[37mtype#033[0m=qan_mysql_slowlog_agent
Jan 24 08:39:00 node1 pmm-agent[2]: #033[37mDEBU#033[0m[date] Made 231 buckets out of 231 classes in 08:38:25+34 interval. Wait time: 34.389665111s. #033[37magentID#033[0m=/agent_id/641159FD-8C33-4778-9CCF-013EBE92D1A4 #033[37mcomponent#033[0m=agent-builtin #033[37mtype#033[0m=qan_mysql_slowlog_agent
Jan 24 08:39:00 node1 pmm-agent[2]: #033[37mDEBU#033[0m[date] Scheduling next aggregation in 59.983990147s at 08:40:00. #033[37magentID#033[0m=/agent_id/641159FD-8C33-4778-9CCF-013EBE92D1A4 #033[37mcomponent#033[0m=agent-builtin #033[37mtype#033[0m=qan_mysql_slowlog_agent
Jan 24 08:39:00 node1 pmm-agent[2]: #033[36mINFO#033[0m[date] Sending 231 buckets. #033[36magentID#033[0m=/agent_id/641159FD-8C33-4778-9CCF-013EBE92D1A4 #033[36mcomponent#033[0m=agent-builtin #033[36mtype#033[0m=qan_mysql_slowlog_agent
Jan 24 08:39:00 node1 pmm-agent[2]: #033[37mDEBU#033[0m[date] Sending message (461889 bytes):
Jan 24 08:39:00 node1 pmm-agent[2]: id: 8
Jan 24 08:39:00 node1 pmm-agent[2]: qan_collect: <
Jan 24 08:39:00 node1 pmm-agent[2]: metrics_bucket: <
Jan 24 08:39:00 node1 pmm-agent[2]: common: <

... SLOW QUERY LOG ENTRIES ...

Jan 24 08:39:00 node1 pmm-agent[2]: #033[37mcomponent#033[0m=channel
Jan 24 08:39:00 node1 pmm-agent[2]: #033[37mDEBU#033[0m[date] Closing with error: rpc error: code = Internal desc = grpc: error while marshaling: proto: field "agent.MetricsBucket.Common.Example" contains invalid UTF-8
Jan 24 08:39:00 node1 pmm-agent[2]: failed to send message
Jan 24 08:39:00 node1 pmm-agent[2]: github.com/percona/pmm-agent/client/channel.(*Channel).send
Jan 24 08:39:00 node1 pmm-agent[2]: #011/tmp/go/src/github.com/percona/pmm-agent/client/channel/channel.go:183
Jan 24 08:39:00 node1 pmm-agent[2]: github.com/percona/pmm-agent/client/channel.(*Channel).SendRequest
Jan 24 08:39:00 node1 pmm-agent[2]: #011/tmp/go/src/github.com/percona/pmm-agent/client/channel/channel.go:156
Jan 24 08:39:00 node1 pmm-agent[2]: github.com/percona/pmm-agent/client.(*Client).processSupervisorRequests.func2
Jan 24 08:39:00 node1 pmm-agent[2]: #011/tmp/go/src/github.com/percona/pmm-agent/client/client.go:234
Jan 24 08:39:00 node1 pmm-agent[2]: runtime.goexit
Jan 24 08:39:00 node1 pmm-agent[2]: #011/usr/local/go/src/runtime/asm_amd64.s:1337 #033[37mcomponent#033[0m=channel
Jan 24 08:39:00 node1 pmm-agent[2]: #033[33mWARN#033[0m[date] Failed to send QanCollect request. #033[33mcomponent#033[0m=client
Jan 24 08:39:00 node1 pmm-agent[2]: #033[37mDEBU#033[0m[date] Exiting receiver goroutine. #033[37mcomponent#033[0m=channel
Jan 24 08:39:00 node1 pmm-agent[2]: #033[37mDEBU#033[0m[date] Channel closed: failed to send message: rpc error: code = Internal desc = grpc: error while marshaling: proto: field "agent.MetricsBucket.Common.Example" contains invalid UTF-8. #033[37mcomponent#033[0m=client
Jan 24 08:39:00 node1 pmm-agent[2]: #033[36mINFO#033[0m[date] Connection closed. #033[36mcomponent#033[0m=client
Jan 24 08:39:00 node1 pmm-agent[2]: #033[36mINFO#033[0m[date] Process: context canceled. #033[36magentID#033[0m=/agent_id/9C263F77-FFF7-49A9-B9B8-07B868566795 #033[36mcomponent#033[0m=agent-process #033[36mtype#033[0m=mysqld_exporter
Jan 24 08:39:00 node1 pmm-agent[2]: #033[36mINFO#033[0m[date] Process: context canceled. #033[36magentID#033[0m=/agent_id/0F3F8677-ADAF-4FAF-989F-5FC230CBE29F #033[36mcomponent#033[0m=agent-process #033[36mtype#033[0m=node_exporter
Jan 24 08:39:00 node1 pmm-agent[2]: #033[36mINFO#033[0m[date] Context done with context canceled. Reader closed with <nil>. #033[36magentID#033[0m=/agent_id/641159FD-8C33-4778-9CCF-013EBE92D1A4 #033[36mcomponent#033[0m=agent-builtin #033[36mtype#033[0m=qan_mysql_slowlog_agent
Jan 24 08:39:00 node1 pmm-agent[2]: #033[36mINFO#033[0m[date] Stopped. #033[36mcomponent#033[0m=local-server/JSON
Jan 24 08:39:00 node1 pmm-agent[2]: #033[36mINFO#033[0m[date] Starting 0, restarting 0, and stopping 2 agent processes. #033[36mcomponent#033[0m=supervisor
Jan 24 08:39:00 node1 pmm-agent[2]: #033[36mINFO#033[0m[date] Done. #033[36mcomponent#033[0m=actions-runner

Based on the messages, it seems, the rpc agent used internally is not able to marshall the message containing the slow query log, because of a non UTF-8 Character, however I analysed the slow query log of all servers and all slow query logs contain non UTF-8 characters, so why is only this one node not working?

May a bug in the pmm-agent? I am not sure, please help.

Thanks in advance,
Thomas

Hello, thanks for the question. I’ll get this to the team, but it may be we ask you to raise a bug report in jira.percona.com

Hello,
Of course, I can do that, in the meanwhile I have a few more information on that:

We now turned of slow query log on that server, cleared the logfile entirely, then we enabled the slow query log again and restarted pmm-agent, afterward the slow queries were reported to the server for a few minutes until the same error appeared again. Seems like a bug in pmm-agent while marshalling slow query logs containing non UTF-8 chars.

However I also took a look at the code and based on that it seems for me like a buggy response from the server, however, I think you are the experts :slight_smile:

I will open a bugreport, thank you!

I created a bug report as well, are noted down in the bug report, if there is anything you need, please contact me.

Thank you. What is the bug number ? Can you include it for reference ?

I wonder what query causes the issue - if you were able to catch that.

Hello Peter,
This is the Bug Number:
[URL][PMM-5365] Client fails to send non-UTF-8 query analytics content to server - Percona JIRA

I cannot figure out the query causing this error and it is also not possible to send the full message including the slow query log, since it is possible that there are really sensible data in there.

However, I analysed the systems a bit more and it seems not to be related to a non UTF-8 Char in the log, since it found out that in the slow query logs of all the systems are non UTF-8 chars. Additionally I’ve take a look at the code and may it is caused by some weired answer from the server component, however again, I am not sure.

Regards,
Thomas