I recently migrated my database monitoring to a new PMM 2.18 server (from 2.5). I monitor over 600 instances; with PMM 2.5, CPU usage was around 40% on an 8-core server. After migrating to PMM 2.18, CPU usage sits at 100% on a 12-core server, and I haven’t even opened the Grafana page yet …
A few thoughts:
- Maybe you could try upgrading to the latest available version, 2.21.
- Did you also update the pmm-client on each of the 600 nodes? There are significant changes between 2.5 and 2.21, so the clients definitely need to be upgraded too.
I had a similar issue in the past and managed to fix it by removing and re-adding the nodes under the same names (to preserve the data points). In my case the node count was low, so it was relatively easy to do. I used pmm-admin config --metrics-mode=push --force ...
to re-add each node (a rough sketch follows below).
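For reference, a sketch of that re-registration on one client node; the server address, credentials, node IP and node name are placeholders to adjust for your environment:

# re-register the node in push mode, keeping the existing node name
# so the historical data points are preserved
pmm-admin config --force \
    --metrics-mode=push \
    --server-insecure-tls \
    --server-url=https://admin:<password>@<pmm-server-address>:443 \
    <node-ip> generic <node-name>

# confirm the agents reconnect to the server
pmm-admin status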
In any case, I’d recommend upgrading both pmm-server and the pmm-clients to the latest available version. Also keep an eye on this thread, as it’s quite possible someone with more in-depth knowledge of the subject will provide further details.
Hi, thanks for your reply @browseman. My pmm-client and pmm-server are both at 2.18. I wasn’t lucky with the timing: if 2.21 had come out earlier I would have upgraded to it, because ClickHouse was also upgraded in 2.21 (I need the window functions). Upgrading to 2.21 now would be tiring; for the moment I have upgraded my server to 32 cores, and CPU usage is around 45%.
I also found that some pmm-agent instances keep reporting errors, e.g.:
Sep 6 21:04:52 bj2-mysql-other-prod-02 pmm-agent: INFO[2021-09-06T21:04:52.230+08:00] 2021-09-06T13:04:52.230Z error VictoriaMetrics/lib/promscrape/scrapework.go:231 error when scraping "http://127.0.0.1:42017/metrics?collect%5B%5D=custom_query.hr&collect%5B%5D=global_status&collect%5B%5D=info_schema.innodb_metrics&collect%5B%5D=standard.go&collect%5B%5D=standard.process" from job "mysqld_exporter_agent_id_f68bdaac-6bd6-491a-b61a-dbae5e64270f_hr-5s" with labels {agent_id="/agent_id/f68bdaac-6bd6-491a-b61a-dbae5e64270f",agent_type="mysqld_exporter",cluster="mha_rd_ycms",env="prod",environment="prod",instance="/agent_id/f68bdaac-6bd6-491a-b61a-dbae5e64270f",job="mysqld_exporter_agent_id_f68bdaac-6bd6-491a-b61a-dbae5e64270f_hr-5s",machine_id="/machine_id/4284cfe27c5b48c38adde30f33b4bb60",node_id="/node_id/65204015-8e53-4754-b0c9-c95f157d92df",node_name="bj2-mysql-other-prod-02",node_type="generic",region="bj",replication_set="mha_rd_ycms",service_id="/service_id/66885a71-cf22-400b-91dc-27312f6193de",service_name="bj2-mysql-other-prod-02_3311",service_type="mysql"}: error when scraping "http://127.0.0.1:42017/metrics?collect%5B%5D=custom_query.hr&collect%5B%5D=global_status&collect%5B%5D=info_schema.innodb_metrics&collect%5B%5D=standard.go&collect%5B%5D=standard.process": dial tcp4 127.0.0.1:42017: connect: connection refused; try -enableTCP6 command-line flag if you scrape ipv6 addresses agentID=/agent_id/381cc441-e286-494e-9be9-b74ddee74fd7 component=agent-process type=vm_agent
Sep 6 21:04:52 bj2-mysql-other-prod-02 pmm-agent: INFO[2021-09-06T21:04:52.929+08:00] 2021-09-06T13:04:52.929Z error VictoriaMetrics/lib/promscrape/scrapework.go:231 error when scraping "http://127.0.0.1:42017/metrics?collect%5B%5D=custom_query.mr&collect%5B%5D=engine_innodb_status&collect%5B%5D=info_schema.innodb_cmp&collect%5B%5D=info_schema.innodb_cmpmem&collect%5B%5D=info_schema.processlist&collect%5B%5D=info_schema.query_response_time&collect%5B%5D=perf_schema.eventswaits&collect%5B%5D=perf_schema.file_events&collect%5B%5D=perf_schema.tablelocks&collect%5B%5D=slave_status" from job "mysqld_exporter_agent_id_f68bdaac-6bd6-491a-b61a-dbae5e64270f_mr-10s" with labels {agent_id="/agent_id/f68bdaac-6bd6-491a-b61a-dbae5e64270f",agent_type="mysqld_exporter",cluster="mha_rd_ycms",env="prod",environment="prod",instance="/agent_id/f68bdaac-6bd6-491a-b61a-dbae5e64270f",job="mysqld_exporter_agent_id_f68bdaac-6bd6-491a-b61a-dbae5e64270f_mr-10s",machine_id="/machine_id/4284cfe27c5b48c38adde30f33b4bb60",node_id="/node_id/65204015-8e53-4754-b0c9-c95f157d92df",node_name="bj2-mysql-other-prod-02",node_type="generic",region="bj",replication_set="mha_rd_ycms",service_id="/service_id/66885a71-cf22-400b-91dc-27312f6193de",service_name="bj2-mysql-other-prod-02_3311",service_type="mysql"}: error when scraping "http://127.0.0.1:42017/metrics?collect%5B%5D=custom_query.mr&collect%5B%5D=engine_innodb_status&collect%5B%5D=info_schema.innodb_cmp&collect%5B%5D=info_schema.innodb_cmpmem&collect%5B%5D=info_schema.processlist&collect%5B%5D=info_schema.query_response_time&collect%5B%5D=perf_schema.eventswaits&collect%5B%5D=perf_schema.file_events&collect%5B%5D=perf_schema.tablelocks&collect%5B%5D=slave_status": dial tcp4 127.0.0.1:42017: connect: connection refused; try -enableTCP6 command-line flag if you scrape ipv6 addresses agentID=/agent_id/381cc441-e286-494e-9be9-b74ddee74fd7 component=agent-process type=vm_agent
…agentID=/agent_id/b6e72b0b-e94c-410e-b4cf-31edbcd911f9 component=agent-process type=mysqld_exporter
Sep 6 21:07:44 bj2-mysql-huoyun2-prod-01 pmm-agent: panic: runtime error: index out of range [54] with length 54
Sep 6 21:07:44 bj2-mysql-huoyun2-prod-01 pmm-agent: goroutine 546 [running]:
Sep 6 21:07:44 bj2-mysql-huoyun2-prod-01 pmm-agent: github.com/percona/go-mysql/query.Fingerprint(0xc000476000, 0x36, 0x142582d, 0x1a)
Sep 6 21:07:44 bj2-mysql-huoyun2-prod-01 pmm-agent: /tmp/go/pkg/mod/github.com/percona/go-mysql@v0.0.0-20200630114833-b77f37c0bfa2/query/query.go:559 +0x4edb
Sep 6 21:07:44 bj2-mysql-huoyun2-prod-01 pmm-agent: github.com/percona/pmm-agent/agents/mysql/slowlog.(*SlowLog).processFile(0xc0004d2270, 0x1631bb0, 0xc0005e0440, 0xc0006152e0, 0x1e, 0x3ff0000000000000, 0x0, 0x0)
Sep 6 21:07:44 bj2-mysql-huoyun2-prod-01 pmm-agent: /tmp/go/src/github.com/percona/pmm-agent/agents/mysql/slowlog/slowlog.go:347 +0x11c5
Sep 6 21:07:44 bj2-mysql-huoyun2-prod-01 pmm-agent: github.com/percona/pmm-agent/agents/mysql/slowlog.(*SlowLog).Run.func3(0xc0004d2270, 0xc0006152e0, 0x1e, 0xc0004ee120, 0x1631bb0, 0xc0005e0440, 0x3ff0000000000000)
Sep 6 21:07:44 bj2-mysql-huoyun2-prod-01 pmm-agent: /tmp/go/src/github.com/percona/pmm-agent/agents/mysql/slowlog/slowlog.go:141 +0xf3
Sep 6 21:07:44 bj2-mysql-huoyun2-prod-01 pmm-agent: created by github.com/percona/pmm-agent/agents/mysql/slowlog.(*SlowLog).Run
Sep 6 21:07:44 bj2-mysql-huoyun2-prod-01 pmm-agent: /tmp/go/src/github.com/percona/pmm-agent/agents/mysql/slowlog/slowlog.go:139 +0x2be
Sep 6 21:07:44 bj2-mysql-huoyun2-prod-01 systemd: pmm-agent.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Sep 6 21:07:44 bj2-mysql-huoyun2-prod-01 systemd: Unit pmm-agent.service entered failed state.
Sep 6 21:07:44 bj2-mysql-huoyun2-prod-01 systemd: pmm-agent.service failed.
My MySQL deployment runs multiple instances per host, and the pmm-agent log output is now rather large and mostly worthless, which makes my /var/log/messages file too big.
Is this log normal (/srv/logs/victoriametrics.log)? It is reloading every second:
2021-09-06T13:56:07.919Z info /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:129 nothing changed in "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:09.487Z info /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:122 SIGHUP received; reloading Prometheus configs from "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:09.488Z info /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:129 nothing changed in "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:11.108Z info /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:122 SIGHUP received; reloading Prometheus configs from "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:11.109Z info /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:129 nothing changed in "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:12.734Z info /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:122 SIGHUP received; reloading Prometheus configs from "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:12.735Z info /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:129 nothing changed in "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:14.345Z info /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:122 SIGHUP received; reloading Prometheus configs from "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:14.346Z info /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:129 nothing changed in "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:17.662Z info /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:122 SIGHUP received; reloading Prometheus configs from "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:17.663Z info /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:129 nothing changed in "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:19.343Z info /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:122 SIGHUP received; reloading Prometheus configs from "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:19.344Z info /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:129 nothing changed in "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:21.164Z info /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:122 SIGHUP received; reloading Prometheus configs from "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:21.165Z info /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:129 nothing changed in "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:22.843Z info /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:122 SIGHUP received; reloading Prometheus configs from "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:22.844Z info /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:129 nothing changed in "/etc/victoriametrics-promscrape.yml"
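(For what it’s worth, one quick way to quantify how often VictoriaMetrics is being asked to reload is to count the SIGHUP lines in that log, e.g. over the last couple of thousand lines:)

# run on the PMM server host: count recent config reloads inside the container
docker exec -it pmm-server bash -c \
    "tail -n 2000 /srv/logs/victoriametrics.log | grep -c 'SIGHUP received'"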
Hi @Fan, thanks for posting this. The VictoriaMetrics daemon should not be reloading every second unless your environment is so volatile that new instances are being set up every second or so.
Can you please check the uptime on the victoriametrics process in the container:
docker exec -it pmm-server supervisorctl status
We’re also escalating this to the Engineering team - thanks again @Fan for your help
# docker exec -it pmm-server supervisorctl status
alertmanager RUNNING pid 28, uptime 15:04:00
clickhouse RUNNING pid 22, uptime 15:04:00
cron RUNNING pid 25, uptime 15:04:00
dashboard-upgrade EXITED Sep 06 10:34 AM
dbaas-controller STOPPED Not started
grafana RUNNING pid 29517, uptime 12:02:48
nginx RUNNING pid 24, uptime 15:04:00
pmm-agent RUNNING pid 32, uptime 15:04:00
pmm-managed RUNNING pid 183, uptime 15:03:58
pmm-update-perform STOPPED Not started
postgresql RUNNING pid 21, uptime 15:04:00
prometheus STOPPED Not started
qan-api2 RUNNING pid 852, uptime 15:03:44
victoriametrics RUNNING pid 26, uptime 15:04:00
vmalert RUNNING pid 27, uptime 15:04:00
You mentioned that “VictoriaMetrics daemon should not be reloading every second, unless your environment is that volatile with new instances being set up every second or so”
I wonder if that has something to do with the fact that the agents are constantly reporting errors and restarting? I’ve found that the agents on some servers crash and restart frequently:
Sep 7 09:54:19 bj2-mysql-infra-prod-01 pmm-agent: panic: runtime error: index out of range [54] with length 54
Sep 7 09:54:19 bj2-mysql-infra-prod-01 pmm-agent: goroutine 454 [running]:
Sep 7 09:54:19 bj2-mysql-infra-prod-01 pmm-agent: github.com/percona/go-mysql/query.Fingerprint(0xc000640c80, 0x36, 0x142582d, 0x1a)
Sep 7 09:54:19 bj2-mysql-infra-prod-01 pmm-agent: /tmp/go/pkg/mod/github.com/percona/go-mysql@v0.0.0-20200630114833-b77f37c0bfa2/query/query.go:559 +0x4edb
Sep 7 09:54:19 bj2-mysql-infra-prod-01 pmm-agent: github.com/percona/pmm-agent/agents/mysql/slowlog.(*SlowLog).processFile(0xc0006cc138, 0x1631bb0, 0xc000a08240, 0xc000a0e220, 0x1e, 0x3ff0000000000000, 0x0, 0x0)
Sep 7 09:54:19 bj2-mysql-infra-prod-01 pmm-agent: /tmp/go/src/github.com/percona/pmm-agent/agents/mysql/slowlog/slowlog.go:347 +0x11c5
Sep 7 09:54:19 bj2-mysql-infra-prod-01 pmm-agent: github.com/percona/pmm-agent/agents/mysql/slowlog.(*SlowLog).Run.func3(0xc0006cc138, 0xc000a0e220, 0x1e, 0xc000a0c660, 0x1631bb0, 0xc000a08240, 0x3ff0000000000000)
Sep 7 09:54:19 bj2-mysql-infra-prod-01 pmm-agent: /tmp/go/src/github.com/percona/pmm-agent/agents/mysql/slowlog/slowlog.go:141 +0xf3
Sep 7 09:54:19 bj2-mysql-infra-prod-01 pmm-agent: created by github.com/percona/pmm-agent/agents/mysql/slowlog.(*SlowLog).Run
Sep 7 09:54:19 bj2-mysql-infra-prod-01 pmm-agent: /tmp/go/src/github.com/percona/pmm-agent/agents/mysql/slowlog/slowlog.go:139 +0x2be
Sep 7 09:54:19 bj2-mysql-infra-prod-01 systemd: pmm-agent.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Sep 7 09:54:19 bj2-mysql-infra-prod-01 systemd: Unit pmm-agent.service entered failed state.
Sep 7 09:54:19 bj2-mysql-infra-prod-01 systemd: pmm-agent.service failed.
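(A rough way to see how often this happens on a given node, assuming the agent logs to /var/log/messages as in the excerpts above:)

# count pmm-agent panics and service restarts recorded in syslog
grep -c 'pmm-agent: panic: runtime error' /var/log/messages
grep -c 'pmm-agent.service: main process exited' /var/log/messages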
Hi @Fan,
Yeah, you are right.
CPU usage might be high because of some indexing processes inside VictoriaMetrics; it should go down soon.
VictoriaMetrics is reloading every second because the agents are constantly restarting.
Could you check your slowlog file, please?
Does it contain malformed queries, e.g. truncated queries?
Hi @nurlan, thanks for your reply.
I’ve tried checking the slowlog, but it’s really a bit difficult… I didn’t find any unusual statements; at least, analysing these slow log files with pt-query-digest worked fine. I am using Percona Server 5.7.29/32.
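For reference, this is the kind of check I mean, assuming the slow log path from the configuration further below:

# summarise the slow log with pt-query-digest and review the report
pt-query-digest /data/mysql_3306/logs/slow.log > /tmp/slow-digest.txt
less /tmp/slow-digest.txt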
Could the agent be made more robust? If it encounters a log entry it can’t parse, it could skip it and write a warning to the log; that seems better than restarting.
Here are my slow-query-related parameters:
log_timestamps = SYSTEM
log_output = file
slow-query-log-file = /data/mysql_3306/logs/slow.log
slow_query_log = ON
long_query_time = 0
log_slow_rate_limit = 100
log_slow_rate_type = query
log_slow_verbosity = full
log_slow_admin_statements = ON
log_slow_slave_statements = ON
slow_query_log_always_write_time = 1
slow_query_log_use_global_control = all
I think that even if there is a truncated query statement, the agent should be able to handle the exception instead of restarting.
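(As a possible stop-gap until the parser is fixed, an affected service could be re-added with the Performance Schema query source instead of the slow log, which bypasses the slowlog parser entirely; a sketch with hypothetical credentials, using one of the service names from the logs above:)

# switch QAN for this service from slowlog to performance_schema
pmm-admin remove mysql bj2-mysql-other-prod-02_3311
pmm-admin add mysql --query-source=perfschema \
    --username=pmm --password=<password> \
    --service-name=bj2-mysql-other-prod-02_3311 \
    --host=127.0.0.1 --port=3311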
Here are the slow log file and /var/log/messages; I hope this helps with debugging:
https://www.swisstransfer.com/d/07d6ecbb-1715-42b2-912e-7a8fac6a044b
Thank you.
I’ll check the logs, and if the problem is in the parser I’ll create a task to skip queries that we can’t parse.
Hi @nurlan, is there any progress?
Hi @Fan
It seems we have a few bugs in VictoriaMetrics, some of them related to CPU and memory usage; a version upgrade is planned for the next release, as per the Jira ticket.
@Naresh9999
got it, thank you!
Hi @Fan,
I couldn’t find any unusual lines there. Could you try with the latest PMM, please? We have fixed a few slowlog-related bugs; they might be related to yours.
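For reference, a minimal sketch of the client-side part of such an upgrade on yum-based hosts, assuming the Percona repository is already configured:

# on each monitored node: upgrade the client package and restart the agent
yum update -y pmm2-client
systemctl restart pmm-agent
pmm-admin status   # confirm the new version and that the agent is connected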