VictoriaMetrics CPU usage looks much higher than Prometheus

I recently migrated my database monitoring from PMM 2.5 to a new PMM 2.18 setup. Monitoring over 600 instances, CPU usage was around 40% on an 8-core server with PMM 2.5. After migrating to PMM 2.18, CPU usage sits at 100% on a 12-core server, and I haven’t even opened the Grafana page yet…

A few thoughts:

  • Maybe you could try upgrading to the latest available version, 2.21.
  • Did you also update the pmm-client on each of the 600 nodes? There are significant changes between 2.5 and 2.21, so the clients definitely need to be upgraded too (a rough sketch of the client-side upgrade is below).
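A minimal sketch of the client-side upgrade, assuming the clients were installed from the Percona repositories; run it on each monitored node (RHEL/CentOS shown, use apt on Debian/Ubuntu):

yum update -y pmm2-client
pmm-admin status   # confirm the agent reconnects and reports the new version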

I had a similar issue in the past and managed to fix it by removing and re-adding the nodes with the same name (to preserve the data points). In my case the node count was low, so it was relatively easy to accomplish. I used pmm-admin config --metrics-mode=push --force ... to re-add each node.
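Roughly, the per-node command looked like the following; the server URL, credentials, and node details here are placeholders, so adjust them to your environment:

pmm-admin config --metrics-mode=push --force \
  --server-insecure-tls \
  --server-url=https://admin:<password>@<pmm-server-address>:443 \
  <node-address> generic <node-name>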

In any case, I recommend trying to upgrade both pmm-server and the pmm-clients to the latest available version. Also keep checking here, as it’s quite possible someone with more in-depth knowledge of the subject will provide further details.

3 Likes

Hi, thanks for your reply @browseman. My pmm-client and server are both at 2.18. I’m not lucky: if 2.21 had come out earlier I would have upgraded to it, since ClickHouse was also upgraded in 2.21 (I need the window functions). Upgrading to 2.21 now would be tiring, so for the moment I just upgraded my server to 32 cores, and CPU usage is around 45%.

I found that some pmm-agent instances keep reporting errors, e.g.:

Sep  6 21:04:52 bj2-mysql-other-prod-02 pmm-agent: #033[36mINFO#033[0m[2021-09-06T21:04:52.230+08:00] 2021-09-06T13:04:52.230Z#011error#011VictoriaMetrics/lib/promscrape/scrapework.go:231#011error when scraping "http://127.0.0.1:42017/metrics?collect%5B%5D=custom_query.hr&collect%5B%5D=global_status&collect%5B%5D=info_schema.innodb_metrics&collect%5B%5D=standard.go&collect%5B%5D=standard.process" from job "mysqld_exporter_agent_id_f68bdaac-6bd6-491a-b61a-dbae5e64270f_hr-5s" with labels {agent_id="/agent_id/f68bdaac-6bd6-491a-b61a-dbae5e64270f",agent_type="mysqld_exporter",cluster="mha_rd_ycms",env="prod",environment="prod",instance="/agent_id/f68bdaac-6bd6-491a-b61a-dbae5e64270f",job="mysqld_exporter_agent_id_f68bdaac-6bd6-491a-b61a-dbae5e64270f_hr-5s",machine_id="/machine_id/4284cfe27c5b48c38adde30f33b4bb60",node_id="/node_id/65204015-8e53-4754-b0c9-c95f157d92df",node_name="bj2-mysql-other-prod-02",node_type="generic",region="bj",replication_set="mha_rd_ycms",service_id="/service_id/66885a71-cf22-400b-91dc-27312f6193de",service_name="bj2-mysql-other-prod-02_3311",service_type="mysql"}: error when scraping "http://127.0.0.1:42017/metrics?collect%5B%5D=custom_query.hr&collect%5B%5D=global_status&collect%5B%5D=info_schema.innodb_metrics&collect%5B%5D=standard.go&collect%5B%5D=standard.process": dial tcp4 127.0.0.1:42017: connect: connection refused; try -enableTCP6 command-line flag if you scrape ipv6 addresses  #033[36magentID#033[0m=/agent_id/381cc441-e286-494e-9be9-b74ddee74fd7 #033[36mcomponent#033[0m=agent-process #033[36mtype#033[0m=vm_agent
Sep  6 21:04:52 bj2-mysql-other-prod-02 pmm-agent: #033[36mINFO#033[0m[2021-09-06T21:04:52.929+08:00] 2021-09-06T13:04:52.929Z#011error#011VictoriaMetrics/lib/promscrape/scrapework.go:231#011error when scraping "http://127.0.0.1:42017/metrics?collect%5B%5D=custom_query.mr&collect%5B%5D=engine_innodb_status&collect%5B%5D=info_schema.innodb_cmp&collect%5B%5D=info_schema.innodb_cmpmem&collect%5B%5D=info_schema.processlist&collect%5B%5D=info_schema.query_response_time&collect%5B%5D=perf_schema.eventswaits&collect%5B%5D=perf_schema.file_events&collect%5B%5D=perf_schema.tablelocks&collect%5B%5D=slave_status" from job "mysqld_exporter_agent_id_f68bdaac-6bd6-491a-b61a-dbae5e64270f_mr-10s" with labels {agent_id="/agent_id/f68bdaac-6bd6-491a-b61a-dbae5e64270f",agent_type="mysqld_exporter",cluster="mha_rd_ycms",env="prod",environment="prod",instance="/agent_id/f68bdaac-6bd6-491a-b61a-dbae5e64270f",job="mysqld_exporter_agent_id_f68bdaac-6bd6-491a-b61a-dbae5e64270f_mr-10s",machine_id="/machine_id/4284cfe27c5b48c38adde30f33b4bb60",node_id="/node_id/65204015-8e53-4754-b0c9-c95f157d92df",node_name="bj2-mysql-other-prod-02",node_type="generic",region="bj",replication_set="mha_rd_ycms",service_id="/service_id/66885a71-cf22-400b-91dc-27312f6193de",service_name="bj2-mysql-other-prod-02_3311",service_type="mysql"}: error when scraping "http://127.0.0.1:42017/metrics?collect%5B%5D=custom_query.mr&collect%5B%5D=engine_innodb_status&collect%5B%5D=info_schema.innodb_cmp&collect%5B%5D=info_schema.innodb_cmpmem&collect%5B%5D=info_schema.processlist&collect%5B%5D=info_schema.query_response_time&collect%5B%5D=perf_schema.eventswaits&collect%5B%5D=perf_schema.file_events&collect%5B%5D=perf_schema.tablelocks&collect%5B%5D=slave_status": dial tcp4 127.0.0.1:42017: connect: connection refused; try -enableTCP6 command-line flag if you scrape ipv6 addresses  #033[36magentID#033[0m=/agent_id/381cc441-e286-494e-9be9-b74ddee74fd7 #033[36mcomponent#033[0m=agent-process #033[36mtype#033[0m=vm_agent
3[0m=/agent_id/b6e72b0b-e94c-410e-b4cf-31edbcd911f9 #033[36mcomponent#033[0m=agent-process #033[36mtype#033[0m=mysqld_exporter
Sep  6 21:07:44 bj2-mysql-huoyun2-prod-01 pmm-agent: panic: runtime error: index out of range [54] with length 54
Sep  6 21:07:44 bj2-mysql-huoyun2-prod-01 pmm-agent: goroutine 546 [running]:
Sep  6 21:07:44 bj2-mysql-huoyun2-prod-01 pmm-agent: github.com/percona/go-mysql/query.Fingerprint(0xc000476000, 0x36, 0x142582d, 0x1a)
Sep  6 21:07:44 bj2-mysql-huoyun2-prod-01 pmm-agent: /tmp/go/pkg/mod/github.com/percona/go-mysql@v0.0.0-20200630114833-b77f37c0bfa2/query/query.go:559 +0x4edb
Sep  6 21:07:44 bj2-mysql-huoyun2-prod-01 pmm-agent: github.com/percona/pmm-agent/agents/mysql/slowlog.(*SlowLog).processFile(0xc0004d2270, 0x1631bb0, 0xc0005e0440, 0xc0006152e0, 0x1e, 0x3ff0000000000000, 0x0, 0x0)
Sep  6 21:07:44 bj2-mysql-huoyun2-prod-01 pmm-agent: /tmp/go/src/github.com/percona/pmm-agent/agents/mysql/slowlog/slowlog.go:347 +0x11c5
Sep  6 21:07:44 bj2-mysql-huoyun2-prod-01 pmm-agent: github.com/percona/pmm-agent/agents/mysql/slowlog.(*SlowLog).Run.func3(0xc0004d2270, 0xc0006152e0, 0x1e, 0xc0004ee120, 0x1631bb0, 0xc0005e0440, 0x3ff0000000000000)
Sep  6 21:07:44 bj2-mysql-huoyun2-prod-01 pmm-agent: /tmp/go/src/github.com/percona/pmm-agent/agents/mysql/slowlog/slowlog.go:141 +0xf3
Sep  6 21:07:44 bj2-mysql-huoyun2-prod-01 pmm-agent: created by github.com/percona/pmm-agent/agents/mysql/slowlog.(*SlowLog).Run
Sep  6 21:07:44 bj2-mysql-huoyun2-prod-01 pmm-agent: /tmp/go/src/github.com/percona/pmm-agent/agents/mysql/slowlog/slowlog.go:139 +0x2be
Sep  6 21:07:44 bj2-mysql-huoyun2-prod-01 systemd: pmm-agent.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Sep  6 21:07:44 bj2-mysql-huoyun2-prod-01 systemd: Unit pmm-agent.service entered failed state.
Sep  6 21:07:44 bj2-mysql-huoyun2-prod-01 systemd: pmm-agent.service failed.

My MySQL setup is a multi-instance deployment on a single host, and the pmm-agent log output is now large and mostly useless, which makes my /var/log/messages file too big.

Is this log (/srv/logs/victoriametrics.log) normal? It reloads every second:

2021-09-06T13:56:07.919Z        info    /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:129  nothing changed in "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:09.487Z        info    /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:122  SIGHUP received; reloading Prometheus configs from "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:09.488Z        info    /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:129  nothing changed in "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:11.108Z        info    /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:122  SIGHUP received; reloading Prometheus configs from "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:11.109Z        info    /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:129  nothing changed in "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:12.734Z        info    /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:122  SIGHUP received; reloading Prometheus configs from "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:12.735Z        info    /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:129  nothing changed in "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:14.345Z        info    /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:122  SIGHUP received; reloading Prometheus configs from "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:14.346Z        info    /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:129  nothing changed in "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:17.662Z        info    /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:122  SIGHUP received; reloading Prometheus configs from "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:17.663Z        info    /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:129  nothing changed in "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:19.343Z        info    /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:122  SIGHUP received; reloading Prometheus configs from "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:19.344Z        info    /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:129  nothing changed in "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:21.164Z        info    /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:122  SIGHUP received; reloading Prometheus configs from "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:21.165Z        info    /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:129  nothing changed in "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:22.843Z        info    /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:122  SIGHUP received; reloading Prometheus configs from "/etc/victoriametrics-promscrape.yml"
2021-09-06T13:56:22.844Z        info    /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.53.1/lib/promscrape/scraper.go:129  nothing changed in "/etc/victoriametrics-promscrape.yml"
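As a rough way to quantify the reload rate, counting the SIGHUP lines inside the pmm-server container should work (the grep pattern is taken from the log above):

# count config-reload events recorded in the VictoriaMetrics log
docker exec pmm-server grep -c 'SIGHUP received; reloading Prometheus configs' /srv/logs/victoriametrics.log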

Hi @Fan, thanks for posting this. The VictoriaMetrics daemon should not be reloading every second unless your environment is volatile enough that new instances are being set up every second or so.

Can you please check the uptime of the victoriametrics process in the container:
docker exec -it pmm-server -- supervisorctl status
We’re also escalating this to the Engineering team. Thanks again for your help, @Fan :slight_smile:

1 Like

Hi @Michael_Coburn

# docker exec -it pmm-server supervisorctl status
alertmanager                     RUNNING   pid 28, uptime 15:04:00
clickhouse                       RUNNING   pid 22, uptime 15:04:00
cron                             RUNNING   pid 25, uptime 15:04:00
dashboard-upgrade                EXITED    Sep 06 10:34 AM
dbaas-controller                 STOPPED   Not started
grafana                          RUNNING   pid 29517, uptime 12:02:48
nginx                            RUNNING   pid 24, uptime 15:04:00
pmm-agent                        RUNNING   pid 32, uptime 15:04:00
pmm-managed                      RUNNING   pid 183, uptime 15:03:58
pmm-update-perform               STOPPED   Not started
postgresql                       RUNNING   pid 21, uptime 15:04:00
prometheus                       STOPPED   Not started
qan-api2                         RUNNING   pid 852, uptime 15:03:44
victoriametrics                  RUNNING   pid 26, uptime 15:04:00
vmalert                          RUNNING   pid 27, uptime 15:04:00
1 Like

You mentioned that “the VictoriaMetrics daemon should not be reloading every second unless your environment is volatile enough that new instances are being set up every second or so”.

I wonder if that has something to do with the agent constantly reporting errors and restarting? I’ve found that the agents on some servers crash and restart frequently:

Sep  7 09:54:19 bj2-mysql-infra-prod-01 pmm-agent: panic: runtime error: index out of range [54] with length 54
Sep  7 09:54:19 bj2-mysql-infra-prod-01 pmm-agent: goroutine 454 [running]:
Sep  7 09:54:19 bj2-mysql-infra-prod-01 pmm-agent: github.com/percona/go-mysql/query.Fingerprint(0xc000640c80, 0x36, 0x142582d, 0x1a)
Sep  7 09:54:19 bj2-mysql-infra-prod-01 pmm-agent: /tmp/go/pkg/mod/github.com/percona/go-mysql@v0.0.0-20200630114833-b77f37c0bfa2/query/query.go:559 +0x4edb
Sep  7 09:54:19 bj2-mysql-infra-prod-01 pmm-agent: github.com/percona/pmm-agent/agents/mysql/slowlog.(*SlowLog).processFile(0xc0006cc138, 0x1631bb0, 0xc000a08240, 0xc000a0e220, 0x1e, 0x3ff0000000000000, 0x0, 0x0)
Sep  7 09:54:19 bj2-mysql-infra-prod-01 pmm-agent: /tmp/go/src/github.com/percona/pmm-agent/agents/mysql/slowlog/slowlog.go:347 +0x11c5
Sep  7 09:54:19 bj2-mysql-infra-prod-01 pmm-agent: github.com/percona/pmm-agent/agents/mysql/slowlog.(*SlowLog).Run.func3(0xc0006cc138, 0xc000a0e220, 0x1e, 0xc000a0c660, 0x1631bb0, 0xc000a08240, 0x3ff0000000000000)
Sep  7 09:54:19 bj2-mysql-infra-prod-01 pmm-agent: /tmp/go/src/github.com/percona/pmm-agent/agents/mysql/slowlog/slowlog.go:141 +0xf3
Sep  7 09:54:19 bj2-mysql-infra-prod-01 pmm-agent: created by github.com/percona/pmm-agent/agents/mysql/slowlog.(*SlowLog).Run
Sep  7 09:54:19 bj2-mysql-infra-prod-01 pmm-agent: /tmp/go/src/github.com/percona/pmm-agent/agents/mysql/slowlog/slowlog.go:139 +0x2be
Sep  7 09:54:19 bj2-mysql-infra-prod-01 systemd: pmm-agent.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Sep  7 09:54:19 bj2-mysql-infra-prod-01 systemd: Unit pmm-agent.service entered failed state.
Sep  7 09:54:19 bj2-mysql-infra-prod-01 systemd: pmm-agent.service failed.
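As a rough way to see how often the agent on a node dies, counting the systemd failure lines in the syslog should work, for example:

# count pmm-agent crashes recorded on an affected node
grep -c 'pmm-agent.service: main process exited' /var/log/messages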
1 Like

Hi @Fan,

Yeah, you are right.
CPU usage might be high because of indexing processes inside VictoriaMetrics; it should go down soon.
VictoriaMetrics is reloading every second because the agents are constantly restarting.

Could you please check your slow log file? Does it contain malformed queries, e.g. truncated ones?

2 Likes

Hi @nurlan, thanks for your reply.

I’ve tried checking the slow log, but it’s really a bit difficult… I didn’t find any unusual statements; at least pt-query-digest was able to analyse these slow log files without problems. I am using Percona Server 5.7.29/32.
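For reference, the pt-query-digest check was along these lines (the slow log path comes from the configuration below, and the output file is arbitrary):

# summarise the slow log with pt-query-digest
pt-query-digest /data/mysql_3306/logs/slow.log > /tmp/slow_digest.txt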

Could the agent be more robust? If it encounters an entry it can’t parse, it could skip it and write a warning to the log; that feels better than restarting.

Here are my slow-query-related parameters:

log_timestamps                              = SYSTEM
log_output                                  = file
slow-query-log-file                         = /data/mysql_3306/logs/slow.log
slow_query_log                              = ON
long_query_time                             = 0
log_slow_rate_limit                         = 100
log_slow_rate_type                          = query
log_slow_verbosity                          = full
log_slow_admin_statements                   = ON
log_slow_slave_statements                   = ON
slow_query_log_always_write_time            = 1
slow_query_log_use_global_control           = all

I think that even if there is a truncated query statement, the agent should handle the exception instead of restarting.

Here are the slow log file and /var/log/messages; I hope this helps with debugging:
https://www.swisstransfer.com/d/07d6ecbb-1715-42b2-912e-7a8fac6a044b

1 Like

Thank you,
I’ll check the logs, and if the problem is in the parser, I’ll create a task to skip queries that we can’t parse.

2 Likes

Hi @nurlan, is there any progress?

1 Like

Hi @Fan

It seems we have a few bugs in VictoriaMetrics, some of which are related to CPU and memory. A version upgrade is currently planned for the next release, as per the Jira ticket:

https://jira.percona.com/browse/PMM-9358

1 Like

@Naresh9999
got it, thank you!

2 Likes

Hi @Fan,
I couldn’t find any unusual lines there. Could you try the latest PMM, please? We have fixed a few bugs related to the slow log; they might be related to your issue.

1 Like