Hello,
I have a (mostly) working PMM2 (2.10.0) install. However, I noticed that some dashboards are mostly empty or show very few metrics (lots of gaps).
After investigating, it seems that all the “low-res” Prometheus targets take too long to scrape (they show up as “Unhealthy” on the http://pmmserver/prometheus/targets page), for example:
Get "http://devdbserver:42000/metrics?collect%5B%5D=binlog_size&collect%5B%5D=custom_query.lr&collect%5B%5D=engine_tokudb_status&collect%5B%5D=global_variables&collect%5B%5D=heartbeat&collect%5B%5D=info_schema.clientstats&collect%5B%5D=info_schema.innodb_tablespaces&collect%5B%5D=info_schema.userstats&collect%5B%5D=perf_schema.eventsstatements&collect%5B%5D=perf_schema.file_instances": context deadline exceeded
This is a dev server with quite a few tables (still nowhere near what we have on production servers):
# find /var/lib/mysql -name '*.ibd' | wc -l
98301
Fetching the metrics locally (to rule out network issues) takes around 17s:
# time curl http://localhost:42000/metrics-lr -u 'pmm:/agent_id/********' > /tmp/out.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  145M    0  145M    0     0  8585k      0 --:--:--  0:00:17 --:--:-- 35.3M
curl http://localhost:42000/metrics-lr -u > /tmp/out.txt 0.02s user 0.14s system 0% cpu 17.394 total
The output is 145M, for around 1M lines.
The most heavily represented metrics are:
$ grep -v '^#' /tmp/out.txt | cut -f1 -d'{' | sort | uniq -c | sort -h | tail
250 mysql_perf_schema_events_statements_sort_rows_total
250 mysql_perf_schema_events_statements_tmp_disk_tables_total
250 mysql_perf_schema_events_statements_tmp_tables_total
250 mysql_perf_schema_events_statements_total
250 mysql_perf_schema_events_statements_warnings_total
98145 mysql_info_schema_innodb_tablespace_allocated_size_bytes
98145 mysql_info_schema_innodb_tablespace_file_size_bytes
98145 mysql_info_schema_innodb_tablespace_space_info
396528 mysql_perf_schema_file_instances_bytes
396528 mysql_perf_schema_file_instances_total
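The counts above show how many series each metric has, but not how much of the 145M each one accounts for. A rough sketch that also sums the line sizes per metric name could look like the following. `metric_sizes` is a hypothetical helper name (not part of PMM), and the sizes are character counts, which only approximate bytes:

```shell
# Hypothetical helper: per-metric size and series count for a Prometheus
# text-format dump. Output columns: approx. bytes, series count, metric name.
metric_sizes() {
    awk '
        /^#/ || /^[[:space:]]*$/ { next }   # skip HELP/TYPE comments and blank lines
        {
            name = $0
            sub(/[{ ].*$/, "", name)        # keep only the metric name
            series[name]++
            bytes[name] += length($0) + 1   # +1 for the newline
        }
        END {
            for (name in series)
                printf "%d\t%d\t%s\n", bytes[name], series[name], name
        }
    ' "$1" | sort -n
}
```

Running `metric_sizes /tmp/out.txt | tail` on the dump from the curl command above should list the largest metric families last.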
Is there a way to increase the scrape timeout, or to stop exporting some of these metrics?
(For the record, the HR and MR targets take less than 0.1s and 1s respectively when scraped from a remote server.)