PMM MySQL low-res exporter: context deadline exceeded

Hello,

I have a (mostly) working PMM2 (2.10.0) install. However, I noticed that some dashboards seem mostly empty or show very few metrics (lots of gaps).

After investigating, it seems that all the “low-res” Prometheus targets take too long to scrape (they show as “Unhealthy” on the http://pmmserver/prometheus/targets page), for example:

Get "http://devdbserver:42000/metrics?collect%5B%5D=binlog_size&collect%5B%5D=custom_query.lr&collect%5B%5D=engine_tokudb_status&collect%5B%5D=global_variables&collect%5B%5D=heartbeat&collect%5B%5D=info_schema.clientstats&collect%5B%5D=info_schema.innodb_tablespaces&collect%5B%5D=info_schema.userstats&collect%5B%5D=perf_schema.eventsstatements&collect%5B%5D=perf_schema.file_instances": context deadline exceeded

This is a dev server on which we have “quite a few” tables (still nowhere near what our production servers hold):

# find /var/lib/mysql -name '*.ibd' | wc -l
98301
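
(As a cross-check from inside MySQL itself, the tablespace count can be queried directly; note this assumes MySQL 5.7, where the information_schema table is INNODB_SYS_TABLESPACES; on 8.0 it is INNODB_TABLESPACES:)

# mysql -e 'SELECT COUNT(*) FROM information_schema.INNODB_SYS_TABLESPACES;'

This should roughly match the .ibd file count above.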

Getting the metrics locally (to rule out possible network issues) takes around 17 seconds:

# time curl http://localhost:42000/metrics-lr -u 'pmm:/agent_id/********' > /tmp/out.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  145M    0  145M    0     0  8585k      0  --:--:--  0:00:17 --:--:-- 35.3M
curl http://localhost:42000/metrics-lr -u  > /tmp/out.txt  0.02s user 0.14s system 0% cpu 17.394 total

The payload is 145 MB, around 1M lines of metrics.

The most represented metrics are the following:

$ grep -v '^#' /tmp/out.txt | cut -f1 -d'{' | sort | uniq -c | sort -h | tail
    250 mysql_perf_schema_events_statements_sort_rows_total
    250 mysql_perf_schema_events_statements_tmp_disk_tables_total
    250 mysql_perf_schema_events_statements_tmp_tables_total
    250 mysql_perf_schema_events_statements_total
    250 mysql_perf_schema_events_statements_warnings_total
  98145 mysql_info_schema_innodb_tablespace_allocated_size_bytes
  98145 mysql_info_schema_innodb_tablespace_file_size_bytes
  98145 mysql_info_schema_innodb_tablespace_space_info
 396528 mysql_perf_schema_file_instances_bytes
 396528 mysql_perf_schema_file_instances_total

Is there a way to increase the timeout, or maybe to not export some of these metrics?

(For the record, the HR and MR targets take less than 0.1s and 1s respectively when scraped from a remote server.)

Hi babine,

We created the following bug a few days ago to track this: https://jira.percona.com/browse/PMM-6744, which is most likely what you are seeing. Can you double-check whether collecting all but perf_schema.file_instances makes the curl command take less than 10 seconds for you too?

time curl -u 'pmm:/agent_id/********' 'http://devdbserver:42000/metrics?collect%5B%5D=binlog_size&collect%5B%5D=custom_query.lr&collect%5B%5D=engine_tokudb_status&collect%5B%5D=global_variables&collect%5B%5D=heartbeat&collect%5B%5D=info_schema.clientstats&collect%5B%5D=info_schema.innodb_tablespaces&collect%5B%5D=info_schema.userstats&collect%5B%5D=perf_schema.eventsstatements' >/dev/null
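
(Note the URL is quoted, otherwise the shell splits it at each &.) If you want to narrow it down further, here is a rough sketch that times each low-res collector on its own; the collector names are taken from your scrape URL, and you would substitute the redacted credentials:

for c in binlog_size custom_query.lr engine_tokudb_status global_variables \
         heartbeat info_schema.clientstats info_schema.innodb_tablespaces \
         info_schema.userstats perf_schema.eventsstatements perf_schema.file_instances; do
  printf '%-40s ' "$c"
  curl -s -o /dev/null -w '%{time_total}s\n' \
    -u 'pmm:/agent_id/********' "http://devdbserver:42000/metrics?collect%5B%5D=$c"
done

If perf_schema.file_instances dominates, that matches the ticket.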

Best,

Agustín.

Thanks for the answer, good news!

On the dev server itself:

# time curl 'http://localhost:42000/metrics?collect%5B%5D=binlog_size&collect%5B%5D=custom_query.lr&collect%5B%5D=engine_tokudb_status&collect%5B%5D=global_variables&collect%5B%5D=heartbeat&collect%5B%5D=info_schema.clientstats&collect%5B%5D=info_schema.innodb_tablespaces&collect%5B%5D=info_schema.userstats&collect%5B%5D=perf_schema.eventsstatements' -u 'pmm:/agent_id/********' > /tmp/out.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 37.4M    0 37.4M    0     0  11.4M      0  --:--:--  0:00:03 --:--:-- 11.4M
curl -u 'pmm:/agent_id/********' > /tmp/out.txt  0.00s user 0.05s system 1% cpu 3.336 total

It takes a bit more than 3 seconds for 37.4 MB (locally).

I tried modifying the prometheus.yml file inside the PMM Server Docker container to raise the timeout to 30 seconds, but it was somehow reverted to 10 seconds upon restart.
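
(A quick way to confirm the revert, assuming the default pmm-server container name and the stock /etc/prometheus.yml path; both may differ by PMM version:)

# docker exec pmm-server grep -n 'scrape_timeout' /etc/prometheus.yml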

Great! I suggest you follow that JIRA ticket, then, to get the latest updates on when it will be resolved.

Regarding:

I tried modifying the prometheus.yml file inside the PMM Server Docker container to raise the timeout to 30 seconds, but it was somehow reverted to 10 seconds upon restart.

Unfortunately, there is a maximum scrape_timeout of 10s set globally. Even if you raise the scrape_interval to something greater, 10s is currently the maximum timeout allowed (and the config file is overwritten automatically if you change it manually, as you have already noted). You can always create a feature request if you think it is worth it.
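
In the meantime, if losing per-table metrics on that dev server is acceptable, re-adding the service without table statistics should remove most of the offending series. This is a rough sketch only: the service name devdb-mysql is just an example, and whether these particular collectors fall under the tablestats flags depends on your PMM version (check pmm-admin add mysql --help):

pmm-admin remove mysql devdb-mysql
pmm-admin add mysql --disable-tablestats --username=pmm --password=******** devdb-mysql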