My graphs in Grafana sometimes work and sometimes don't

I installed PMM 1.0.4 yesterday. At first, with only one or two test hosts, the graphs worked. Later, after I added several more hosts to pmm-server one by one, they stopped working.

pmm-admin list

RUNNING is all YES

pmm-admin check-network --no-emoji

  • Client --> Server all is OK
  • Client <-- Server all is PROBLEM ( I already checked the firewall which was stopped.)

The following is one of my graphs.

Please give me any advice, thanks so much. BTW, PMM is so cool:P

Check http://server/prometheus/targets to see the endpoint status.
You can also check the log file by entering the container with "docker exec -ti pmm-server bash" and then running "vi /var/log/prometheus.log".

This usually happens when there is network latency between the server and the clients.

Another thing you can test is whether 1s resolution is too much for the system resources of the monitoring server (where the container runs) and for the network latency.
You can try 5s and see if it works better: https://www.percona.com/doc/percona-monitoring-and-management/faq.html#what-resolution-is-used-for-metrics

I found the following error in prometheus.log when graphs were not working.

time="2016-09-28T08:13:57Z" level=error msg="Storage needs throttling. Scrapes and rule evaluations will be skipped." chunksToPersist=78816 maxChunksToPersist=524288 maxToleratedMemChunks=288358 memoryChunks=300294 source="storage.go:707"
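For anyone reading along, here is a rough way to read that log line. The sketch below pulls the counters out of the line and compares them. It assumes the usual Prometheus 1.x meaning of the fields (memoryChunks is the current number of chunks held in memory, maxToleratedMemChunks is the level above which scrapes get skipped); the helper name "field" is made up for this example.

```shell
# Log line copied from the post above (quotes normalized to plain ASCII).
line='time="2016-09-28T08:13:57Z" level=error msg="Storage needs throttling. Scrapes and rule evaluations will be skipped." chunksToPersist=78816 maxChunksToPersist=524288 maxToleratedMemChunks=288358 memoryChunks=300294 source="storage.go:707"'

# Hypothetical helper: print the numeric value of a " key=NUMBER" field.
field() {
  printf '%s\n' "$line" | sed -n "s/.* $1=\([0-9][0-9]*\).*/\1/p"
}

mem=$(field memoryChunks)
max_mem=$(field maxToleratedMemChunks)
persist=$(field chunksToPersist)
max_persist=$(field maxChunksToPersist)

# Report which of the two limits has been crossed.
if [ "$mem" -gt "$max_mem" ]; then
  echo "memoryChunks ($mem) is above maxToleratedMemChunks ($max_mem)"
fi
if [ "$persist" -gt "$max_persist" ]; then
  echo "chunksToPersist ($persist) is above maxChunksToPersist ($max_persist)"
fi
```

With the values from this log line, only the in-memory chunk count is over its limit, which would point at the memory given to Prometheus rather than at the persistence queue.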

This afternoon I also tried changing settings in prometheus.yml, such as scrape_interval and scrape_timeout, and then restarted pmm-server, but the issue above is still there. :(

As you mentioned, I added the option; the output is the following… (docker create hit the same issue)

docker run -d -p 80:80 -m METRICS_RESOLUTION=5s --volumes-from pmm-data2 --name pmm-server2 --restart always percona/pmm-server:1.0.4

docker: invalid size: 'METRICS_RESOLUTION=5s'.
See 'docker run --help'.
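A note on that error: it comes from Docker itself. The -m (--memory) flag expects a memory limit such as 512m, whereas environment variables are passed to a container with -e (--env). A minimal sketch of the corrected command follows, reusing the container, volume and image names from the command above. It only prints the command so it can be reviewed first.

```shell
# -m/--memory wants a size (for example 512m), which is why Docker rejected
# "METRICS_RESOLUTION=5s". -e/--env sets a variable inside the container.
# All names below are copied from the failing command in this thread.
cmd="docker run -d -p 80:80 -e METRICS_RESOLUTION=5s --volumes-from pmm-data2 --name pmm-server2 --restart always percona/pmm-server:1.0.4"

# Print the command for review; to actually run it: eval "$cmd"
echo "$cmd"
```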

How many endpoints do you have in Prometheus? Or time series? (See the Prometheus dashboard.)
Maybe the 256M of memory dedicated to Prometheus is not enough: https://www.percona.com/doc/percona-monitoring-and-management/faq.html#how-to-control-memory-consumption-for-prometheus

Morning Roman…thanks so much for your nice hints.

Yesterday I removed all hosts from pmm-server and then added 5 new hosts back. So far all graphs are working. As you suggested, I went to prometheus/targets and found that all endpoints are UP except one (42002/metrics-lr), which is DOWN with the error "context deadline exceeded".

My PMM server is a virtual machine with 4G RAM and 2 cores. There used to be a mysqld running on it, which I stopped yesterday. I am not sure whether it has enough resources for Prometheus.

Also, what do these metrics mean: metrics-hr, metrics-mr and metrics-lr?

metrics-hr - 1s resolution metrics
metrics-mr - 5s resolution metrics
metrics-lr - 60s resolution metrics

metrics-lr includes global variables and more intensive stats like table stats, user stats etc.

"pmm-admin add mysql --help" has the following flags:
--disable-binlogstats   disable binlog statistics
--disable-processlist   disable process state metrics
--disable-tablestats    disable table statistics (disabled automatically with 10000+ tables)
--disable-userstats     disable user statistics

How many tables do you have? SELECT COUNT(*) FROM information_schema.tables

For 5 hosts I recommend bumping Prometheus memory to 1024M, since you say the VM has 4G.
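A minimal sketch of how the 1024M suggestion might be passed to the container. It assumes the METRICS_MEMORY environment variable described in the FAQ linked earlier in the thread, and that its value is given in kilobytes; please check the FAQ for your PMM version before relying on either. The container and volume names are copied from the docker run command posted earlier.

```shell
# Assumption: METRICS_MEMORY is expressed in kilobytes (see the PMM FAQ).
mem_mb=1024
mem_kb=$((mem_mb * 1024))

# Names copied from the earlier docker run command in this thread.
cmd="docker run -d -p 80:80 -e METRICS_MEMORY=$mem_kb --volumes-from pmm-data2 --name pmm-server2 --restart always percona/pmm-server:1.0.4"

# Print the command for review; to actually run it: eval "$cmd"
echo "$cmd"
```

An existing container with the same name would have to be stopped and removed first; the data container is left alone.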

There are almost 5000 tables in the instance.

Aye, I have already started learning Prometheus, which is such a huge and powerful system…

Looks like 5000 tables is still a lot when various metrics have to be returned for each one. Disabling table stats (re-adding mysql:metrics with --disable-tablestats) should bring the mysql-lr job up.

You are right, Roman. metrics-lr is now up with --disable-tablestats. Thanks a lot!

Thanks for checking. I think we should lower the table count at which table stats are disabled automatically.

That would be nice.

Roman, another issue has come up, this time with the MongoDB graphs :( I remember that the first time I added a MongoDB server, all graphs were working… Today I also tried adding the mongo server to the PMM server, but not all of the graphs are working: command operations sec, document operations, getLastError-xxx, oplog insert time, Memory fault … show no graph.

Then I went to http://server/prometheus/graph and ran the metric queries manually, and I could get values. Please give me some advice.

Thanks.

If you added the mongodb instance without the nodetype, replset flags etc., then you should see the graphs only on the Standalone instance dashboard. We plan to make nodetype and replset auto-discovered so that this is not needed.

I already added --replset repset --nodetype mongod --uri mongodb://xxxx

Before, I could get all graphs on the ReplSet type. Right now the graphs mentioned above are empty on both the Standalone instance and the Replica set dashboards.