I have installed PMM 1.05 and added some hosts with linux:metrics and mysql:metrics, but Grafana shows the graphs only sporadically.
During the periods with no graphs, the client status is DOWN.
SERVICE TYPE   NAME    REMOTE ENDPOINT        STATUS
-------------- ------- ---------------------- -------
linux:metrics  db-1-1  144.76.15.145:42000    DOWN
mysql:metrics  db-1-1  144.76.15.145:42002    DOWN
I run pmm-server as follows to prevent sync issues.
PING db-1-1 (144.76.15.145) 56(84) bytes of data.
64 bytes from db-1-1 (x.x.x.x): icmp_seq=1 ttl=60 time=0.311 ms
64 bytes from db-1-1 (x.x.x.x): icmp_seq=2 ttl=60 time=0.333 ms
64 bytes from db-1-1 (x.x.x.x): icmp_seq=3 ttl=60 time=0.684 ms
Server side: Wed Oct 26 10:56:51 UTC 2016
Client Side: Wed Oct 26 12:56:51 CEST 2016
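The two timestamps above can be checked for clock skew by normalizing both to the same time zone. A minimal sketch, assuming CEST is a fixed UTC+2 offset (hard-coded here rather than looked up in a time zone database):

```python
from datetime import datetime, timedelta, timezone

# Assumption: CEST is treated as a fixed UTC+2 offset.
CEST = timezone(timedelta(hours=2), "CEST")

# Timestamps copied from the server and client output above.
server = datetime(2016, 10, 26, 10, 56, 51, tzinfo=timezone.utc)
client = datetime(2016, 10, 26, 12, 56, 51, tzinfo=CEST)

# Subtracting timezone-aware datetimes compares the underlying instants.
skew_seconds = abs((server - client).total_seconds())
print(f"clock skew: {skew_seconds:.0f} s")
```

Under that assumption both clocks point to the same instant, so time drift between client and server does not look like the cause here.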
If I look in /var/log/prometheus.log, there are no error entries. Sometimes there are absolutely no entries for a few hours.
I have been having this (or a similar) issue, and it seems that Prometheus runs out of memory while ingesting the incoming metrics.
And usually when it dies, it really dies, and I have to restart the Docker container.
I have been bumping up the memory with this setting:
METRICS_MEMORY=2097152
I pass it on the docker run command line. The default is fairly small; it worked fine until I added one too many servers or had a peak in events.
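To get a feel for what that value means, here is a small conversion sketch. It assumes METRICS_MEMORY is given in kilobytes, which the post itself does not state, so check the unit against the PMM documentation for your version:

```python
# Assumption: METRICS_MEMORY is expressed in kilobytes (1 KB = 1024 bytes).
metrics_memory_kb = 2097152

# Convert kilobytes to GiB for easier comparison with the container limit.
metrics_memory_gib = metrics_memory_kb / (1024 * 1024)
print(f"METRICS_MEMORY={metrics_memory_kb} is {metrics_memory_gib:g} GiB")
```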
You might take a look at the Prometheus graph in Grafana.
My running server looks like this when I run docker stats pmm-server:
CONTAINER    CPU %    MEM USAGE / LIMIT     MEM %    NET I/O               BLOCK I/O
pmm-server   71.24%   6.77 GB / 7.934 GB    85.33%   6.932 GB / 379.2 MB   3.766 GB / 18.15 GB
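The memory figures in that docker stats line can be turned into a percentage and remaining headroom with a few lines of Python. This is only a sketch: the helper name `parse_size_gb` is made up for illustration, and it handles only the "GB" unit that appears in the output above.

```python
def parse_size_gb(text):
    """Parse a size such as '6.77 GB' into a float number of GB."""
    value, unit = text.split()
    if unit != "GB":
        raise ValueError(f"unsupported unit: {unit}")
    return float(value)

# MEM USAGE / LIMIT column copied from the docker stats output above.
mem_column = "6.77 GB / 7.934 GB"
usage_gb, limit_gb = (parse_size_gb(part.strip()) for part in mem_column.split("/"))

mem_percent = 100 * usage_gb / limit_gb
headroom_gb = limit_gb - usage_gb
print(f"used {mem_percent:.2f}% of the limit, {headroom_gb:.3f} GB left")
```

The result agrees with the MEM % column and shows that only a little over 1 GB of headroom remains on that container.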
… slowly adding monitored hosts to the PMM server.
This worked for the first 9 servers, but only 3 minutes after adding number 10 the monitoring stopped completely.
Again, there are no log entries in /var/log/prometheus.log, and ALL metrics services on ALL servers are in the running state but without connectivity to the server:
SERVICE TYPE   NAME    REMOTE ENDPOINT        STATUS
-------------- ------- ---------------------- -------
linux:metrics  db-1-2  x.x.x.x:42000          DOWN
mysql:metrics  db-1-2  x.x.x.x:42002          DOWN
PMM is a nice and helpful tool, but it does not seem very robust, and we are not able to work with it at the moment.
process_cpu_seconds_total 23.42
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1024
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 9
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.370112e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.47749268343e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 2.7041792e+07
...
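The process_* lines above are in the Prometheus text exposition format and are easy to read programmatically. The following is a rough sketch that handles only unlabeled `name value` samples like the ones shown; `parse_metrics` is a hypothetical helper, not part of PMM or Prometheus, and the sample text is copied from the output above.

```python
def parse_metrics(text):
    """Parse unlabeled 'name value' samples, skipping comments and other lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or HELP/TYPE comment
        parts = line.split()
        if len(parts) != 2:
            continue  # not a plain 'name value' sample (e.g. the '...' line)
        name, value = parts
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # value is not numeric
    return samples

SAMPLE = """\
# TYPE process_max_fds gauge
process_max_fds 1024
process_open_fds 9
process_resident_memory_bytes 1.370112e+07
process_virtual_memory_bytes 2.7041792e+07
...
"""

metrics = parse_metrics(SAMPLE)
fd_usage = metrics["process_open_fds"] / metrics["process_max_fds"]
rss_mib = metrics["process_resident_memory_bytes"] / 2**20
print(f"open fds: {fd_usage:.1%} of limit, resident memory: {rss_mib:.1f} MiB")
```

Read this way, the process that exposed these metrics uses about 13 MiB of resident memory and 9 of 1024 file descriptors, so it does not look resource-starved itself.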
Since my last post, I have received data three times without changing anything in the infrastructure.