PMM does not load

Hi All,

We opted to use PMM, but I find that it regularly crashes, even though every component shows as running when I check the Docker status.
By "crashes" I mean that none of the PMM pages load, neither the first page http://<hostname> nor the others:

http://<hostname>
http://<hostname>/graph
http://<hostname>/qan
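
To record exactly how each page fails (HTTP error, refused connection, or timeout), the three URLs can be probed from a script. This is a minimal sketch, not part of PMM: `probe` and `probe_all` are hypothetical helper names, and the base URL is whatever stands in for `<hostname>`.

```python
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Return a short status string for one URL: the HTTP code or the failure reason."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # No usable answer at all: refused, timed out, DNS failure, ...
        return f"FAILED: {exc}"


def probe_all(base, paths=("/", "/graph", "/qan"), timeout=5.0):
    """Probe the landing page plus the /graph and /qan pages under one base URL."""
    return {path: probe(base.rstrip("/") + path, timeout) for path in paths}
```

Calling `probe_all("http://<hostname>")` with the real host name returns one status string per page, which shows whether nginx answers at all or the connection itself fails.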

The docker log indicates that all components are running

2017-09-23 16:52:10,588 INFO success: dashboard-upgrade entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2017-09-23 16:52:10,743 INFO exited: dashboard-upgrade (exit status 0; expected)
2017-09-23 16:52:12,394 INFO success: mysql entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-09-23 16:52:12,394 INFO success: consul entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-09-23 16:52:12,394 INFO success: grafana entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-09-23 16:52:12,394 INFO success: nginx entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-09-23 16:52:12,394 INFO success: cron entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-09-23 16:52:12,394 INFO success: qan-api entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-09-23 16:52:12,394 INFO success: prometheus entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-09-23 16:52:12,394 INFO success: createdb entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-09-23 16:52:12,394 INFO success: orchestrator entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-09-23 16:52:12,394 INFO success: node_exporter entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-09-23 16:52:12,395 INFO success: pmm-manage entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-09-23 16:52:15,602 INFO exited: createdb (exit status 0; expected)

Probing the log files inside the Docker container further, the only error I find is this one in the Prometheus log file:

time="2017-09-23T16:52:11Z" level=info msg="Listening on :9090" source="web.go:259"
time="2017-09-23T16:52:11Z" level=error msg="Error refreshing service list: Get http://localhost:8500/v1/catalog/services?dc=dc1&wait=30000ms: dial tcp 127.0.0.1:8500: getsockopt: connection refused" source="consul.go:168"
time="2017-09-23T16:52:11Z" level=error msg="Error refreshing service list: Get http://localhost:8500/v1/catalog/services?dc=dc1&wait=30000ms: dial tcp 127.0.0.1:8500: getsockopt: connection refused" source="consul.go:168"
time="2017-09-23T16:52:11Z" level=error msg="Error refreshing service list: Get http://localhost:8500/v1/catalog/services?dc=dc1&wait=30000ms: dial tcp 127.0.0.1:8500: getsockopt: connection refused" source="consul.go:168"
time="2017-09-23T16:52:11Z" level=error msg="Error refreshing service list: Get http://localhost:8500/v1/catalog/services?dc=dc1&wait=30000ms: dial tcp 127.0.0.1:8500: getsockopt: connection refused" source="consul.go:168"
time="2017-09-23T16:52:11Z" level=error msg="Error refreshing service list: Get http://localhost:8500/v1/catalog/services?dc=dc1&wait=30000ms: dial tcp 127.0.0.1:8500: getsockopt: connection refused" source="consul.go:168"
time="2017-09-23T16:52:11Z" level=error msg="Error refreshing service list: Get http://localhost:8500/v1/catalog/services?dc=dc1&wait=30000ms: dial tcp 127.0.0.1:8500: getsockopt: connection refused" source="consul.go:168"
time="2017-09-23T16:55:59Z" level=info msg="Completed initial partial maintenance sweep through 248 in-memory fingerprints in 3m46.630478224s." source="storage.go:1398"
time="2017-09-23T16:57:11Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633"
time="2017-09-23T16:57:11Z" level=info msg="Done checkpointing in-memory metrics and chunks in 162.865568ms." source="persistence.go:665"
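
To see how often each error occurs and over what time span, the Prometheus log can be summarized by level and source. The following is a rough sketch that assumes the `key="value"` line format with plain ASCII quotes, as Prometheus writes it to disk; `parse_line` and `summarize` are hypothetical helper names, and `SAMPLE` holds three lines copied from the excerpt above.

```python
import re

# key="quoted value" or key=bare_value, as in the Prometheus log lines above.
PAIR = re.compile(r'(\w+)=(?:"((?:[^"\\]|\\.)*)"|(\S+))')


def parse_line(line):
    """Split one log line into a dict of its key/value fields."""
    return {key: quoted or bare for key, quoted, bare in PAIR.findall(line)}


def summarize(lines):
    """Count lines per (level, source) and keep the first and last timestamp seen."""
    summary = {}
    for line in lines:
        fields = parse_line(line)
        if "time" not in fields:
            continue  # not a log line in the expected format
        key = (fields.get("level"), fields.get("source"))
        entry = summary.setdefault(key, {"count": 0, "first": fields["time"]})
        entry["count"] += 1
        entry["last"] = fields["time"]
    return summary


ERR = ('time="2017-09-23T16:52:11Z" level=error msg="Error refreshing service list: '
       'Get http://localhost:8500/v1/catalog/services?dc=dc1&wait=30000ms: '
       'dial tcp 127.0.0.1:8500: getsockopt: connection refused" source="consul.go:168"')
SAMPLE = [
    'time="2017-09-23T16:52:11Z" level=info msg="Listening on :9090" source="web.go:259"',
    ERR,
    ERR,
]

for (level, source), entry in summarize(SAMPLE).items():
    print(level, source, entry["count"], entry["first"], entry["last"])
```

In the excerpt above, every "connection refused" line carries the timestamp 16:52:11, one second before the Docker log reports consul as RUNNING at 16:52:12, so these errors may only reflect startup ordering rather than an ongoing failure.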


And this in the consul log file

b62a46f2 vsn:2 build:unknown'':2c77151 port:8300 vsn_min:2 vsn_max:3 role:consul dc:dc1 raft_vsn:2] alive 1 5 2 2 5 4}
2017/09/23 17:10:17 [WARN] consul: skipping reconcile of node {2d4eb3808ef9 127.0.0.1 8301 map[vsn_max:3 role:consul dc:dc1 raft_vsn:2 port:8300 vsn_min:2 bootstrap:1 wan_join_port:8302 id:5807a948-eebf-0987-ca46-55fab62a46f2 vsn:2 build:unknown'':2c77151] alive 1 5 2 2 5 4}
2017/09/23 17:11:17 [WARN] consul: skipping reconcile of node {2d4eb3808ef9 127.0.0.1 8301 map[vsn_max:3 role:consul dc:dc1 raft_vsn:2 bootstrap:1 wan_join_port:8302 id:5807a948-eebf-0987-ca46-55fab62a46f2 vsn:2 build:unknown'':2c77151 port:8300 vsn_min:2] alive 1 5 2 2 5 4}
2017/09/23 17:12:17 [WARN] consul: skipping reconcile of node {2d4eb3808ef9 127.0.0.1 8301 map[raft_vsn:2 vsn_max:3 role:consul dc:dc1 vsn:2 build:unknown'':2c77151 port:8300 vsn_min:2 bootstrap:1 wan_join_port:8302 id:5807a948-eebf-0987-ca46-55fab62a46f2] alive 1 5 2 2 5 4}
2017/09/23 17:13:17 [WARN] consul: skipping reconcile of node {2d4eb3808ef9 127.0.0.1 8301 map[vsn_max:3 role:consul dc:dc1 raft_vsn:2 bootstrap:1 wan_join_port:8302 id:5807a948-eebf-0987-ca46-55fab62a46f2 vsn:2 build:unknown'':2c77151 port:8300 vsn_min:2] alive 1 5 2 2 5 4}
2017/09/23 17:14:17 [WARN] consul: skipping reconcile of node {2d4eb3808ef9 127.0.0.1 8301 map[wan_join_port:8302 id:5807a948-eebf-0987-ca46-55fab62a46f2 vsn:2 build:unknown'':2c77151 port:8300 vsn_min:2 bootstrap:1 role:consul dc:dc1 raft_vsn:2 vsn_max:3] alive 1 5 2 2 5 4}
2017/09/23 17:15:17 [WARN] consul: skipping reconcile of node {2d4eb3808ef9 127.0.0.1 8301 map[dc:dc1 raft_vsn:2 vsn_max:3 role:consul id:5807a948-eebf-0987-ca46-55fab62a46f2 vsn:2 build:unknown'':2c77151 port:8300 vsn_min:2 bootstrap:1 wan_join_port:8302] alive 1 5 2 2 5 4}
2017/09/23 17:16:17 [WARN] consul: skipping reconcile of node {2d4eb3808ef9 127.0.0.1 8301 map[vsn:2 build:unknown'':2c77151 port:8300 vsn_min:2 bootstrap:1 wan_join_port:8302 id:5807a948-eebf-0987-ca46-55fab62a46f2 raft_vsn:2 vsn_max:3 role:consul dc:dc1] alive 1 5 2 2 5 4}
2017/09/23 17:17:17 [WARN] consul: skipping reconcile of node {2d4eb3808ef9 127.0.0.1 8301 map[vsn_max:3 role:consul dc:dc1 raft_vsn:2 bootstrap:1 wan_join_port:8302 id:5807a948-eebf-0987-ca46-55fab62a46f2 vsn:2 build:unknown'':2c77151 port:8300 vsn_min:2] alive 1 5 2 2 5 4}
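
To check whether these warnings involve one node or several, the lines can be tallied by node name. This is a small sketch under the assumption that each warning has the shape shown above (`skipping reconcile of node {<name> <address> <port> ...`); `tally_nodes` is a hypothetical helper, and `SAMPLE` holds two of the warnings from the excerpt with their wrapped halves rejoined.

```python
import re
from collections import Counter

# Captures the node name, address, and port that follow the opening brace.
NODE = re.compile(r"skipping reconcile of node \{(\S+) (\S+) (\d+) ")


def tally_nodes(text):
    """Count 'skipping reconcile' warnings per (node name, address, port)."""
    return Counter(match.groups() for match in NODE.finditer(text))


SAMPLE = (
    "2017/09/23 17:10:17 [WARN] consul: skipping reconcile of node {2d4eb3808ef9 "
    "127.0.0.1 8301 map[vsn_max:3 role:consul dc:dc1 raft_vsn:2 port:8300 vsn_min:2 "
    "bootstrap:1 wan_join_port:8302 id:5807a948-eebf-0987-ca46-55fab62a46f2 vsn:2 "
    "build:unknown'':2c77151] alive 1 5 2 2 5 4}\n"
    "2017/09/23 17:11:17 [WARN] consul: skipping reconcile of node {2d4eb3808ef9 "
    "127.0.0.1 8301 map[vsn_max:3 role:consul dc:dc1 raft_vsn:2 bootstrap:1 "
    "wan_join_port:8302 id:5807a948-eebf-0987-ca46-55fab62a46f2 vsn:2 "
    "build:unknown'':2c77151 port:8300 vsn_min:2] alive 1 5 2 2 5 4}\n"
)

for (name, address, port), count in tally_nodes(SAMPLE).items():
    print(name, address, port, count)
```

Every warning in the excerpt names the same node, 2d4eb3808ef9, at 127.0.0.1 port 8301.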

Can you please advise me on how to proceed?

Thanks,
Tanveer

Hi tanveermadan,

I haven’t seen this issue before, but you appear to be on the right track. It looks as though something is wrong in the consul database, and it seems that just one node, the one named in the consul warnings, is causing the issue. One option would be to remove these services from PMM via:

pmm-admin remove --all

Then check to see if the consul error has cleared.

If you are comfortable at the command line, you may wish to explore the consul data; a starting place is the Consul documentation page "Catalog - HTTP API | Consul by HashiCorp".
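
As a rough illustration of that kind of exploration, the sketch below lists the nodes consul knows about via the catalog endpoint `/v1/catalog/nodes` and filters them by name. It assumes consul answers on `http://localhost:8500`, the address that appears in the Prometheus log, so it would have to run somewhere that address is reachable (for example inside the PMM Server container, if Python is available there). `catalog_nodes` and `find_node` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: the consul address taken from the Prometheus error message.
CONSUL = "http://localhost:8500"


def catalog_nodes(base=CONSUL, timeout=5.0):
    """Fetch the list of registered nodes from the consul catalog API."""
    with urllib.request.urlopen(f"{base}/v1/catalog/nodes", timeout=timeout) as resp:
        return json.load(resp)


def find_node(nodes, name):
    """Return the catalog entries whose 'Node' field equals the given name."""
    return [node for node in nodes if node.get("Node") == name]
```

For example, `find_node(catalog_nodes(), "2d4eb3808ef9")` would return the catalog entry for the node named in the warnings, including its registered address, if consul has one.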

Hi Michael Coburn,

Thanks for your response!
How do I identify which of the nodes is 2d4eb3808ef9?
I tried using http://<pmm-server>/v1/catalog/nodes, but I am getting “Page cannot be displayed”.

Also, does the command “pmm-admin remove --all” need to be run from all the nodes we are monitoring? Will it not remove the metrics gathered so far?

Also, I think the command may not succeed, because pmm-admin check-network is failing from the target nodes:

./pmm-admin check-network
Unable to connect to PMM server by address: 172.19.155.50
Get http://172.19.155.50: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

  • Check if the configured address is correct.
  • If server is running on non-default port, ensure it was specified along with the address.
  • If server is enabled for SSL or self-signed SSL, enable the corresponding option.
  • You may also check the firewall settings.

./pmm-admin list
Unable to connect to PMM server by address: 172.19.155.50
Get http://172.19.155.50: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

  • Check if the configured address is correct.
  • If server is running on non-default port, ensure it was specified along with the address.
  • If server is enabled for SSL or self-signed SSL, enable the corresponding option.
  • You may also check the firewall settings.
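
To separate a network problem from a PMM problem, a plain TCP connection test from a target node can help. This is a minimal sketch, independent of pmm-admin: `can_connect` is a hypothetical helper, and port 80 is an assumption based on the plain `http://` address in the output above.

```python
import socket


def can_connect(host, port=80, timeout=5.0):
    """Try to open a TCP connection; return (ok, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.timeout:
        # Nothing answered in time: often a firewall dropping packets or a hung host.
        return False, "timed out"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return False, "connection refused"
    except OSError as exc:
        return False, str(exc)
```

Running `can_connect("172.19.155.50")` from a monitored node returns a pair such as `(False, "timed out")`. A timeout here would match the `Client.Timeout exceeded` message from pmm-admin and would point toward the network path or firewall rather than the consul data.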

Thanks,
Tanveer