PMM is losing instances

Hi. I have updated my PMM server from 1.1.1 to 1.1.5 (pmm-data was created 7 months ago with 1.0.5) following the instructions at [url]https://www.percona.com/doc/percona-monitoring-and-management/deploy/server/upgrade.html[/url]
Grafana and QAN lost all data. After a few minutes all instances disappeared from Grafana, and in Prometheus I saw all endpoints with status ‘DOWN’ and the error ‘context deadline exceeded’ (even for pmm-server and prometheus). On the clients, check-network shows that the linux and mysql metrics are DOWN. QAN works well without any problems.
I created a new instance and installed the pmm-server and pmm-data containers at 1.1.5. On the new one everything was fine for about 30 minutes, then instances started to disappear one by one from Grafana and from Prometheus…

Now I have two PMM servers on different instances, but neither of them works.

Can you share the output of the following command?

docker inspect pmm-data | grep Destination

If it is empty, just share the full output of docker inspect pmm-data.

On the new instance:

~# docker inspect pmm-data | grep Destination
["Destination": "/var/lib/grafana",
"Destination": "/var/lib/mysql",
"Destination": "/opt/consul-data",
"Destination": "/opt/prometheus/data",

On the old one:

~# docker inspect pmm-data
[
{
"Id": "eaf8681be6405469168eacad991f76a5352aace429976143148fb12bd356ce09",
"Created": "2016-11-29T17:12:22.452424676Z",
"Path": "/bin/true",
"Args": [],
"State": {
"Status": "created",
"Running": false,
"Paused": false,
"Restarting": false,
"OOMKilled": false,
"Dead": false,
"Pid": 0,
"ExitCode": 0,
"Error": "",
"StartedAt": "0001-01-01T00:00:00Z",
"FinishedAt": "0001-01-01T00:00:00Z"
},
"Image": "sha256:cc8abb43be91c0c1e86c68455e3fe921193706db62d320884c574c610025be83",
"ResolvConfPath": "",
"HostnamePath": "",
"HostsPath": "",
"LogPath": "",
"Name": "/pmm-data",
"RestartCount": 0,
"Driver": "devicemapper",
"MountLabel": "",
"ProcessLabel": "",
"AppArmorProfile": "",
"ExecIDs": null,
"HostConfig": {
"Binds": null,
"ContainerIDFile": "",
"LogConfig": {
"Type": "json-file",
"Config": {}
},
"NetworkMode": "bridge",
"PortBindings": {},
"RestartPolicy": {
"Name": "no",
"MaximumRetryCount": 0
},
"AutoRemove": false,
"VolumeDriver": "",
"VolumesFrom": null,
"CapAdd": null,
"CapDrop": null,
"Dns": [],
"DnsOptions": [],
"DnsSearch": [],
"ExtraHosts": null,
"GroupAdd": null,
"IpcMode": "",
"Cgroup": "",
"Links": null,
"OomScoreAdj": 0,
"PidMode": "",
"Privileged": false,
"PublishAllPorts": false,
"ReadonlyRootfs": false,
"SecurityOpt": null,
"UTSMode": "",
"UsernsMode": "",
"ShmSize": 0,
"ConsoleSize": [
0,
0
],
"Isolation": "",
"CpuShares": 0,
"Memory": 0,
"CgroupParent": "",
"BlkioWeight": 0,
"BlkioWeightDevice": null,
"BlkioDeviceReadBps": null,
"BlkioDeviceWriteBps": null,
"BlkioDeviceReadIOps": null,
"BlkioDeviceWriteIOps": null,
"CpuPeriod": 0,
"CpuQuota": 0,
"CpusetCpus": "",
"CpusetMems": "",
"Devices": [],
"DiskQuota": 0,
"KernelMemory": 0,
"MemoryReservation": 0,
"MemorySwap": 0,
"MemorySwappiness": null,
"OomKillDisable": null,
"PidsLimit": 0,
"Ulimits": null,
"CpuCount": 0,
"CpuPercent": 0,
"IOMaximumIOps": 0,
"IOMaximumBandwidth": 0
},
"GraphDriver": {
"Name": "devicemapper",
"Data": {
"DeviceId": "30",
"DeviceName": "docker-202:1-657645-eaf8681be6405469168eacad991f76a5352aace429976143148fb12bd356ce09",
"DeviceSize": "10737418240"
}
},
"Mounts": [],
"Config": {
"Hostname": "eaf8681be640",
"Domainname": "",
"User": "",
"AttachStdin": false,
"AttachStdout": true,
"AttachStderr": true,
"ExposedPorts": {
"443/tcp": {},
"80/tcp": {}
},
"Tty": false,
"OpenStdin": false,
"StdinOnce": false,
"Env": [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
],
"Cmd": [
"/bin/true"
],
"Image": "percona/pmm-server:1.0.5",
"Volumes": {
"/opt/consul-data": {},
"/opt/prometheus/data": {},
"/var/lib/grafana": {},
"/var/lib/mysql": {}
},
"WorkingDir": "/opt",
"Entrypoint": null,
"OnBuild": null,
"Labels": {}
},
"NetworkSettings": {
"Bridge": "",
"SandboxID": "",
"HairpinMode": false,
"LinkLocalIPv6Address": "",
"LinkLocalIPv6PrefixLen": 0,
"Ports": null,
"SandboxKey": "",
"SecondaryIPAddresses": null,
"SecondaryIPv6Addresses": null,
"EndpointID": "",
"Gateway": "",
"GlobalIPv6Address": "",
"GlobalIPv6PrefixLen": 0,
"IPAddress": "",
"IPPrefixLen": 0,
"IPv6Gateway": "",
"MacAddress": "",
"Networks": null
}
}
]

Can you also update pmm-client on all hosts to the latest version, please?
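For reference, upgrading the client is normally just a package update from the Percona repository, assuming the repository is already configured on each host (a rough sketch; the exact commands depend on your distribution):

yum update pmm-client                                          # RHEL/CentOS hosts
apt-get update && apt-get install --only-upgrade pmm-client   # Debian/Ubuntu hosts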

All clients were updated as soon as the server was updated. They are the same version, 1.1.5.

Can you share the output of pmm-admin check-network?

~# pmm-admin check-network
PMM Network Status

Server Address | IP
Client Address | IP

* System Time
NTP Server (0.pool.ntp.org) | 2017-07-14 17:24:52 +0000 UTC
PMM Server | 2017-07-14 17:24:52 +0000 GMT
PMM Client | 2017-07-14 17:24:52 +0000 UTC
PMM Server Time Drift | OK
PMM Client Time Drift | OK
PMM Client to PMM Server Time Drift | OK

* Connection: Client --> Server
-------------------- -------
SERVER SERVICE STATUS
-------------------- -------
Consul API OK
Prometheus API OK
Query Analytics API OK

Connection duration | 425.573µs
Request duration | 707.882µs
Full round trip | 1.133455ms


* Connection: Client <-- Server
-------------- ------------------------------- -------------------- ------- ---------- ---------
SERVICE TYPE NAME REMOTE ENDPOINT STATUS HTTPS/TLS PASSWORD
-------------- ------------------------------- -------------------- ------- ---------- ---------
linux:metrics NAME IP:42000 DOWN YES -
mysql:metrics NAME IP:42002 DOWN YES -

When an endpoint is down it may indicate that the corresponding service is stopped (run 'pmm-admin list' to verify).
If it's running, check out the logs /var/log/pmm-*.log

When all endpoints are down but 'pmm-admin list' shows they are up and no errors in the logs,
check the firewall settings whether this system allows incoming connections from server to address:port in question.

Also you can check the endpoint status by the URL: http://pmm_server/prometheus/targets

And from the PMM Server instance:

# nc -vz client_ip 42000
Connection to client_ip 42000 port [tcp/*] succeeded!
# nc -vz client_ip 42002
Connection to client_ip 42002 port [tcp/*] succeeded!

Can you share the output of the following command:

docker exec -it pmm-server curl --insecure https://CLIENT-IP:42000/metrics | head

Please run it on the PMM Server side.

What does it show? Some kind of ping?

~# docker exec -it pmm-server curl --insecure https://CLIENT-IP:42000/metrics | head
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 3.1984e-05
go_gc_duration_seconds{quantile="0.25"} 4.4129e-05
go_gc_duration_seconds{quantile="0.5"} 0.001346496
go_gc_duration_seconds{quantile="0.75"} 0.00406372
go_gc_duration_seconds{quantile="1"} 0.012393153
go_gc_duration_seconds_sum 0.099726205
go_gc_duration_seconds_count 41
# HELP go_goroutines Number of goroutines that currently exist.
write /dev/stdout: broken pipe

Hmm, strange.

Can you check again: is the [URL]https://CLIENT-IP:42000/metrics[/URL] target really down on the Prometheus targets page?
Because it is accessible from the Docker container.

Can you also measure the response time? Please share the output of the following command:

docker exec -it pmm-server bash -c 'time curl --insecure https://CLIENT-IP:42000/metrics >/dev/null'

I’ve checked once more. The current client has disappeared from Prometheus; I don’t see it in the UI. But pmm-admin list shows that everything is OK. I have double-checked the host and IPs.

# docker exec -it pmm-server bash -c 'time curl --insecure https://CLIENT-IP:42000/metrics >/dev/null'
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 83206 100 83206 0 0 689k 0 --:--:-- --:--:-- --:--:-- 694k

real 0m0.123s
user 0m0.049s
sys 0m0.037s

It’s really strange behaviour. I’ve never seen anything like this before. Usually I run the upgrade commands and everything is perfect, but not this time.

Can you remove and add this client again?

Like:

pmm-admin remove mysql
pmm-admin add mysql

Nothing changed.

But when I completely removed the instance from monitoring and added it back with pmm-admin config and pmm-admin add mysql (roughly the sequence sketched below), everything worked fine. I will keep watching how this instance behaves.
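For the record, what I did was roughly the following (the server address is the one from my pmm.yml; the MySQL credential flags are omitted and will differ per setup):

pmm-admin remove --all                  # drop every service registered from this client
pmm-admin config --server pmm.qa.com    # re-register the client with the PMM server
pmm-admin add mysql                     # re-add linux:metrics, mysql:metrics and QAN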

I have upgraded the clients and the server to version 1.2.0, and I see the same picture: in Prometheus all endpoints are down, in Grafana I see nothing, and QAN is working OK.

One difference: the instances are not disappearing.

Can you choose any endpoint which is down and run the following commands for it?


docker exec -it pmm-server curl --insecure ENDPOINT_URL | head
docker exec -it pmm-server bash -c 'time curl --insecure ENDPOINT_URL >/dev/null'

I have upgraded the hardware and Docker to 17.06 CE, but nothing changed.

Here are the commands:

# docker exec -it pmm-server curl --insecure https://IP:42000/metrics | head
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 2.0336e-05
go_gc_duration_seconds{quantile="0.25"} 4.4266e-05
go_gc_duration_seconds{quantile="0.5"} 5.1988e-05
go_gc_duration_seconds{quantile="0.75"} 6.3676e-05
go_gc_duration_seconds{quantile="1"} 0.008225946
go_gc_duration_seconds_sum 1.300182597
go_gc_duration_seconds_count 1414
# HELP go_goroutines Number of goroutines that currently exist.

I think it’s not normal:

# docker exec -it pmm-server bash -c 'time curl --insecure https://IP >/dev/null'
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:02:06 --:--:-- 0curl: (7) Failed connect to 172.25.74.241:443; Connection timed out

real 2m7.224s
user 0m0.012s
sys 0m0.024s

Hi Stateros

Please note that we should be testing the same URL in both commands,
and the URL we are testing should be DOWN in the targets list!

I need to understand two things: 1) is the target reachable from PMM Server, and 2) how big the response time is.

Both commands can look like this:


docker exec -it pmm-server curl --insecure https://172.25.74.241:42002/metrics-hr | tail
docker exec -it pmm-server bash -c 'time curl --insecure https://172.25.74.241:42002/metrics-hr >/dev/null'
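Since ‘context deadline exceeded’ usually means a scrape ran past Prometheus’s scrape timeout, it can also help to compare the measured response time with the timeouts configured inside the container. A rough sketch (the config path /etc/prometheus.yml is an assumption about the PMM Server image):

docker exec -it pmm-server grep -E 'scrape_interval|scrape_timeout' /etc/prometheus.yml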

Sorry, my bad.

# docker exec -it pmm-server curl --insecure https://172.25.74.241:42002/metrics-hr | tail
mysql_global_status_threads_running 1
# HELP mysql_global_status_uptime Generic metric from SHOW GLOBAL STATUS.
# TYPE mysql_global_status_uptime untyped
mysql_global_status_uptime 1.6175063e+07
# HELP mysql_global_status_uptime_since_flush_status Generic metric from SHOW GLOBAL STATUS.
# TYPE mysql_global_status_uptime_since_flush_status untyped
mysql_global_status_uptime_since_flush_status 1.6175063e+07
# HELP mysql_up Whether the MySQL server is up.
# TYPE mysql_up gauge
mysql_up 1



# docker exec -it pmm-server bash -c 'time curl --insecure https://172.25.74.241:42002/metrics-hr >/dev/null'
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 48999 100 48999 0 0 184k 0 --:--:-- --:--:-- --:--:-- 185k

real 0m0.276s
user 0m0.060s
sys 0m0.052s

Looks very strange.

According to the Prometheus targets page (your screenshot), the URL cannot be fetched,
but according to this output, it can be fetched from the container without any problems.

Can you share the full output of the following command? (You can replace the actual IP and password with zeros if needed.)

sudo cat /usr/local/percona/pmm-client/pmm.yml

It needs to be run on the problematic client.

~# cat /usr/local/percona/pmm-client/pmm.yml
server_address: pmm.qa.com
client_address: 172.25.74.241
bind_address: 172.25.74.241
client_name: DB-qa-master