PMM instance down?

scar2yjs · December 14, 2016, 8:25pm

Hello,

Our PMM EC2 instance has just crashed. Is it because of too many open TCP sockets? What is the problem?

The log is full of entries like these:
2016/12/15 10:24:41 http: Accept error: accept tcp 172.17.0.1:42002: accept4: too many open files; retrying in 5ms
2016/12/15 10:24:41 http: Accept error: accept tcp 172.17.0.1:42002: accept4: too many open files; retrying in 10ms
2016/12/15 10:24:42 http: Accept error: accept tcp 172.17.0.1:42002: accept4: too many open files; retrying in 20ms
2016/12/15 10:24:43 http: TLS handshake error from 172.17.0.2:59002: write tcp 172.17.0.1:42002->172.17.0.2:59002: write: broken pipe
2016/12/15 10:24:43 http: TLS handshake error from 172.17.0.2:48450: EOF
2016/12/15 10:24:44 http: TLS handshake error from 172.17.0.2:60626: write tcp 172.17.0.1:42002->172.17.0.2:60626: write: broken pipe

thanks : (
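The "accept4: too many open files" message means the process has reached its file descriptor limit, and every open socket counts against that limit. A minimal sketch, assuming a Linux host where /proc is readable, for comparing a process's open descriptor count with its limit. The helper names `parse_max_open_files` and `fd_usage` are made up for this illustration and are not part of PMM.

```python
import os
from pathlib import Path


def parse_max_open_files(limits_text):
    """Return (soft, hard) from the text of /proc/<pid>/limits.

    A value of 'unlimited' is returned as None.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Row layout: Max open files <soft> <hard> files
            soft, hard = line.split()[3:5]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max open files' row found")


def fd_usage(pid):
    """Return (open_fds, soft_limit, hard_limit) for a process.

    Reading another user's process normally requires root.
    """
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    soft, hard = parse_max_open_files(Path(f"/proc/{pid}/limits").read_text())
    return open_fds, soft, hard
```

Calling `fd_usage(pid)` with the PID of the exporter that logs the accept errors, a few times in a row, would show whether the open descriptor count keeps climbing toward the soft limit.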

scar2yjs · December 14, 2016, 9:17pm

[root@ip-10-2-21-65 log]# docker version
Client:
Version: 1.12.2
API version: 1.24
Go version: go1.6.3
Git commit: bb80604
Built:
OS/Arch: linux/amd64

Server:
Version: 1.12.2
API version: 1.24
Go version: go1.6.3
Git commit: bb80604
Built:
OS/Arch: linux/amd64

REPOSITORY           TAG      IMAGE ID       CREATED       SIZE
percona/pmm-server   1.0.7    a91f4f6237a9   5 days ago    714.4 MB
percona/pmm-server   latest   0eade99a1612   8 weeks ago   652.9 MB

[root@ip-10-2-21-65 log]# pmm-admin -v
1.0.7

scar2yjs · December 14, 2016, 9:40pm

weber · December 15, 2016, 2:15am

Are you saying PMM caused the “too many open TCP sockets” problem?
Do you have netstat output from that time?

scar2yjs · December 15, 2016, 3:00am

1. The number of sockets keeps increasing until the server becomes unavailable.
2. There are TLS errors, and the targets show DOWN status.
3. prometheus/targets shows: Get http://localhost:9100/metrics: dial tcp [::1]:9100: i/o timeout
If I access the endpoint using curl, I see an SSL error.

netstat
tcp 132 0 172.17.0.1:42003 172.17.0.2:35546 ESTABLISHED off (0.00/0/0)
tcp 0 0 172.17.0.1:42010 172.17.0.2:46726 ESTABLISHED keepalive (73.38/0/0)
tcp 0 0 172.17.0.1:42002 172.17.0.2:41970 ESTABLISHED keepalive (30.89/0/0)
tcp 132 0 172.17.0.1:42006 172.17.0.2:43960 ESTABLISHED off (0.00/0/0)
tcp 0 0 172.17.0.1:42011 172.17.0.2:56352 ESTABLISHED keepalive (87.72/0/0)
tcp 0 0 172.17.0.1:42003 172.17.0.2:53234 ESTABLISHED keepalive (120.49/0/0)
tcp 132 0 172.17.0.1:42005 172.17.0.2:47648 ESTABLISHED off (0.00/0/0)
tcp 0 0 172.17.0.1:42007 172.17.0.2:33106 ESTABLISHED keepalive (176.81/0/0)
tcp 0 0 172.17.0.1:42007 172.17.0.2:36374 ESTABLISHED keepalive (99.49/0/0)
tcp 0 0 172.17.0.1:42002 172.17.0.2:47632 ESTABLISHED keepalive (81.06/0/0)
tcp 132 0 172.17.0.1:42009 172.17.0.2:46248 ESTABLISHED off (0.00/0/0)
tcp 132 0 172.17.0.1:42008 172.17.0.2:32788 ESTABLISHED off (0.00/0/0)
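In the netstat output above, the second column is Recv-Q. A non-zero value on an ESTABLISHED socket means the kernel holds received bytes that the application has not read yet. A rough sketch for summarizing such output per local port, assuming the same column order as in the pasted lines (protocol, Recv-Q, Send-Q, local address, foreign address, state). The function name `summarize` is hypothetical.

```python
from collections import Counter


def summarize(netstat_lines):
    """Count TCP connections per local port, and those with unread data."""
    per_port = Counter()
    unread = Counter()
    for line in netstat_lines:
        fields = line.split()
        # Skip headers, blank lines, and non-TCP rows.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        try:
            recv_q = int(fields[1])
        except ValueError:
            continue
        local_port = fields[3].rsplit(":", 1)[-1]
        per_port[local_port] += 1
        if recv_q > 0:
            unread[local_port] += 1
    return per_port, unread
```

Running this on periodic netstat snapshots would show which exporter ports accumulate connections. Whether the connections with pending Recv-Q data are the ones that pile up is an assumption to verify, not something the output above proves.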

weber · December 15, 2016, 4:13am

Are you using internal Docker IPs to communicate between the server and the client?

It looks like the client address is 172.17.0.1.
Can you connect from inside the container to 172.17.0.1 on any service port, e.g. 42002?

The client address should be set to the underlying host's private IP. Internal Docker IPs may not work.
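A small sketch of the check implied here: whether a configured client address falls inside Docker's bridge network. It assumes the default bridge subnet 172.17.0.0/16, which matches the addresses in this thread but can be changed in the Docker daemon configuration. The function name is made up for this example.

```python
import ipaddress

# Assumption: Docker's default bridge subnet; a customized daemon may differ.
DEFAULT_DOCKER_BRIDGE = ipaddress.ip_network("172.17.0.0/16")


def is_docker_bridge_address(addr, bridge=DEFAULT_DOCKER_BRIDGE):
    """Return True if addr lies inside the given Docker bridge subnet."""
    return ipaddress.ip_address(addr) in bridge
```

With the values from this thread, `is_docker_bridge_address("172.17.0.1")` is true, while the host's private address `10.2.21.65` is outside the bridge subnet.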

scar2yjs · December 15, 2016, 6:40am

Thanks weber,

I think the pmm-admin client address and name are set automatically.
I tried to change them, but nothing changed.

// ========================== //
pmm-admin config --bind-address 10.2.21.65
pmm-admin config --client-address 10.2.21.65

pmm-admin info

pmm-admin 1.0.7

PMM Server | localhost
Client Name | ip-10-2-21-65
Client Address | 172.17.0.1
Service Manager | linux-systemd

Go Version | 1.7.4
Runtime Info | linux/amd64

// ============================ //

telnet 172.17.0.1 42002

Trying 172.17.0.1...
Connected to 172.17.0.1.
Escape character is '^]'.

#curl https://172.17.0.1:42002/metrics-hr
curl: (60) Issuer certificate is invalid.
More details here: http://curl.haxx.se/docs/sslcerts.html

curl performs SSL certificate verification by default, using a "bundle"
of Certificate Authority (CA) public keys (CA certs). If the default
bundle file isn't adequate, you can specify an alternate file
using the --cacert option.

curl -i http://172.17.0.1:42002

scar2yjs · December 16, 2016, 2:21am

First of all, re-running pmm-admin config with new options (e.g. pmm-admin config --server xxxxx --client-address xxxx) did not change anything with the server.
So I removed and reinstalled the containers and pmm-client on all 4 EC2 instances, and now pmm-admin info shows the correct values.

[root@ip-10-2-21-65 source]# pmm-admin check-network
PMM Network Status

Server Address | 10.2.21.xx
Client Address | 10.2.21.xx

Connection: Client ← Server

SERVICE TYPE    NAME       REMOTE ENDPOINT    STATUS   HTTPS/TLS   PASSWORD
mysql:metrics   maindb01   10.2.21.xx:42002   DOWN     YES         -

I still cannot resolve the DOWN status of the remote endpoint for the Client ← Server connection, and there is no firewall rule denying the traffic.

weber · December 16, 2016, 2:45am

These commands should work from inside the container:
docker exec -ti pmm-server bash
curl --insecure https://10.2.21.xx:42002
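For scripting the same check, here is a hedged Python sketch that roughly mirrors `curl --insecure`: it skips certificate verification and only reports whether the endpoint answers over HTTPS. It assumes Python 3 is available where it runs, which may not be true inside the pmm-server container. The function name `probe_https` is hypothetical.

```python
import http.client
import ssl
import urllib.error
import urllib.request


def probe_https(url, timeout=5.0):
    """Return (reachable, detail) for an HTTPS endpoint, ignoring cert errors."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # must be disabled before verify_mode
    ctx.verify_mode = ssl.CERT_NONE  # same idea as curl --insecure
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            return True, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, just not with a success status.
        return True, f"HTTP {exc.code}"
    except (OSError, http.client.HTTPException) as exc:
        # Refused connections, timeouts, and TLS failures end up here.
        return False, str(exc)
```

A call such as `probe_https("https://10.2.21.xx:42002")`, with the real client address filled in, would return a reachable flag plus a short detail string.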

scar2yjs · December 16, 2016, 3:16am

Please see below:

docker exec -ti pmm-server-df bash

root@88119a40dbbe:/opt# curl --insecure https://10.2.21.xx:42002

MySQL 3-in-1 exporter

high-res metrics

medium-res metrics

low-res metrics

weber · December 16, 2016, 4:39am

So it works. If you go to the /prometheus/targets page on the server, what do you see?
How much memory is available on the server where Docker runs, and how many PMM clients do you have?

scar2yjs · December 18, 2016, 10:22pm

I have checked the PMM status and logs several times. The problems appear as 100% CPU (the prometheus process only), a memory leak, and a growing number of sockets, which lead to a server hang on t2.medium/t2.large EC2 instances on AWS. With version 1.0.4, these instance sizes were sufficient.
Right after the containers and pmm-admin start, all targets on /prometheus/targets are UP; after a while, they all change to DOWN, and the PMM web interface is no longer available.

ec2 instance : 1
docker images : 2
docker container : 2
pmm-client : 1
metrics/query : 20
account limit: 10 connections.

2016/12/18 00:11:00.235457 analyzer.go:426: qan-analyzer-9117f541-worker crashed: ‘61 2016-12-17 15:10:00 UTC to 2016-12-17 15:11:00 UTC (0-0)’: runtime error: invalid memory address or nil pointer dereference
goroutine 3402694 [running]:
runtime/debug.Stack(0x4868ec, 0xc42000e0f0, 0x2)
/usr/local/go1.7.4/src/runtime/debug/stack.go:24 +0x79
runtime/debug.PrintStack()
/usr/local/go1.7.4/src/runtime/debug/stack.go:16 +0x22
github.com/percona/qan-agent/qan.(*RealAnalyzer).runWorker.func1(0xc42018a000, 0xc420b566c0)
/mnt/workspace/pmm-client-tarball/pmm-client-1.0.7/src/github.com/percona/qan-agent/qan/analyzer.go:427 +0x1f6
panic(0x717600, 0xc42000c060)

weber · December 21, 2016, 5:56am

This is very strange; it seems to me that the underlying environment is very unstable.
It could be memory ballooning, where one instance takes over resources from another, which is usually the case in shared environments without resource reservation.
Can you try an instance with a guaranteed amount of resources?
For PMM, I/O is not that important; it is more CPU- and memory-sensitive.
