PMM is losing instances

Stateros (Contributor, Current User Role Patron)
Hi. I have updated my PMM server from 1.1.1 to 1.1.5 (the pmm-data container was created 7 months ago with 1.0.5), following the instructions at https://www.percona.com/doc/percona-monitoring-and-management/deploy/server/upgrade.html
Grafana and QAN lost all data. After a few minutes all instances disappeared from Grafana, and in Prometheus I saw all endpoints with status 'DOWN' and the error 'context deadline exceeded' (even for pmm-server and Prometheus itself). On the clients, check-network shows that the linux and mysql metrics endpoints are DOWN. QAN works well without any problems.
I created a new instance and installed the pmm-server and pmm-data containers at 1.1.5. On the new one everything was good for about 30 minutes, then instances started to disappear one by one from Grafana and from Prometheus.

Now I have two PMM servers on different instances, but neither works.
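
For reference, the upgrade followed the linked guide, roughly this sequence (a sketch of the guide's steps; the exact commands I ran may have differed slightly):

    # stop and remove the old server container, keeping the pmm-data container untouched
    docker stop pmm-server
    docker rm pmm-server
    # pull the new image and re-create pmm-server against the existing data container
    docker pull percona/pmm-server:1.1.5
    docker run -d -p 80:80 --volumes-from pmm-data --name pmm-server --restart always percona/pmm-server:1.1.5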

Comments

  • Mykola (Percona Staff Role)
    Can you share the output of the following command?
    docker inspect pmm-data | grep Destination

    If it is empty, just share the full output of docker inspect pmm-data.
  • Stateros (Contributor, Current User Role Patron)
    On the new instance:
    ~# docker inspect pmm-data | grep Destination
    ["Destination": "/var/lib/grafana",
                "Destination": "/var/lib/mysql",
                "Destination": "/opt/consul-data",
                "Destination": "/opt/prometheus/data",
    

    On the old one:
    ~# docker inspect pmm-data
    [
        {
            "Id": "eaf8681be6405469168eacad991f76a5352aace429976143148fb12bd356ce09",
            "Created": "2016-11-29T17:12:22.452424676Z",
            "Path": "/bin/true",
            "Args": [],
            "State": {
                "Status": "created",
                "Running": false,
                "Paused": false,
                "Restarting": false,
                "OOMKilled": false,
                "Dead": false,
                "Pid": 0,
                "ExitCode": 0,
                "Error": "",
                "StartedAt": "0001-01-01T00:00:00Z",
                "FinishedAt": "0001-01-01T00:00:00Z"
            },
            "Image": "sha256:cc8abb43be91c0c1e86c68455e3fe921193706db62d320884c574c610025be83",
            "ResolvConfPath": "",
            "HostnamePath": "",
            "HostsPath": "",
            "LogPath": "",
            "Name": "/pmm-data",
            "RestartCount": 0,
            "Driver": "devicemapper",
            "MountLabel": "",
            "ProcessLabel": "",
            "AppArmorProfile": "",
            "ExecIDs": null,
            "HostConfig": {
                "Binds": null,
                "ContainerIDFile": "",
                "LogConfig": {
                    "Type": "json-file",
                    "Config": {}
                },
                "NetworkMode": "bridge",
                "PortBindings": {},
                "RestartPolicy": {
                    "Name": "no",
                    "MaximumRetryCount": 0
                },
                "AutoRemove": false,
                "VolumeDriver": "",
                "VolumesFrom": null,
                "CapAdd": null,
                "CapDrop": null,
                "Dns": [],
                "DnsOptions": [],
                "DnsSearch": [],
                "ExtraHosts": null,
                "GroupAdd": null,
                "IpcMode": "",
                "Cgroup": "",
                "Links": null,
                "OomScoreAdj": 0,
                "PidMode": "",
                "Privileged": false,
                "PublishAllPorts": false,
                "ReadonlyRootfs": false,
                "SecurityOpt": null,
                "UTSMode": "",
                "UsernsMode": "",
                "ShmSize": 0,
                "ConsoleSize": [
                    0,
                    0
                ],
                "Isolation": "",
                "CpuShares": 0,
                "Memory": 0,
                "CgroupParent": "",
                "BlkioWeight": 0,
                "BlkioWeightDevice": null,
                "BlkioDeviceReadBps": null,
                "BlkioDeviceWriteBps": null,
                "BlkioDeviceReadIOps": null,
                "BlkioDeviceWriteIOps": null,
                "CpuPeriod": 0,
                "CpuQuota": 0,
                "CpusetCpus": "",
                "CpusetMems": "",
                "Devices": [],
                "DiskQuota": 0,
                "KernelMemory": 0,
                "MemoryReservation": 0,
                "MemorySwap": 0,
                "MemorySwappiness": null,
                "OomKillDisable": null,
                "PidsLimit": 0,
                "Ulimits": null,
                "CpuCount": 0,
                "CpuPercent": 0,
                "IOMaximumIOps": 0,
                "IOMaximumBandwidth": 0
            },
            "GraphDriver": {
                "Name": "devicemapper",
                "Data": {
                    "DeviceId": "30",
                    "DeviceName": "docker-202:1-657645-eaf8681be6405469168eacad991f76a5352aace429976143148fb12bd356ce09",
                    "DeviceSize": "10737418240"
                }
            },
            "Mounts": [],
            "Config": {
                "Hostname": "eaf8681be640",
                "Domainname": "",
                "User": "",
                "AttachStdin": false,
                "AttachStdout": true,
                "AttachStderr": true,
                "ExposedPorts": {
                    "443/tcp": {},
                    "80/tcp": {}
                },
                "Tty": false,
                "OpenStdin": false,
                "StdinOnce": false,
                "Env": [
                    "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
                ],
                "Cmd": [
                    "/bin/true"
                ],
                "Image": "percona/pmm-server:1.0.5",
                "Volumes": {
                    "/opt/consul-data": {},
                    "/opt/prometheus/data": {},
                    "/var/lib/grafana": {},
                    "/var/lib/mysql": {}
                },
                "WorkingDir": "/opt",
                "Entrypoint": null,
                "OnBuild": null,
                "Labels": {}
            },
            "NetworkSettings": {
                "Bridge": "",
                "SandboxID": "",
                "HairpinMode": false,
                "LinkLocalIPv6Address": "",
                "LinkLocalIPv6PrefixLen": 0,
                "Ports": null,
                "SandboxKey": "",
                "SecondaryIPAddresses": null,
                "SecondaryIPv6Addresses": null,
                "EndpointID": "",
                "Gateway": "",
                "GlobalIPv6Address": "",
                "GlobalIPv6PrefixLen": 0,
                "IPAddress": "",
                "IPPrefixLen": 0,
                "IPv6Gateway": "",
                "MacAddress": "",
                "Networks": null
            }
        }
    ]
    
  • Mykola (Percona Staff Role)
    Can you also update pmm-client on all hosts to the latest version, please?
  • Stateros (Contributor, Current User Role Patron)
    All clients were updated as soon as the server was updated. Same version, 1.1.5.
  • Mykola (Percona Staff Role)
    Can you share the output of pmm-admin check-network?
  • Stateros (Contributor, Current User Role Patron)
    ~# pmm-admin check-network
    PMM Network Status
    
    Server Address | IP
    Client Address | IP
    
    * System Time
    NTP Server (0.pool.ntp.org)         | 2017-07-14 17:24:52 +0000 UTC
    PMM Server                          | 2017-07-14 17:24:52 +0000 GMT
    PMM Client                          | 2017-07-14 17:24:52 +0000 UTC
    PMM Server Time Drift               | OK
    PMM Client Time Drift               | OK
    PMM Client to PMM Server Time Drift | OK
    
    * Connection: Client --> Server
    -------------------- -------
    SERVER SERVICE       STATUS
    -------------------- -------
    Consul API           OK
    Prometheus API       OK
    Query Analytics API  OK
    
    Connection duration | 425.573µs
    Request duration    | 707.882µs
    Full round trip     | 1.133455ms
    
    
    * Connection: Client <-- Server
    -------------- ------------------------------- -------------------- ------- ---------- ---------
    SERVICE TYPE   NAME                            REMOTE ENDPOINT      STATUS  HTTPS/TLS  PASSWORD
    -------------- ------------------------------- -------------------- ------- ---------- ---------
    linux:metrics  NAME                            IP:42000             DOWN    YES        -
    mysql:metrics  NAME                            IP:42002             DOWN    YES        -
    
    When an endpoint is down it may indicate that the corresponding service is stopped (run 'pmm-admin list' to verify).
    If it's running, check out the logs /var/log/pmm-*.log
    
    When all endpoints are down but 'pmm-admin list' shows they are up and no errors in the logs,
    check the firewall settings whether this system allows incoming connections from server to address:port in question.
    
    Also you can check the endpoint status by the URL: http://pmm_server/prometheus/targets
    

    And from the PMM Server instance:
    # nc -vz client_ip 42000
    Connection to client_ip 42000 port [tcp/*] succeeded!
    # nc -vz client_ip 42002
    Connection to client_ip 42000 port [tcp/*] succeeded
    
  • Mykola (Percona Staff Role)
    Can you share the output of the following command:
    docker exec -it pmm-server curl --insecure https://CLIENT-IP:42000/metrics | head
    
    Please run it on the PMM Server side.
  • Stateros (Contributor, Current User Role Patron)
    What does it show? Some kind of ping?
    ~# docker exec -it pmm-server curl --insecure https://CLIENT-IP:42000/metrics | head
    # HELP go_gc_duration_seconds A summary of the GC invocation durations.
    # TYPE go_gc_duration_seconds summary
    go_gc_duration_seconds{quantile="0"} 3.1984e-05
    go_gc_duration_seconds{quantile="0.25"} 4.4129e-05
    go_gc_duration_seconds{quantile="0.5"} 0.001346496
    go_gc_duration_seconds{quantile="0.75"} 0.00406372
    go_gc_duration_seconds{quantile="1"} 0.012393153
    go_gc_duration_seconds_sum 0.099726205
    go_gc_duration_seconds_count 41
    # HELP go_goroutines Number of goroutines that currently exist.
    write /dev/stdout: broken pipe
    
  • Mykola (Percona Staff Role)
    Hmm, strange.

    Can you check again - is the https://CLIENT-IP:42000/metrics target really down on the Prometheus targets page?
    Because it is accessible from the docker container.

    Can you also measure the response time? Please share the output of the following command:
    docker exec -it pmm-server bash -c 'time curl --insecure https://CLIENT-IP:42000/metrics >/dev/null'
    
  • Stateros (Contributor, Current User Role Patron)
    I've checked once more. The current client has disappeared from Prometheus; I don't see it in the UI. But pmm-admin list shows that everything is OK. I double-checked the host and the IPs.
    # docker exec -it pmm-server bash -c 'time curl --insecure https://CLIENT-IP:42000/metrics >/dev/null'
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 83206  100 83206    0     0   689k      0 --:--:-- --:--:-- --:--:--  694k
    
    real    0m0.123s
    user    0m0.049s
    sys    0m0.037s
    

    It's really strange behaviour. I haven't seen anything like this before. Usually I run the upgrade commands and everything is perfect, but not this time.
  • Mykola (Percona Staff Role)
    Can you remove and add this client again?

    Like:
    pmm-admin remove mysql
    pmm-admin add mysql
    
  • Stateros (Contributor, Current User Role Patron)
    Nothing changed.

    But when I completely removed the instance from monitoring and added it back with pmm-admin config and pmm-admin add mysql, everything worked fine. I will watch how this instance behaves.
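
    The full re-add looked roughly like this (a sketch; I am reconstructing the flags from memory, so they may differ slightly from what was actually typed):

    # drop all monitoring services registered for this host
    pmm-admin remove --all
    # re-register the client with the PMM server, then add MySQL monitoring back
    pmm-admin config --server pmm.qa.com --client-name DB-qa-master
    pmm-admin add mysql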
  • Stateros (Contributor, Current User Role Patron)
    I have upgraded the clients and the server to version 1.2.0 and I see the same picture: in Prometheus all endpoints are down, in Grafana I see nothing, QAN works OK.

    One difference: the instances are not disappearing anymore.
  • Mykola (Percona Staff Role)
    Can you choose any endpoint which is down and run the following commands for it?
    docker exec -it pmm-server curl --insecure ENDPOINT_URL | head
    docker exec -it pmm-server bash -c 'time curl --insecure ENDPOINT_URL >/dev/null'
    
  • Stateros (Contributor, Current User Role Patron)
    I have upgraded the hardware and Docker to 17.06 CE, but nothing changed.

    Here are the commands:
    # docker exec -it pmm-server curl --insecure https://IP:42000/metrics | head
    # HELP go_gc_duration_seconds A summary of the GC invocation durations.
    # TYPE go_gc_duration_seconds summary
    go_gc_duration_seconds{quantile="0"} 2.0336e-05
    go_gc_duration_seconds{quantile="0.25"} 4.4266e-05
    go_gc_duration_seconds{quantile="0.5"} 5.1988e-05
    go_gc_duration_seconds{quantile="0.75"} 6.3676e-05
    go_gc_duration_seconds{quantile="1"} 0.008225946
    go_gc_duration_seconds_sum 1.300182597
    go_gc_duration_seconds_count 1414
    # HELP go_goroutines Number of goroutines that currently exist.
    

    I think it's not normal:
    # docker exec -it pmm-server bash -c 'time curl --insecure https://IP >/dev/null'
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
      0     0    0     0    0     0      0      0 --:--:--  0:02:06 --:--:--     0curl: (7) Failed connect to 172.25.74.241:443; Connection timed out
    
    real    2m7.224s
    user    0m0.012s
    sys    0m0.024s
    
  • Mykola (Percona Staff Role)
    Hi Stateros,

    Note that we should be testing the same URL in both commands, and the URL we test should be one that is down in the targets list!

    I need to understand two things: 1) is the target reachable from the PMM Server, and 2) how big the response time is.

    Both commands could look like this:
    docker exec -it pmm-server curl --insecure https://172.25.74.241:42002/metrics-hr | tail
    docker exec -it pmm-server bash -c 'time curl --insecure https://172.25.74.241:42002/metrics-hr >/dev/null'
    
  • Stateros (Contributor, Current User Role Patron)
    Sorry, my bad.
    # docker exec -it pmm-server curl --insecure https://172.25.74.241:42002/metrics-hr | tail
    mysql_global_status_threads_running 1
    # HELP mysql_global_status_uptime Generic metric from SHOW GLOBAL STATUS.
    # TYPE mysql_global_status_uptime untyped
    mysql_global_status_uptime 1.6175063e+07
    # HELP mysql_global_status_uptime_since_flush_status Generic metric from SHOW GLOBAL STATUS.
    # TYPE mysql_global_status_uptime_since_flush_status untyped
    mysql_global_status_uptime_since_flush_status 1.6175063e+07
    # HELP mysql_up Whether the MySQL server is up.
    # TYPE mysql_up gauge
    mysql_up 1
    
    
    
    # docker exec -it pmm-server bash -c 'time curl --insecure https://172.25.74.241:42002/metrics-hr >/dev/null'
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 48999  100 48999    0     0   184k      0 --:--:-- --:--:-- --:--:--  185k
    
    real    0m0.276s
    user    0m0.060s
    sys    0m0.052s
    
  • Mykola (Percona Staff Role)
    Looks very strange.

    According to the Prometheus targets screenshot, the URL cannot be fetched, but according to this output it can be fetched from the container without any problems.

    Can you share the full output of the following command? (You can replace the actual IP and password with zeros if needed.)
    sudo cat /usr/local/percona/pmm-client/pmm.yml

    It needs to be run on the problematic client.
  • Stateros (Contributor, Current User Role Patron)
    ~# cat /usr/local/percona/pmm-client/pmm.yml
    server_address: pmm.qa.com
    client_address: 172.25.74.241
    bind_address: 172.25.74.241
    client_name: DB-qa-master
    
  • Stateros (Contributor, Current User Role Patron)
    Should I re-run the container with an additional argument, like
    docker run -d   -p 80:80  --volumes-from pmm-data  --name pmm-server  --restart always -e METRICS_MEMORY=4194304 percona/pmm-server:1.2.0
    

    And how can I check that the memory change was applied?
  • Mykola (Percona Staff Role)
    Stateros wrote: »
    Should I re-run the container with an additional argument, like

    Yes, please.
    Stateros wrote: »
    And how can I check that the memory change was applied?

    Run the following command:
    ps ax | grep prometheus

    METRICS_MEMORY is given in kilobytes, so 4194304 * 1024 = 4294967296 bytes is what should appear as the Prometheus target heap size.
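
    Putting it together, the whole sequence would look roughly like this (a sketch, using the container names and image tag already used in this thread):

    # re-create the server container with a 4 GB Prometheus heap (METRICS_MEMORY is in kilobytes)
    docker stop pmm-server && docker rm pmm-server
    docker run -d -p 80:80 --volumes-from pmm-data --name pmm-server --restart always \
        -e METRICS_MEMORY=4194304 percona/pmm-server:1.2.0
    # verify: the prometheus command line should contain -storage.local.target-heap-size=4294967296
    docker exec -it pmm-server ps ax | grep prometheus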
  • Stateros (Contributor, Current User Role Patron)
    It's some kind of magic, but nothing changes. I set up a new m4.large instance and gave Prometheus 4 GB.
    ~# ps aux | grep promet
    ubuntu   17110  1.6  0.6 489068 56784 ?        Sl   09:28   0:07 /usr/sbin/prometheus -config.file=/etc/prometheus.yml -storage.local.path=/opt/prometheus/data -web.listen-address=:9090 -storage.local.retention=720h --storage.local.target-heap-size=4294967296 -storage.local.chunk-encoding-version=2 -web.console.libraries=/usr/share/prometheus/console_libraries -web.console.templates=/usr/share/prometheus/consoles -web.external-url=http://localhost:9090/prometheus/
    


    All previous commands show the same results.
    #docker exec -it pmm-server curl --insecure https://172.25.74.241:42002/metrics-hr | tail
    mysql_global_status_threads_running 1
    # HELP mysql_global_status_uptime Generic metric from SHOW GLOBAL STATUS.
    # TYPE mysql_global_status_uptime untyped
    mysql_global_status_uptime 1.6333843e+07
    # HELP mysql_global_status_uptime_since_flush_status Generic metric from SHOW GLOBAL STATUS.
    # TYPE mysql_global_status_uptime_since_flush_status untyped
    mysql_global_status_uptime_since_flush_status 1.6333843e+07
    # HELP mysql_up Whether the MySQL server is up.
    # TYPE mysql_up gauge
    mysql_up 1
    
    
    ~# docker exec -it pmm-server bash -c 'time curl --insecure https://172.25.74.241:42002/metrics-hr >/dev/null'
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 49001  100 49001    0     0   206k      0 --:--:-- --:--:-- --:--:--  207k
    
    real    0m0.246s
    user    0m0.054s
    sys    0m0.067s
    
    ~# cat /usr/local/percona/pmm-client/pmm.yml
    server_address: pmm.qa.com
    client_address: 172.25.74.241
    bind_address: 172.25.74.241
    client_name: DB-qa-master
    

    Maybe screenshots will help?
  • Stateros (Contributor, Current User Role Patron)
    One more interesting thing: CPU usage is unbelievably high for pmm-server.
    top - 11:43:16 up  2:22,  1 user,  load average: 1.87, 2.03, 2.09
    Tasks: 148 total,   2 running, 146 sleeping,   0 stopped,   0 zombie
    %Cpu(s): 98.7 us,  0.3 sy,  0.0 ni,  0.2 id,  0.2 wa,  0.0 hi,  0.7 si,  0.0 st
    KiB Mem:   8175648 total,  3409272 used,  4766376 free,   177884 buffers
    KiB Swap:        0 total,        0 used,        0 free.  2531164 cached Mem
    
      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    29714 ubuntu    20   0  645656 277648  12844 S 196.4  3.4   3:29.37 prometheus                                      
    29729 root      20   0  207684  12232   4832 S   1.7  0.1   0:01.69 node_exporter
    29704 998       20   0 1379428  75344   6912 S   1.0  0.9   0:00.92 mysqld                                              
    7 root      20   0       0      0      0 S   0.3  0.0   0:05.37 rcu_sched                                        
    29670 root      20   0  117344  14932   3920 S   0.3  0.2   0:00.36 supervisord
    29705 ubuntu    20   0  231832  21688  10252 S   0.3  0.3   0:00.52 consul                                          
    29782 ubuntu    20   0  215512  11076   4712 S   0.3  0.1   0:00.31 orchestrator                                        
    1 root      20   0   33640   2940   1468 S   0.0  0.0   0:01.74 init
        2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd
    
  • RoelVandePaar (Contributor, Inactive User Role Beginner)
    I notice that "--storage.local.target-heap-size=4294967296" has two dashes, whereas all the other options (for example "-storage.local.chunk-encoding-version=2" and "-storage.local.retention=720h") have one dash. Maybe the update was not applied because double dashes were used, whereas Prometheus may require a single dash? Please check.
  • Stateros (Contributor, Current User Role Patron)
    Thanks, I will check. I will also try an instance with 4 cores.
  • Stateros (Contributor, Current User Role Patron)
    I have changed the instance type to m4.xlarge and set 4 GB of memory for Prometheus. After 3 hours of running, everything looks good. FINALLY.
    I am very surprised that Prometheus uses the CPU so intensively while memory is almost all free: 1.07 GB used out of 15.67 GB.

    Thanks, Mykola, for your patience and help.
    Also thanks to RoelVandePaar; two dashes are fine.
  • Stateros (Contributor, Current User Role Patron)
    I did it like in the guide.

    I used the command
    "docker run -d -p 80:80 --volumes-from pmm-data --name pmm-server --restart always -e METRICS_MEMORY=4194304 percona/pmm-server:1.2.0"
    
  • Mykola (Percona Staff Role)
    It looks like more memory is not needed for Prometheus right now.
    Has CPU usage become normal?
    Do you have any issues right now?
    How many database instances do you have?
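
    If CPU usage stays high, one more thing worth checking is whether the Prometheus 1.x local storage engine is under pressure. A rough sketch (the metric names are assumed from the Prometheus 1.x local storage engine; adjust if they differ on your build):

    # query Prometheus' own metrics inside the container and grep the local-storage pressure indicators
    docker exec -it pmm-server curl -s http://localhost:9090/prometheus/metrics \
        | grep -E 'prometheus_local_storage_(memory_chunks|chunks_to_persist|persistence_urgency_score|rushed_mode)'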