pmm-agent dies after some time and requires a restart

Hi,

pmm-agent dies and requires a restart via sudo systemctl restart pmm-agent.

Here is the status and log output:

● pmm-agent.service - pmm-agent
     Loaded: loaded (/lib/systemd/system/pmm-agent.service; enabled; vendor preset: enabled)
     Active: active (running) since Sun 2022-02-20 09:59:58 GMT; 6min ago
   Main PID: 597288 (pmm-agent)
      Tasks: 72 (limit: 19100)
     Memory: 211.7M
     CGroup: /system.slice/pmm-agent.service
             ├─597288 /usr/sbin/pmm-agent --config-file=/usr/local/percona/pmm2/config/pmm-agent.yaml
             ├─597320 /usr/local/percona/pmm2/exporters/vmagent -envflag.enable=true -httpListenAddr=127.0.0.1:42000 -loggerLevel=INFO -promscrape.config=/tmp/vm_agent/agent_id/0c3af199-5b2c-4383-a150-e23054826538/vmagentscrapecfg -remoteWrite.maxDiskUsagePerURL=1073741824 -remoteWrite.tlsInsecureSkipVerify=true -remoteWrite.tmpDataPath=/tmp/vmagent-temp-dir -remoteWrite.url=https://192.168.20.10:32043/victoriametrics/api/v1/write
             ├─597321 /usr/local/percona/pmm2/exporters/postgres_exporter --auto-discover-databases --collect.custom_query.hr --collect.custom_query.hr.directory=/usr/local/percona/pmm2/collectors/custom-queries/postgresql/high-resolution --collect.custom_query.lr --collect.custom_query.lr.directory=/usr/local/percona/pmm2/collectors/custom-queries/postgresql/low-resolution --collect.custom_query.mr --collect.custom_query.mr.directory=/usr/local/percona/pmm2/collectors/custom-queries/postgresql/medium-resolution --exclude-databases=template0,template1,postgres,cloudsqladmin,pmm-managed-dev,azure_maintenance --web.listen-address=:42001
             └─597330 /usr/local/percona/pmm2/exporters/node_exporter --collector.bonding --collector.buddyinfo --collector.cpu --collector.diskstats --collector.entropy --collector.filefd --collector.filesystem --collector.hwmon --collector.loadavg --collector.meminfo --collector.meminfo_numa --collector.netdev --collector.netstat --collector.netstat.fields=^(.*_(InErrors|InErrs|InCsumErrors)|Tcp_(ActiveOpens|PassiveOpens|RetransSegs|CurrEstab|AttemptFails|OutSegs|InSegs|EstabResets|OutRsts|OutSegs)|Tcp_Rto(Algorithm|Min|Max)|Udp_(RcvbufErrors|SndbufErrors)|Udp(6?|Lite6?)_(InDatagrams|OutDatagrams|RcvbufErrors|SndbufErrors|NoPorts)|Icmp6?_(OutEchoReps|OutEchos|InEchos|InEchoReps|InAddrMaskReps|InAddrMasks|OutAddrMaskReps|OutAddrMasks|InTimestampReps|InTimestamps|OutTimestampReps|OutTimestamps|OutErrors|InDestUnreachs|OutDestUnreachs|InTimeExcds|InRedirects|OutRedirects|InMsgs|OutMsgs)|IcmpMsg_(InType3|OutType3)|Ip(6|Ext)_(InOctets|OutOctets)|Ip_Forwarding|TcpExt_(Listen.*|Syncookies.*|TCPTimeouts))$ --collector.processes --collector.standard.go --collector.standard.process --collector.stat --collector.textfile.directory.hr=/usr/local/percona/pmm2/collectors/textfile-collector/high-resolution --collector.textfile.directory.lr=/usr/local/percona/pmm2/collectors/textfile-collector/low-resolution --collector.textfile.directory.mr=/usr/local/percona/pmm2/collectors/textfile-collector/medium-resolution --collector.textfile.hr --collector.textfile.lr --collector.textfile.mr --collector.time --collector.uname --collector.vmstat --collector.vmstat.fields=^(pg(steal_(kswapd|direct)|refill|alloc)_(movable|normal|dma3?2?)|nr_(dirty.*|slab.*|vmscan.*|isolated.*|free.*|shmem.*|i?n?active.*|anon_transparent_.*|writeback.*|unstable|unevictable|mlock|mapped|bounce|page_table_pages|kernel_stack)|drop_slab|slabs_scanned|pgd?e?activate|pgpg(in|out)|pswp(in|out)|pgm?a?j?fault)$ --no-collector.arp --no-collector.bcache --no-collector.conntrack --no-collector.drbd --no-collector.edac --no-collector.infiniband --no-collector.interrupts --no-collector.ipvs --no-collector.ksmd --no-collector.logind --no-collector.mdadm --no-collector.mountstats --no-collector.netclass --no-collector.nfs --no-collector.nfsd --no-collector.ntp --no-collector.qdisc --no-collector.runit --no-collector.sockstat --no-collector.supervisord --no-collector.systemd --no-collector.tcpstat --no-collector.timex --no-collector.wifi --no-collector.xfs --no-collector.zfs --web.disable-exporter-metrics --web.listen-address=:42002

Feb 20 10:06:42 fuse pmm-agent[597288]: INFO[2022-02-20T10:06:42.763+00:00] time="2022-02-20T10:06:42Z" level=error msg="error encoding and sending metric family: write tcp 127.0.0.1:42001->127.0.0.1:39884: write: broken pipe\n" source="log.go:184"  agentID=/agent_id/8963d9bb-ff9a-41c8-82b8-c8a0f4a0e8ce component=agent-process type=postgres_exporter
Feb 20 10:06:42 fuse pmm-agent[597288]: INFO[2022-02-20T10:06:42.763+00:00] time="2022-02-20T10:06:42Z" level=error msg="error encoding and sending metric family: write tcp 127.0.0.1:42001->127.0.0.1:39884: write: broken pipe\n" source="log.go:184"  agentID=/agent_id/8963d9bb-ff9a-41c8-82b8-c8a0f4a0e8ce component=agent-process type=postgres_exporter
Feb 20 10:06:42 fuse pmm-agent[597288]: INFO[2022-02-20T10:06:42.763+00:00] time="2022-02-20T10:06:42Z" level=error msg="error encoding and sending metric family: write tcp 127.0.0.1:42001->127.0.0.1:39884: write: broken pipe\n" source="log.go:184"  agentID=/agent_id/8963d9bb-ff9a-41c8-82b8-c8a0f4a0e8ce component=agent-process type=postgres_exporter
Feb 20 10:06:42 fuse pmm-agent[597288]: INFO[2022-02-20T10:06:42.763+00:00] time="2022-02-20T10:06:42Z" level=error msg="error encoding and sending metric family: write tcp 127.0.0.1:42001->127.0.0.1:39884: write: broken pipe\n" source="log.go:184"  agentID=/agent_id/8963d9bb-ff9a-41c8-82b8-c8a0f4a0e8ce component=agent-process type=postgres_exporter
Feb 20 10:06:42 fuse pmm-agent[597288]: INFO[2022-02-20T10:06:42.763+00:00] time="2022-02-20T10:06:42Z" level=error msg="error encoding and sending metric family: write tcp 127.0.0.1:42001->127.0.0.1:39884: write: broken pipe\n" source="log.go:184"  agentID=/agent_id/8963d9bb-ff9a-41c8-82b8-c8a0f4a0e8ce component=agent-process type=postgres_exporter
Feb 20 10:06:42 fuse pmm-agent[597288]: INFO[2022-02-20T10:06:42.763+00:00] time="2022-02-20T10:06:42Z" level=error msg="error encoding and sending metric family: write tcp 127.0.0.1:42001->127.0.0.1:39884: write: broken pipe\n" source="log.go:184"  agentID=/agent_id/8963d9bb-ff9a-41c8-82b8-c8a0f4a0e8ce component=agent-process type=postgres_exporter
Feb 20 10:06:42 fuse pmm-agent[597288]: INFO[2022-02-20T10:06:42.763+00:00] time="2022-02-20T10:06:42Z" level=error msg="error encoding and sending metric family: write tcp 127.0.0.1:42001->127.0.0.1:39884: write: broken pipe\n" source="log.go:184"  agentID=/agent_id/8963d9bb-ff9a-41c8-82b8-c8a0f4a0e8ce component=agent-process type=postgres_exporter

After restarting pmm-agent, it starts to collect data again with no issues.
Any ideas on how to fix this?
Thanks

Hi Fahad,

Is the issue reproducible? Does pmm-agent die if a PostgreSQL service is removed and added again?

Is the issue reproducible?

I didn’t do anything special. I just spun up pmm-server via Docker as per the docs.

Then I added the PostgreSQL service like this, on the Ubuntu server running Postgres:

sudo pmm-admin add postgresql --username='pmm' --password='my password'

Does pmm-agent die if a PostgreSQL service is removed and added again?

I tried removing it via this:

pmm-admin remove postgresql
pmm-admin remove postgresql /service_id/38deb42a-fb83-4f35-adf5-d11b783cef16

It gives the error:

Service with name "/service_id/38deb42a-fb83-4f35-adf5-d11b783cef16" not found.

So how do I remove and re-add?
I looked here but no luck.

When I restart the agent, it starts to work again for another 12 hours or so.
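
As a stopgap I could schedule a periodic restart with cron (something along the lines of the root crontab entry below; the 6-hour interval is just a guess based on the roughly 12-hour window), but I would rather fix the underlying cause:

# stopgap only: restart pmm-agent periodically
0 */6 * * * /usr/bin/systemctl restart pmm-agent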

Services can be removed from monitoring by service_name.

e.g.
pmm-admin remove postgresql myPostgresqlService1

Thanks. There you go:

➜  ~ pmm-admin list
Service type        Service name           Address and port        Service ID
PostgreSQL          fuse-postgresql        127.0.0.1:5432          /service_id/38deb42a-fb83-4f35-adf5-d11b783cef16

Agent type                           Status           Metrics Mode        Agent ID                                              Service ID
pmm_agent                            Connected                            /agent_id/c244be73-6827-497c-b066-7aaa8ddcbad8
node_exporter                        Running          push                /agent_id/9eab03be-0ded-4c19-abba-766e5039ab57
postgres_exporter                    Running          push                /agent_id/8963d9bb-ff9a-41c8-82b8-c8a0f4a0e8ce        /service_id/38deb42a-fb83-4f35-adf5-d11b783cef16
postgresql_pgstatements_agent        Running                              /agent_id/963917ae-40f7-4bda-95cd-ccc09f0dd600        /service_id/38deb42a-fb83-4f35-adf5-d11b783cef16
vmagent                              Running          push                /agent_id/0c3af199-5b2c-4383-a150-e23054826538

Removing by service name:

➜  ~ pmm-admin remove postgresql fuse-postgresql
Service removed.

➜  ~ pmm-admin list
Service type        Service name        Address and port        Service ID

Agent type           Status           Metrics Mode        Agent ID                                              Service ID
pmm_agent            Connected                            /agent_id/c244be73-6827-497c-b066-7aaa8ddcbad8
node_exporter        Running          push                /agent_id/9eab03be-0ded-4c19-abba-766e5039ab57
vmagent              Running          push                /agent_id/0c3af199-5b2c-4383-a150-e23054826538

The agent still seems to be running:

sudo systemctl status pmm-agent
[sudo] password for fahadshery:
● pmm-agent.service - pmm-agent
     Loaded: loaded (/lib/systemd/system/pmm-agent.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2022-02-21 15:28:56 GMT; 2h 59min ago
   Main PID: 1794130 (pmm-agent)
      Tasks: 55 (limit: 19100)
     Memory: 62.3M
     CGroup: /system.slice/pmm-agent.service
             ├─1794130 /usr/sbin/pmm-agent --config-file=/usr/local/percona/pmm2/config/pmm-agent.yaml
             ├─1794164 /usr/local/percona/pmm2/exporters/node_exporter --collector.bonding --collector.buddyinfo --collector.cpu --collector.diskstats --collector.entropy --coll>
             └─1930922 /usr/local/percona/pmm2/exporters/vmagent -envflag.enable=true -httpListenAddr=127.0.0.1:42000 -loggerLevel=INFO -promscrape.config=/tmp/vm_agent/agent_id>

Feb 21 18:26:12 fuse pmm-agent[1794130]: INFO[2022-02-21T18:26:12.327+00:00] 2022-02-21T18:26:12.327Z        info        VictoriaMetrics/lib/persistentqueue/fastqueue.go:59     >
Feb 21 18:26:12 fuse pmm-agent[1794130]: INFO[2022-02-21T18:26:12.328+00:00] 2022-02-21T18:26:12.327Z        info        VictoriaMetrics/app/vmagent/remotewrite/client.go:143   >
Feb 21 18:26:12 fuse pmm-agent[1794130]: INFO[2022-02-21T18:26:12.328+00:00] 2022-02-21T18:26:12.328Z        info        VictoriaMetrics/app/vmagent/main.go:112        started v>
Feb 21 18:26:12 fuse pmm-agent[1794130]: INFO[2022-02-21T18:26:12.328+00:00] 2022-02-21T18:26:12.328Z        info        VictoriaMetrics/lib/promscrape/scraper.go:96        read>
Feb 21 18:26:12 fuse pmm-agent[1794130]: INFO[2022-02-21T18:26:12.328+00:00] 2022-02-21T18:26:12.328Z        info        VictoriaMetrics/lib/httpserver/httpserver.go:82        s>

Now I have re-added the PostgreSQL service. I’ll let you know how long it keeps reporting metrics:

sudo pmm-admin add postgresql --username='pmm' --password='my password'

Unfortunately, the error returned. Here is the status:

sudo systemctl status pmm-agent
● pmm-agent.service - pmm-agent
     Loaded: loaded (/lib/systemd/system/pmm-agent.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2022-02-21 15:28:56 GMT; 4h 31min ago
   Main PID: 1794130 (pmm-agent)
      Tasks: 79 (limit: 19100)
     Memory: 196.6M
     CGroup: /system.slice/pmm-agent.service
             ├─1794130 /usr/sbin/pmm-agent --config-file=/usr/local/percona/pmm2/config/pmm-agent.yaml
             ├─1794164 /usr/local/percona/pmm2/exporters/node_exporter --collector.bonding --collector.buddyinfo --collector.cpu --collector.diskstats --collector.entropy --coll>
             ├─1931050 /usr/local/percona/pmm2/exporters/postgres_exporter --auto-discover-databases --collect.custom_query.hr --collect.custom_query.hr.directory=/usr/local/per>
             └─1931068 /usr/local/percona/pmm2/exporters/vmagent -envflag.enable=true -httpListenAddr=127.0.0.1:42000 -loggerLevel=INFO -promscrape.config=/tmp/vm_agent/agent_id>

Feb 21 20:00:41 fuse pmm-agent[1794130]: INFO[2022-02-21T20:00:41.006+00:00] time="2022-02-21T20:00:40Z" level=error msg="error encoding and sending metric family: write tcp 127>
Feb 21 20:00:41 fuse pmm-agent[1794130]: INFO[2022-02-21T20:00:41.006+00:00] time="2022-02-21T20:00:40Z" level=error msg="error encoding and sending metric family: write tcp 127>
Feb 21 20:00:41 fuse pmm-agent[1794130]: INFO[2022-02-21T20:00:41.006+00:00] time="2022-02-21T20:00:40Z" level=error msg="error encoding and sending metric family: write tcp 127>
Feb 21 20:00:41 fuse pmm-agent[1794130]: INFO[2022-02-21T20:00:41.006+00:00] time="2022-02-21T20:00:40Z" level=error msg="error encoding and sending metric family: write tcp 127>
Feb 21 20:00:41 fuse pmm-agent[1794130]: INFO[2022-02-21T20:00:41.006+00:00] time="2022-02-21T20:00:40Z" level=error msg="error encoding and sending metric family: write tcp 127>
Feb 21 20:00:41 fuse pmm-agent[1794130]: INFO[2022-02-21T20:00:41.006+00:00] time="2022-02-21T20:00:40Z" level=error msg="error encoding and sending metric family: write tcp 127>
Feb 21 20:00:41 fuse pmm-agent[1794130]: INFO[2022-02-21T20:00:41.006+00:00] time="2022-02-21T20:00:40Z" level=error msg="error encoding and sending metric family: write tcp 127>
Feb 21 20:00:41 fuse pmm-agent[1794130]: INFO[2022-02-21T20:00:41.006+00:00] time="2022-02-21T20:00:40Z" level=error msg="error encoding and sending metric family: write tcp 127>
Feb 21 20:00:41 fuse pmm-agent[1794130]: INFO[2022-02-21T20:00:41.006+00:00] time="2022-02-21T20:00:40Z" level=error msg="error encoding and sending metric family: write tcp 127>
Feb 21 20:00:45 fuse pmm-agent[1794130]: INFO[2022-02-21T20:00:45.599+00:00] 2022-02-21T20:00:45.599Z        error        VictoriaMetrics/lib/promscrape/scrapework.go:258        error when scraping "http://127.0.0.1:42003/metrics?collect%5B%5D=custom_query&collect%5B%5D=exporter&collect%5B%5D=standard.go&collect%5B%5D=standard.process" from job "postgres_exporter_agent_id_2ae3cac5-ddf0-4e3a-b8e5-a2b1b6406a23_hr-5s" with labels {agent_id="2ae3cac5-ddf0-4e3a-b8e5-a2b1b6406a23",agent_type="postgres_exporter",instance=="/agent_id/2ae3cac5-ddf0-4e3a-b8e5-a2b1b6406a23",job="postgres_exporter_agent_id_2ae3cac5-ddf0-4e3a-b8e5-a2b1b6406a23_hr-5s",machine_id="/machine_id/f7808b4544aa4de49e1af28f0fac6570",node_id="/node_id/5f52fec3-5757-454c-a96a-3577369297a8",node_name="fuse",node_type="generic",service_id="/service_id/10fe3b80-38d3-4b57-b11b-6266c0c8f133",service_name="fuse-postgresql",service_type="postgresql"}: cannot read data: cannot scrape "http://127.0.0.1:42003/metrics?collect%5B%5D=custom_query&collect%5B%5D=exporter&collect%5B%5D=standard

Hi Fahad,

Do you use any custom query files for the PostgreSQL service?
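
It might also help to time a manual scrape of the postgres_exporter endpoint, to see whether the custom queries take longer than the 5s high-resolution interval shown in your log. A rough check, using the port from your log output (adjust it if the exporter listens elsewhere):

time curl -s 'http://127.0.0.1:42003/metrics?collect%5B%5D=custom_query' > /dev/null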

Do you use any custom query files for the PostgreSQL service?

Yes, I have two custom query files installed as per the instructions: pg_tuple_statistics.yaml and pg_table_size-details.yaml.

It looks like the exporter can’t process the custom queries within the chosen metrics resolution period.
Could you move the query files into the low-resolution folder and restart pmm-agent?
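
For example, something along these lines, assuming the files currently sit in the high-resolution directory (adjust the source path if they are elsewhere):

sudo mv /usr/local/percona/pmm2/collectors/custom-queries/postgresql/high-resolution/*.yaml /usr/local/percona/pmm2/collectors/custom-queries/postgresql/low-resolution/
sudo systemctl restart pmm-agent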

Could you move the query files into the low-resolution folder and restart pmm-agent?

https://raw.githubusercontent.com/Percona-Lab/pmm-custom-queries/master/postgresql/pg_tuple_statistics.yaml and
https://raw.githubusercontent.com/Percona-Lab/pmm-custom-queries/master/postgresql/pg_table_size-details.yaml are already placed in /usr/local/percona/pmm2/collectors/custom-queries/postgresql/low-resolution as per the instructions. I am not using any other locations…
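
For reference, I fetched them straight into that directory, roughly like this (from memory, so the exact commands are approximate):

cd /usr/local/percona/pmm2/collectors/custom-queries/postgresql/low-resolution
sudo wget https://raw.githubusercontent.com/Percona-Lab/pmm-custom-queries/master/postgresql/pg_tuple_statistics.yaml
sudo wget https://raw.githubusercontent.com/Percona-Lab/pmm-custom-queries/master/postgresql/pg_table_size-details.yaml
sudo chown pmm-agent:pmm-agent *.yaml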

cd /usr/local/percona/pmm2/collectors/custom-queries/postgresql/low-resolution
➜  low-resolution ll
total 12K
-rw-r--r-- 1 pmm-agent pmm-agent  472 Feb  3 14:17 example-queries-postgres.yml
-rw-r--r-- 1 pmm-agent pmm-agent 1.4K Feb 19 22:05 pg_table_size-details.yaml
-rw-r--r-- 1 pmm-agent pmm-agent 3.7K Feb 19 22:34 pg_tuple_statistics.yaml