PostgreSQL 14.2 + PMM 2.26.0-6.el8
Hi guys,
I have a problem with the PMM application and memory usage.
I have two servers, each running several PostgreSQL instances, and the servers replicate to each other. The PMM server is hosted outside the database servers; only pmm-client is installed on the two PostgreSQL servers. The first server is working fine, but the second one has the problem.
For some time I have noticed the following entries in the system logs:
*Jul 13 01:23:00 XXXXXX pmm-agent[3280]: INFO[2022-07-13T01:23:00.027+02:00] Sending 14 buckets. agentID=/agent_id/XXXXXX component=agent-builtin type=qan_postgresql_pgstatements_agent*
*Jul 13 01:23:12 XXXXXX pmm-agent[3280]: INFO[2022-07-13T01:23:12.514+02:00] time="2022-07-13T01:23:12+02:00" level=error msg="error retrieving settings: error running query on database \"XXXXXX:5436\": pg read tcp XXXXXX:37866->XXXXXX:5436: i/o timeout" source="postgres_exporter.go:1612" agentID=/agent_id/XXXXXX component=agent-process type=postgres_exporter*
*Jul 13 01:23:12 XXXXXX pmm-agent[3280]: INFO[2022-07-13T01:23:12.641+02:00] time="2022-07-13T01:23:12+02:00" level=info msg="Error running query on database \"XXXXXX:5434\": pg_postmaster_uptime read tcp XXXXXX:48400->XXXXXX:5434: i/o timeout" source="postgres_exporter.go:1433" agentID=/agent_id/XXXXXX component=agent-process type=postgres_exporter*
*Jul 13 01:23:12 XXXXXX pmm-agent[3280]: INFO[2022-07-13T01:23:12.687+02:00] time="2022-07-13T01:23:12+02:00" level=error msg="queryNamespaceMappings returned 1 errors" source="postgres_exporter.go:1612" agentID=/agent_id/XXXXXX component=agent-process type=postgres_exporter*
*Jul 13 01:23:12 XXXXXX pmm-agent[3280]: INFO[2022-07-13T01:23:12.969+02:00] time="2022-07-13T01:23:12+02:00" level=error msg="Error opening connection to database (postgres://pgpool:PASSWORD_REMOVED@XXXXXX:5432/XXXXXX?connect_timeout=1&sslmode=disable): \"read tcp XXXXXX:52026->XXXXXX:5432: i/o timeout\": too many connection retries" source="postgres_exporter.go:1612" agentID=/agent_id/XXXXXX component=agent-process type=postgres_exporter*
Messages such as “Error opening connection to database”, “i/o timeout”, “too many connection retries”, “error when scraping”, and “Proceeding with outdated query maps, as the Postgres version could not be determined: error scanning version string on” repeat for a few minutes. Then the OOM killer (oom_killer) is invoked and kills processes one after another, including postgres, and the machine goes down. This has happened 4 times in the last month. A week ago I disabled the PMM service on this server, and it has been working fine since then.
There are 512 GB of RAM on this machine. HugePages take up 354 GB of it, and the rest remains available to the operating system. Do you have any idea what could be causing this failure?
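For reference, the memory split above works out to roughly 512 - 354 = 158 GB for everything outside HugePages (the operating system, pmm-agent, the exporters, and PostgreSQL backends' private memory). Below is a minimal sketch of how that figure could be checked from `/proc/meminfo`. The helper names `parse_meminfo` and `hugepage_budget` are made up for illustration, and the sketch assumes a single default huge page size is in use; it only reports the budget and does not say which process consumes it.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers (kB values or page counts)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def hugepage_budget(text):
    """Return total, HugePages-reserved, and remaining memory in GiB."""
    info = parse_meminfo(text)
    total_kb = info["MemTotal"]
    # Reserved huge page memory = number of pages * size of one page (kB).
    huge_kb = info.get("HugePages_Total", 0) * info.get("Hugepagesize", 0)
    kb_per_gb = 1024 ** 2
    return {
        "total_gb": total_kb / kb_per_gb,
        "hugepages_gb": huge_kb / kb_per_gb,
        "non_hugepages_gb": (total_kb - huge_kb) / kb_per_gb,
    }


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        for name, gb in hugepage_budget(f.read()).items():
            print(f"{name}: {gb:.1f}")
```

Note that `MemTotal` is usually slightly lower than the installed RAM, so the numbers on the real machine will not match 512 and 158 exactly.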