Errors in collecting metrics under high database load

Hello!
I have installed PMM Server version 3. On a heavily loaded test server running PostgreSQL 14, I installed PMM agent version 3 and pg_stat_monitor 2.2. Metrics are being received.

pg_stat_monitor settings on the monitored server:
pg_stat_monitor.pgsm_bucket_time: 60
pg_stat_monitor.pgsm_enable_overflow: true
pg_stat_monitor.pgsm_enable_pgsm_query_id: false
pg_stat_monitor.pgsm_enable_query_plan: true
pg_stat_monitor.pgsm_histogram_max: 30000
pg_stat_monitor.pgsm_histogram_min: 50
pg_stat_monitor.pgsm_max: 512
pg_stat_monitor.pgsm_max_buckets: 20
pg_stat_monitor.pgsm_normalized_query: true
pg_stat_monitor.pgsm_query_shared_buffer: 384
pg_stat_monitor.pgsm_track_application_names: false
pg_stat_monitor.pgsm_track_planning: false
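
For reference, the effective values can be read back from pg_settings on the monitored server; the context column also shows which of them require a restart to change. A minimal check, assuming the extension is loaded via shared_preload_libraries:

-- Confirm which pg_stat_monitor values are actually applied on the server
SELECT name, setting, unit, context
FROM pg_settings
WHERE name LIKE 'pg_stat_monitor.%'
ORDER BY name;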

Under high PostgreSQL load, once QPS exceeds 10K, the following errors occur:
caller=postgres_exporter.go:770 level=error err="Error opening connection to database (postgres://sa_percona_monitoring:PASSWORD_REMOVED@localhost:5432/base_1?connect_timeout=1&sslmode=disable): read tcp 127.0.0.1:34926->127.0.0.1:5432: i/o timeout" agentID=25836318-331d-4630-9cf8-7b739b9c5f4f component=agent-process type=postgres_exporter
caller=postgres_exporter.go:770 level=error err="Error opening connection to database (postgres://sa_percona_monitoring:PASSWORD_REMOVED@localhost:5432/base_2?connect_timeout=1&sslmode=disable): read tcp 127.0.0.1:34944->127.0.0.1:5432: i/o timeout" agentID=25836318-331d-4630-9cf8-7b739b9c5f4f component=agent-process type=postgres_exporter

Additionally, there are gaps in the graphs: PostgreSQL metrics are missing in PMM. During the periods when the pg_stat_monitor metrics disappear, work with PostgreSQL itself continues normally: the services keep sending queries to the database and receiving responses.
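
For what it's worth, a quick way to see whether the exporter's 1-second connect_timeout is simply being exceeded because the server is saturated with backends at those moments is a snapshot of pg_stat_activity taken during a gap (a rough sketch; the sa_percona_monitoring role name is taken from the errors above):

-- Snapshot of backend usage at the moment of a metrics gap:
-- how close the server is to max_connections, and whether the
-- monitoring role has backends open at that time.
SELECT count(*) AS total_backends,
       current_setting('max_connections') AS max_connections,
       count(*) FILTER (WHERE usename = 'sa_percona_monitoring') AS monitoring_backends
FROM pg_stat_activity;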

The following messages are appearing in the PostgreSQL logs:
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp3145589.7", size 314170608
STATEMENT: SELECT /* agent='pgstatmonitor' */ "pg_stat_monitor"."bucket", "pg_stat_monitor"."client_ip", "pg_stat_monitor"."query", "pg_stat_monitor"."calls", "pg_stat_monitor"."shared_blks_hit", "pg_stat_monitor"."shared_blks_read", "pg_stat_monitor"."shared_blks_dirtied", "pg_stat_monitor"."shared_blks_written", "pg_stat_monitor"."local_blks_hit", "pg_stat_monitor"."local_blks_read", "pg_stat_monitor"."local_blks_dirtied", "pg_stat_monitor"."local_blks_written", "pg_stat_monitor"."temp_blks_read", "pg_stat_monitor"."temp_blks_written", "pg_stat_monitor"."resp_calls", "pg_stat_monitor"."cpu_user_time", "pg_stat_monitor"."cpu_sys_time", "pg_stat_monitor"."rows", "pg_stat_monitor"."relations", "pg_stat_monitor"."datname", "pg_stat_monitor"."userid", "pg_stat_monitor"."top_queryid", "pg_stat_monitor"."planid", "pg_stat_monitor"."query_plan", "pg_stat_monitor"."top_query", "pg_stat_monitor"."application_name", "pg_stat_monitor"."cmd_type", "pg_stat_monitor"."cmd_type_text", "pg_stat_monitor"."elevel", "pg_stat_monitor"."sqlcode", "pg_stat_monitor"."message", "pg_stat_monitor"."pgsm_query_id", "pg_stat_monitor"."dbid", "pg_stat_monitor"."blk_read_time", "pg_stat_monitor"."blk_write_time", "pg_stat_monitor"."total_exec_time", "pg_stat_monitor"."min_exec_time", "pg_stat_monitor"."max_exec_time", "pg_stat_monitor"."mean_exec_time", "pg_stat_monitor"."stddev_exec_time", "pg_stat_monitor"."total_plan_time", "pg_stat_monitor"."min_plan_time", "pg_stat_monitor"."max_plan_time", "pg_stat_monitor"."mean_plan_time", "pg_stat_monitor"."wal_records", "pg_stat_monitor"."wal_fpi", "pg_stat_monitor"."wal_bytes", "pg_stat_monitor"."plans", "pg_stat_monitor"."comments", "pg_stat_monitor"."bucket_start_time", "pg_stat_monitor"."username" FROM "pg_stat_monitor" WHERE queryid IS NOT NULL AND query IS NOT NULL AND bucket_done AND pgsm_query_id IS NOT NULL
WARNING: [pg_stat_monitor] pgsm_store: Hash table is out of memory and can no longer store queries!
DETAIL: You may reset the view or when the buckets are deallocated, pg_stat_monitor will resume saving queries. Alternatively, try increasing the value of pg_stat_monitor.pgsm_max.
WARNING: [pg_stat_monitor] pg_stat_monitor_internal: Hash table is out of memory and can no longer store queries!
DETAIL: You may reset the view or when the buckets are deallocated, pg_stat_monitor will resume saving queries. Alternatively, try increasing the value of pg_stat_monitor.pgsm_max.
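
Following the DETAIL hint, the two options seem to be resetting the collected statistics or enlarging the shared-memory buffers. A sketch of what that could look like (the target values are only examples, and pgsm_query_shared_buffer is my own addition beyond the hint; as far as I understand these are shared-memory sizes, so a PostgreSQL restart is needed after changing them):

-- Immediate relief: discard what pg_stat_monitor has accumulated so it can store queries again
SELECT pg_stat_monitor_reset();

-- Longer term: give the hash table more room (illustrative values, in MB)
ALTER SYSTEM SET pg_stat_monitor.pgsm_max = 1024;
ALTER SYSTEM SET pg_stat_monitor.pgsm_query_shared_buffer = 512;
-- ...followed by a PostgreSQL restart for the new sizes to take effect.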


My issue is very similar to this one: https://perconadev.atlassian.net/browse/PMM-8646

How can I increase the timeout for getting a response from PostgreSQL for postgres_exporter?


Hello!

The question remains open, and resolving this issue is a priority for us.

Why does an "i/o timeout" error occur in the postgres_exporter component under high PostgreSQL load, while all other metrics from the host are still being collected normally?

Log:

pmm-agent[2113148]: time="2025-11-17T11:50:38.255+00:00" level=info msg="Sending 2644 buckets." agentID=ae04d56c-6ec5-4c48-80da-cb8951b8eb7c component=agent-builtin type=qan_postgresql_pgstatmonitor_agent

pmm-agent[2113148]: time="2025-11-17T11:50:39.820+00:00" level=error msg="ts=2025-11-17T11:50:39.820Z caller=datasource.go:107 
level=error msg=\"Error opening connection to database\" dsn=\"postgres://sa_percona_monitoring:PASSWORD_REMOVED@server:5432/postgres?connect_timeout=1&sslmode=disable\" 
err=\"read tcp host:59514->host:5432: i/o timeout\"" agentID=85953d85-a453-4e9b-a67d-241144e64b6d component=agent-process type=postgres_exporter

pmm-agent[2113148]: time="2025-11-17T11:50:41.822+00:00" level=error msg="ts=2025-11-17T11:50:41.822Z caller=postgres_exporter.go:770 
level=error err=\"Error opening connection to database (postgres://sa_percona_monitoring:PASSWORD_REMOVED@server:5432/postgres?connect_timeout=1&sslmode=disable): 
read tcp host:59522->host:5432: i/o timeout\"" agentID=85953d85-a453-4e9b-a67d-241144e64b6d component=agent-process type=postgres_exporter

Under high load, when the postgres_exporter issue occurs, other connections and query executions proceed normally, and no other errors are observed on PostgreSQL itself. The only errors appear in the pmm-agent log, as shown above.
