PMM 3.6.0 – ClickHouse background merge loop on system.metric_log causing 100% CPU and repeated MEMORY_LIMIT_EXCEEDED

Hello,

I am experiencing a severe performance issue with PMM 3.6.0 where the server becomes extremely slow and CPU usage reaches 100% continuously.

Investigation shows that ClickHouse repeatedly attempts background merges on the system.metric_log table, but every merge attempt fails with a MEMORY_LIMIT_EXCEEDED exception. The failed merge is immediately retried, resulting in an infinite loop that consumes CPU.

This makes the PMM instance almost unusable.

Environment

  • PMM version: 3.6.0

  • ClickHouse version bundled with PMM: 25.3.6.56

  • Deployment: containerized environment (Podman)

No unusual workload was running when the issue appeared.

Observed behaviour

ClickHouse repeatedly schedules merges on the system.metric_log table:

system.metric_log (MergerMutator): Selected parts for merge
system.metric_log (MergerMutator): Merged parts

However, the merge then fails with:

Code: 241. DB::Exception: (total) memory limit exceeded
would use 5.84 GiB (attempt to allocate chunk of ~4 MiB)
current RSS: ~1.3 GiB
maximum: 5.84 GiB

The error occurs during the merge execution phase:

While executing MergeTreeSequentialSource
while reading from part ... in table system.metric_log

Immediately after the failure, ClickHouse schedules the same merge again, which fails in the same way. This cycle repeats continuously.

As a result:

  • ClickHouse background threads consume full CPU

  • PMM UI becomes very slow

  • The system remains in a constant retry loop

We have faced this memory issue since upgrading the PMM server and clients from 3.4.0 to 3.6.0, and it reappears after the server containers are redeployed.

Are you aware of this issue?

Thanks in advance for any help.

@Henryx

I see some fixes proposed for v3.7.0 that should address these problems. Meanwhile, you can test with the changes mentioned to see if they help.

https://perconadev.atlassian.net/issues?jql=textfields%20~%20"clickhouse%20memory*"&selectedIssue=PMM-14722

When using ClickHouse with less than 16GB of RAM, we recommend the following:

  • Lower the size of the mark cache in the config.xml. It can be set as low as 500 MB, but it cannot be set to zero.
  • Lower the number of query processing threads down to 1.
  • Lower the max_block_size to 8192. Values as low as 1024 can still be practical.
  • Lower max_download_threads to 1.
  • Set input_format_parallel_parsing and output_format_parallel_formatting to 0.
  • Disable writing to log tables, since it keeps the background merge task reserving RAM to merge them. The tables to disable are asynchronous_metric_log, metric_log, text_log and trace_log.
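
As a rough sketch, the recommendations above could be expressed as two ClickHouse override files. The file names and their location inside the PMM container are assumptions (I have not checked where PMM 3.6.0 keeps its ClickHouse configuration), and I am reading "query processing threads" as the max_threads setting. The values are taken from the list; 524288000 bytes is 500 MiB.

```xml
<!-- Hypothetical server override, e.g. config.d/low-memory.xml -->
<clickhouse>
    <!-- Mark cache lowered to 500 MiB; it cannot be set to zero -->
    <mark_cache_size>524288000</mark_cache_size>

    <!-- Remove the log table definitions so the server stops writing to them -->
    <asynchronous_metric_log remove="1"/>
    <metric_log remove="1"/>
    <text_log remove="1"/>
    <trace_log remove="1"/>
</clickhouse>
```

```xml
<!-- Hypothetical profile override, e.g. users.d/low-memory.xml -->
<clickhouse>
    <profiles>
        <default>
            <!-- Assumed mapping of "query processing threads" -->
            <max_threads>1</max_threads>
            <max_block_size>8192</max_block_size>
            <max_download_threads>1</max_download_threads>
            <input_format_parallel_parsing>0</input_format_parallel_parsing>
            <output_format_parallel_formatting>0</output_format_parallel_formatting>
        </default>
    </profiles>
</clickhouse>
```

ClickHouse needs a restart to pick up the removed log tables. As far as I know this only stops new writes, so the existing system.metric_log table and its parts are not dropped by these settings alone.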

A similar memory-limit issue is discussed in https://perconadev.atlassian.net/browse/PMM-14788 and in the related pull request #779 in percona/percona-helm-charts ("PMM-14788 Increase memory resources for ClickHouse" by ademidoff).

That pull request increased the limits for ClickHouse server resources to 8Gi memory and 4 CPU, up from 4Gi and 2 respectively.

It also removed the explicit requests for memory and CPU, which previously reserved 1Gi memory and 500m CPU.

Did you set any resource quota for Docker/Podman? What resources (CPU/memory) does the host OS have?