Hello,
I am experiencing a severe performance issue with PMM 3.6.0: the server becomes extremely slow and CPU usage stays at 100%.
Investigation shows that ClickHouse repeatedly attempts background merges on the system.metric_log table, but every merge attempt fails with a MEMORY_LIMIT_EXCEEDED exception. The failed merge is immediately retried, resulting in an infinite loop that consumes CPU.
This makes the PMM instance almost unusable.
Environment
- PMM version: 3.6.0
- ClickHouse version bundled with PMM: 25.3.6.56
- Deployment: containerized environment (Podman)
No unusual workload was running when the issue appeared.
Observed behaviour
ClickHouse repeatedly schedules merges on the system.metric_log table:
system.metric_log (MergerMutator): Selected parts for merge
system.metric_log (MergerMutator): Merged parts
However, the merge then fails with:
Code: 241. DB::Exception: (total) memory limit exceeded
would use 5.84 GiB (attempt to allocate chunk of ~4 MiB)
current RSS: ~1.3 GiB
maximum: 5.84 GiB
The error occurs during the merge execution phase:
While executing MergeTreeSequentialSource
while reading from part ... in table system.metric_log
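To make the figures in the error easier to compare, here is a minimal Python sketch that converts them to a common unit. The text is copied from the excerpt above; the regular expression and the helper name `to_mib` are my own assumptions and not a documented ClickHouse log format. It only shows that the total tracked by ClickHouse sits at the limit while the reported RSS is several GiB lower; I am not sure how to interpret that gap.

```python
import re

# Error text copied from the excerpt above. The parsing pattern is an
# assumption based on that excerpt, not a documented ClickHouse format.
ERROR_TEXT = """\
Code: 241. DB::Exception: (total) memory limit exceeded
would use 5.84 GiB (attempt to allocate chunk of ~4 MiB)
current RSS: ~1.3 GiB
maximum: 5.84 GiB
"""

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def to_mib(label, text):
    """Return the size that follows `label` in `text`, converted to MiB."""
    match = re.search(re.escape(label) + r"\s*~?([\d.]+)\s*(MiB|GiB)", text)
    if match is None:
        raise ValueError(f"no size found after {label!r}")
    return float(match.group(1)) * UNIT_TO_MIB[match.group(2)]


would_use = to_mib("would use", ERROR_TEXT)   # total the allocation would reach
rss = to_mib("current RSS:", ERROR_TEXT)      # resident memory reported in the error
maximum = to_mib("maximum:", ERROR_TEXT)      # configured server-wide limit
chunk = to_mib("chunk of", ERROR_TEXT)        # size of the failed allocation
gap = would_use - rss                         # tracked total minus reported RSS

print(f"tracked total minus RSS: {gap / 1024:.2f} GiB")
print(f"limit reached: {would_use >= maximum}")
```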
Immediately after the failure, ClickHouse schedules the same merge again, which fails in the same way. This cycle repeats continuously.
As a result:
- ClickHouse background threads consume full CPU
- PMM UI becomes very slow
- the system remains in a constant retry loop
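The retry cycle described above can be quantified by counting how often a merge is selected and how often it fails. The following is a rough Python sketch, assuming log lines shaped like the excerpts in this post. `SAMPLE_LOG` is a hypothetical sample assembled from those excerpts (real lines also carry timestamps, thread ids and part names), and `count_merge_events` is a helper name I made up for illustration.

```python
from collections import Counter

# Hypothetical sample assembled from the excerpts in this post; real log
# lines carry timestamps, thread ids and part names that are omitted here.
SAMPLE_LOG = """\
system.metric_log (MergerMutator): Selected parts for merge
Code: 241. DB::Exception: (total) memory limit exceeded
system.metric_log (MergerMutator): Selected parts for merge
Code: 241. DB::Exception: (total) memory limit exceeded
"""


def count_merge_events(log_text, table="system.metric_log"):
    """Count merge selections for `table` and Code 241 failures in `log_text`."""
    counts = Counter()
    for line in log_text.splitlines():
        if table in line and "Selected parts for merge" in line:
            counts["selected"] += 1
        elif "Code: 241" in line and "memory limit exceeded" in line:
            counts["memory_limit_failures"] += 1
    return counts


counts = count_merge_events(SAMPLE_LOG)
print(counts["selected"], counts["memory_limit_failures"])
```

If the two counters grow at the same rate over time in the real server log, that matches the loop described here: each selected merge ends in a memory limit failure and is then selected again.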
We have had this memory issue since upgrading the PMM server and clients from 3.4.0 to 3.6.0, and it reappears after the server containers are redeployed.
Are you aware of this issue?
Thanks in advance for any help.