Performance issues with PMM deployed as a StatefulSet on K8S



Steps to Reproduce:

My company deploys PMM as a StatefulSet with a single replica on K8S. We are monitoring close to 130 databases, including CloudSQL with MySQL, PostgreSQL, MySQL on VMs, and MongoDB on VMs. There are significant performance issues after logging in and when loading the PostgreSQL Overview dashboard.
I tried to troubleshoot but haven’t found any bottleneck that might cause the problem. It would be a great help if anyone could assist with this issue.


Running on PMM 2.40.0



Additional Information:

Requesting resources:
requests:
  memory: "4Gi"
  cpu: "2"
limits:
  memory: "8Gi"
  cpu: "3"
name: pmm-storage
size: 40Gi

Hi, I think the resources you are giving PMM are too low. Can you try increasing the memory to 16G and the CPU to 4?

@Vu_D_c_Minh are you trying to set up PMM in HA mode on k8s (as described in "Set up PMM in HA mode - Percona Monitoring and Management"), or is this some other deployment?
And yes, the resource increase suggested by @Ivan_Groenewold will be useful.

Hi, we thought about adding more resources as well, but I don’t know whether it would solve the problem. The resources currently requested for PMM appear underutilized. I have attached a screenshot of how much it has used over the last 6 hours.

(The green lines are requests, the red lines are limits, and the blue lines are utilization.)
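
A note on reading utilization graphs like this one: low average usage does not by itself rule out CPU throttling, because a CPU limit is enforced per short scheduling period and brief bursts can be throttled even when the average looks low. A minimal sketch of how to check this, assuming the `cpu.stat` file of the PMM container's cgroup can be read (its path depends on the cgroup version and is not shown here), and using made-up sample values:

```python
def parse_cpu_stat(text):
    """Parse 'key value' lines from a cgroup cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_ratio(stats):
    """Fraction of enforcement periods in which the container was throttled."""
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods


# Sample values are invented for illustration, not taken from this deployment.
sample = """nr_periods 1000
nr_throttled 250
throttled_time 123456789
"""
print(throttled_ratio(parse_cpu_stat(sample)))
```

A ratio well above zero during the slow periods would suggest the CPU limit is being hit in bursts; a ratio near zero would point away from CPU as the bottleneck.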

[What we encounter and a temporary workaround]
The performance is unstable and randomly slow. Sometimes loading takes 2-3 minutes, but sometimes the page loads instantly. We have tried refreshing the page multiple times while it is loading, and that helps a bit.

[PMM in HA mode]
We suspect that the bottleneck is VictoriaMetrics, because loading this dashboard sends multiple requests and receives multiple responses. Separating the components might therefore help. We are doing a POC, but we only started recently, and I see that not many people have tried this, as it is still in technical preview.

Do you have any advice on this matter? Is HA mode suitable for now? Is there any way we can identify the bottleneck?

Thank you!
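
One possible starting point for the bottleneck question: time the same query repeatedly against the metrics backend directly, bypassing the dashboards, and look at the spread. The sketch below assumes the backend exposes the Prometheus-compatible `/api/v1/query` endpoint under some base URL; the base URL and the query shown in the comment are hypothetical, and authentication is left out and would need to be added.

```python
import statistics
import time
import urllib.parse
import urllib.request


def time_calls(fetch, runs=10):
    """Call fetch() `runs` times and return each call's duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fetch()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return min, median and max so that occasional slow calls stand out."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


def make_query_fetcher(base_url, promql, timeout=300):
    """Build a fetch() that runs one instant query and reads the full response.

    base_url is an assumption about where the query API is reachable.
    """
    query_string = urllib.parse.urlencode({"query": promql})
    url = base_url.rstrip("/") + "/api/v1/query?" + query_string

    def fetch():
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()

    return fetch


# Hypothetical usage (replace the URL and query with real values):
# fetch = make_query_fetcher("https://pmm.example.com/prometheus", "up")
# print(summarize(time_calls(fetch, runs=20)))
```

If the direct queries show the same occasional multi-minute outliers, the metrics backend is a likely suspect; if they stay consistently fast while the dashboard is slow, the delay is more likely elsewhere in the request path.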