PMM doesn't start under high load and consumes all allocated CPU resources

Description:

We have a fairly large production infrastructure with around 100 MySQL and 30 MongoDB databases, running in Docker on approximately 25 physical machines under high load.

The problem is that when we need to restart the PMM pod (for example, during an update), Grafana doesn’t start until we stop the pmm-client.service on all machines running the databases (clients).
We set a limit of 10 CPUs for the pod, but when it starts it consumes the entire limit without making any progress (“1” in the screenshot).
However, if we start the pod first and only then connect all the clients to the already running pod, it consumes no more than 1.5 CPUs (“3” in the screenshot).
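For reference, this is roughly how the limit and the startup CPU consumption can be checked; the namespace and pod name below are placeholders, and kubectl top requires metrics-server in the cluster.

# placeholder namespace/pod name; the container index may differ in your deployment
kubectl -n pmm get pod pmm-0 -o jsonpath='{.spec.containers[0].resources}'
# watch per-container CPU usage during startup (needs metrics-server)
kubectl -n pmm top pod pmm-0 --containers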

The pod can also randomly start using all of its allocated CPUs (4 CPUs in this example, “2” in the screenshot), causing Grafana to crash, and it won’t start again until we repeat the steps described above.
The logs only show supervisord repeatedly trying to restart Grafana, without success.
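To dig deeper than the supervisord output, one can exec into the PMM container and query supervisord directly; the commands below are a sketch with a placeholder namespace/pod name, and the log path is assumed from PMM 2 defaults.

# placeholder namespace/pod name
kubectl -n pmm exec -it pmm-0 -- supervisorctl status grafana
# Grafana's own log (path assumed from PMM 2 defaults)
kubectl -n pmm exec -it pmm-0 -- tail -n 100 /srv/logs/grafana.log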

Steps to Reproduce:

I’m not sure how to reproduce this issue outside of our production cluster.
We have a staging environment with an identical helm chart configuration but significantly fewer databases and lower load, and everything works fine there.

Version:

Kubernetes version: 1.20.7

pmm-admin/pmm-agent version:
PMMVersion: 2.38.1
FullCommit: f7772e88783b177278830fbdf23b4aa526f33bf2

Helm chart version: 1.2.3

Logs:

Crash loop when Grafana crashes after all CPUs were used (at 12:51):

WARN[2023-07-12T08:26:12.586+00:00] Configuration warning: unknown environment variable "LC_ALL=en_US.utf8".
2023-07-12 08:26:12,726 INFO Included extra file "/etc/supervisord.d/alertmanager.ini" during parsing
2023-07-12 08:26:12,727 INFO Included extra file "/etc/supervisord.d/dbaas-controller.ini" during parsing
2023-07-12 08:26:12,727 INFO Included extra file "/etc/supervisord.d/grafana.ini" during parsing
2023-07-12 08:26:12,727 INFO Included extra file "/etc/supervisord.d/pmm.ini" during parsing
2023-07-12 08:26:12,727 INFO Included extra file "/etc/supervisord.d/prometheus.ini" during parsing
2023-07-12 08:26:12,727 INFO Included extra file "/etc/supervisord.d/qan-api2.ini" during parsing
2023-07-12 08:26:12,727 INFO Included extra file "/etc/supervisord.d/supervisord.ini" during parsing
2023-07-12 08:26:12,727 INFO Included extra file "/etc/supervisord.d/victoriametrics.ini" during parsing
2023-07-12 08:26:12,727 INFO Included extra file "/etc/supervisord.d/vmalert.ini" during parsing
2023-07-12 08:26:12,727 INFO Included extra file "/etc/supervisord.d/vmproxy.ini" during parsing
2023-07-12 08:26:12,727 INFO Set uid to user 0 succeeded
2023-07-12 08:26:12,729 INFO RPC interface 'supervisor' initialized
2023-07-12 08:26:12,729 INFO supervisord started with pid 1
2023-07-12 08:26:13,732 INFO spawned: 'pmm-update-perform-init' with pid 20
2023-07-12 08:26:13,735 INFO spawned: 'postgresql' with pid 21
2023-07-12 08:26:13,737 INFO spawned: 'clickhouse' with pid 22
2023-07-12 08:26:13,739 INFO spawned: 'grafana' with pid 26
2023-07-12 08:26:13,741 INFO spawned: 'nginx' with pid 31
2023-07-12 08:26:13,743 INFO spawned: 'victoriametrics' with pid 33
2023-07-12 08:26:13,745 INFO spawned: 'vmalert' with pid 34
2023-07-12 08:26:13,747 INFO spawned: 'alertmanager' with pid 35
2023-07-12 08:26:13,749 INFO spawned: 'vmproxy' with pid 40
2023-07-12 08:26:13,750 INFO spawned: 'qan-api2' with pid 41
2023-07-12 08:26:13,752 INFO spawned: 'pmm-managed' with pid 42
2023-07-12 08:26:13,753 INFO spawned: 'pmm-agent' with pid 48
2023-07-12 08:26:13,765 INFO exited: qan-api2 (exit status 1; not expected)
2023-07-12 08:26:14,731 INFO success: pmm-update-perform-init entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-07-12 08:26:14,733 INFO success: postgresql entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-07-12 08:26:14,736 INFO success: clickhouse entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-07-12 08:26:14,738 INFO success: grafana entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-07-12 08:26:14,740 INFO success: nginx entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-07-12 08:26:14,741 INFO success: victoriametrics entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-07-12 08:26:14,743 INFO success: vmalert entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-07-12 08:26:14,745 INFO success: alertmanager entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-07-12 08:26:14,748 INFO success: vmproxy entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-07-12 08:26:14,751 INFO success: pmm-managed entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-07-12 08:26:14,752 INFO success: pmm-agent entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-07-12 08:26:14,766 INFO spawned: 'qan-api2' with pid 334
2023-07-12 08:26:16,014 INFO success: qan-api2 entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-07-12 08:26:18,382 INFO exited: pmm-update-perform-init (exit status 0; expected)
2023-07-13 12:51:21,753 INFO exited: grafana (exit status 2; not expected)
2023-07-13 12:51:21,910 INFO spawned: 'grafana' with pid 41886
2023-07-13 12:51:22,909 INFO success: grafana entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-07-13 12:56:04,945 INFO exited: grafana (exit status 2; not expected)
2023-07-13 12:56:04,946 INFO spawned: 'grafana' with pid 51974
2023-07-13 12:56:05,995 INFO success: grafana entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-07-13 13:00:56,710 INFO exited: grafana (exit status 2; not expected)
2023-07-13 13:00:56,712 INFO spawned: 'grafana' with pid 62062
2023-07-13 13:00:57,712 INFO success: grafana entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-07-13 13:12:26,967 INFO exited: grafana (exit status 2; not expected)
2023-07-13 13:12:26,996 INFO spawned: 'grafana' with pid 72243
2023-07-13 13:12:28,008 INFO success: grafana entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-07-13 13:18:00,872 INFO exited: grafana (exit status 2; not expected)
2023-07-13 13:18:00,903 INFO spawned: 'grafana' with pid 82387
2023-07-13 13:18:01,903 INFO success: grafana entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-07-13 13:23:05,198 INFO exited: grafana (exit status 2; not expected)
2023-07-13 13:23:05,205 INFO spawned: 'grafana' with pid 92519
2023-07-13 13:23:06,205 INFO success: grafana entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-07-13 13:28:15,922 INFO exited: grafana (exit status 2; not expected)
2023-07-13 13:28:15,924 INFO spawned: 'grafana' with pid 102595
...

Expected Result:

The PMM pod starts and Grafana comes up without us having to stop all the clients.

Actual Result:

The PMM pod doesn’t start and Grafana doesn’t come up until we stop pmm-client.service on all clients and restart the PMM pod in Kubernetes.
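For illustration, the workaround described above looks roughly like this; the host list, namespace, and pod name are placeholders.

# 1) stop the client on every database host (placeholder host names)
for h in db01 db02 db03; do ssh "$h" 'sudo systemctl stop pmm-client.service'; done
# 2) delete the PMM pod so Kubernetes recreates it, then wait for Grafana to come up
kubectl -n pmm delete pod pmm-0
# 3) start the clients again once Grafana is reachable
for h in db01 db02 db03; do ssh "$h" 'sudo systemctl start pmm-client.service'; done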

Additional Information:

Previously, we used an older client (pmm2-client-2.4.0-6.bionic, installed as an apt package) with an older PMM Server image (percona/pmm-server:2.1) and the old setup approach, and we didn’t encounter such issues.


Changed the Metrics Resolution to 30/60/90 and gave the pod 11 CPUs; that seems to have worked.

But we still have no idea why, once every 2-3 days, there is a spike of high CPU utilization.
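For reference, a resolution change like this can be made in the PMM UI (Settings > Metrics Resolution) or via the server API; the call below is only a sketch, with a placeholder hostname and default credentials, so verify the endpoint against the API docs for your PMM version.

# sketch only: placeholder host, default admin credentials
curl -sk -u admin:admin -X POST https://pmm.example.com/v1/Settings/Change \
  -H 'Content-Type: application/json' \
  -d '{"metrics_resolutions": {"hr": "30s", "mr": "60s", "lr": "90s"}}'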


@Ivan_Piskunov, thanks for letting us know what helped in your case. The symptoms also look similar to what we see in bugs related to [PMM-4466] Migrate Grafana from using SQLite to PostgreSQL - Percona JIRA.
In 2.40 we plan to do that migration, which should improve PMM's concurrency; at the moment the bottleneck is the SQLite database inside.


Hello @Roma_Novikov,
Thank you for the reply, we'll look forward to that fix!