After upgrading to 2.39.0, Grafana takes too much CPU

Description:

After upgrading from 2.37.1 to 2.39.0, whenever the dashboard is refreshed the system load increases dramatically and the CPU usage of the Grafana component becomes very high. This makes the web UI hang and return errors because it cannot get data from pmm-server in time.

Steps to Reproduce:

Upgrade to 2.39.0; the problem appears right after the upgrade.

Version:

2.39.0; the OS is CentOS 7.9 (pmm-server) and CentOS 6.10 (pmm-agent).

Logs:

Expected Result:

The pmm-server works normally

Hi @Luke03011.
It appears that your issue is similar to the one reported in [PMM-12415] PMM dashboard has a high CPU load and UI is unresponsive after adding 100 Servers. - Percona JIRA.
We are looking at some potential fixes.
Please note that your client is running on an unsupported operating system (CentOS 6), which may cause issues.

Updates:
After upgrading pmm-server to 2.39.1, I tried to add a batch of MariaDB servers to it. The batch is not that big, roughly 90 servers, and I use Ansible to do this. Starting from the 79th server I could no longer register the MariaDB service with pmm-server; the Ansible playbook reported "Internal server error".
Then I hit the Grafana issue as posted: high CPU and high system load (it can go up to 800, which is insane). I looked into the logs of the Grafana component and found the following:

logger=context t=2023-09-26T09:51:42.600647342Z level=error msg="invalid API key" error="context canceled" traceID=
logger=context t=2023-09-26T09:51:42.600671693Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-09-26T09:51:42.600662537Z level=error msg="invalid username or password" error="database is locked" traceID=
logger=context userId=0 orgId=0 uname= t=2023-09-26T09:51:42.601109018Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=22580 duration=22.580653301s size=67 referer=

It seems Grafana's SQLite database is locked (and I suspect this is related to the concurrent registration operations above).
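To narrow down where the lock comes from, the database files can be inspected directly. This is only a sketch; it assumes the default /srv/grafana path inside the pmm-server container and that sqlite3 and fuser are available there:

# Look for stale -wal/-journal files next to Grafana's database (sketch, default path assumed)
ls -l /srv/grafana/grafana.db*
# Check which journal mode the database is currently using ("delete" is the SQLite default)
sqlite3 /srv/grafana/grafana.db 'PRAGMA journal_mode;'
# Show which processes still hold the file open (fuser may not be installed in the container)
fuser /srv/grafana/grafana.db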

I tried restarting the pmm-server Docker container and rebooting the server hosting it; neither helped.

I searched Google and found a lot of similar cases. I tried backing up the SQLite database and replacing the original file with the backup, but still no luck:

# First stop Grafana
supervisorctl stop grafana
# Back up the database to a new file
cd /srv/grafana
sqlite3 grafana.db '.backup grafana-new.db'
# Swap the two files
mv grafana.db grafana-old.db && mv grafana-new.db grafana.db
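After swapping the files, the natural follow-up (a sketch; whether these exact steps were run is an assumption, but both commands are standard) is to verify the copy and start Grafana again:

# Verify the copied database before bringing Grafana back up; should print "ok"
sqlite3 grafana.db 'PRAGMA integrity_check;'
# Start Grafana again under supervisord
supervisorctl start grafana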

We use both CentOS 6.10 and CentOS 7.9, and before upgrading to 2.39.0 they worked fine.

Update:
The high load seems to be caused by invalid API keys (username and password) coming from the pmm-client side. I checked the logs on the pmm-client side:

INFO[2023-09-27T01:17:51.297+00:00] 2023-09-27T01:17:51.297Z	error	VictoriaMetrics/app/vmagent/remotewrite/client.go:422	unexpected status code received after sending a block with size 40436 bytes to "1:secret-url" during retry #21: 401; response body="{\"code\":16,\"error\":\"invalid API key\",\"message\":\"invalid API key\"}\n"; re-sending the block in 60.000 seconds  agentID=/agent_id/e19f9ad4-b27d-4fba-a821-1fb7302675a8 component=agent-process type=vm_agent
INFO[2023-09-27T01:17:51.299+00:00] 2023-09-27T01:17:51.299Z	error	VictoriaMetrics/app/vmagent/remotewrite/client.go:422	unexpected status code received after sending a block with size 40449 bytes to "1:secret-url" during retry #21: 401; response body="{\"code\":16,\"error\":\"invalid API key\",\"message\":\"invalid API key\"}\n"; re-sending the block in 60.000 seconds  agentID=/agent_id/e19f9ad4-b27d-4fba-a821-1fb7302675a8 component=agent-process type=vm_agent

I do not know why, after upgrading the pmm-client, it cannot communicate with pmm-server correctly.
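For reference, the registration state on a client host can be checked with the standard pmm-admin commands (a sketch; output will of course differ per host):

# On a pmm-client host: show the agent's connection to pmm-server and per-agent status
pmm-admin status
# List the services and agents this node has registered
pmm-admin list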

By the way, the "database is locked" issue disappeared after I added the following option to /etc/grafana/grafana.ini:

[database]
connection_string=file:/srv/grafana/grafana.db?cache=private&mode=rwc&_journal_mode=WAL
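The connection string switches the SQLite journal to WAL mode, which allows readers and a writer to coexist. A sketch of how to apply and confirm the change (assuming the default /srv/grafana path and supervisord-managed Grafana):

# Restart Grafana so the new connection string takes effect
supervisorctl restart grafana
# Confirm WAL is active; should print "wal"
sqlite3 /srv/grafana/grafana.db 'PRAGMA journal_mode;'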

And after removing all pmm clients and reinstalling them, everything finally returned to normal.
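Roughly, the per-host cleanup and re-registration looks like the following. This is only a sketch with placeholder service names, credentials, and server URL, and it assumes the Percona repository is already configured on the hosts; the actual playbook tasks may differ:

# On each client host (placeholder names and credentials)
pmm-admin remove mysql mariadb-host-01
yum remove -y pmm2-client
yum install -y pmm2-client
# Point the agent at the server again, then re-add the MariaDB service
pmm-admin config --server-url=https://admin:admin@pmm-server:443 --server-insecure-tls
pmm-admin add mysql --username=pmm --password=secret mariadb-host-01 127.0.0.1:3306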

There might be other component updates in 2.39 that caused the problem. We do not test new builds on CentOS 6.

@Luke03011, thanks for the updates. You are currently experiencing an issue with SQLite, but rest assured that we have already found a solution in PMM 2.40, which is planned to be released next week.
