Description:
After upgrading from 2.37.1 to 2.39.0, every dashboard refresh makes the system load rise dramatically and the CPU usage of the grafana component becomes very high. The web UI then gets stuck and returns errors because it cannot get data from pmm-server in time.
Steps to Reproduce:
Upgrade to 2.39.0; the problem appears right after the upgrade.
Version:
2.39.0. The OS is CentOS 7.9 (pmm-server) and CentOS 6.10 (pmm-agent).
Logs:
Expected Result:
The pmm-server works normally
Hi @Luke03011 .
It appears that your issues are similar to those reported in Percona JIRA ticket PMM-12415, "PMM dashboard has a high CPU load and UI is unresponsive after adding 100 Servers".
We are looking at some potential fixes.
Please note that your client is running on an unsupported operating system (CentOS 6), which may cause issues.
updates:
After upgrading pmm-server to 2.39.1, I tried to add a batch of MariaDB servers to it. I don't think the number is large (about 90 servers). I used Ansible for this and found that, starting from the 79th server, I could no longer register the MariaDB service with pmm-server. The Ansible playbook reported Internal server error.
Then I hit the Grafana issue described above: high CPU usage and high system load (it can reach 800, which is insane). I looked into the logs of the grafana component and found the following:
logger=context t=2023-09-26T09:51:42.600647342Z level=error msg="invalid API key" error="context canceled" traceID=
logger=context t=2023-09-26T09:51:42.600671693Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-09-26T09:51:42.600662537Z level=error msg="invalid username or password" error="database is locked" traceID=
logger=context userId=0 orgId=0 uname= t=2023-09-26T09:51:42.601109018Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=22580 duration=22.580653301s size=67 referer=
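To see which error dominates in a log like the one above, a small filter can group the error lines by their error="..." value. This is only a sketch: the function name summarize_grafana_errors is made up for illustration, and the location of the Grafana log inside the pmm-server container is an assumption you need to check on your own install.

```shell
# Summarize Grafana error lines by their error="..." value.
# Reads log text on stdin and prints "<count> error=..." lines, most frequent first.
# The function name is hypothetical; it is not part of PMM or Grafana.
summarize_grafana_errors() {
    grep 'level=error' \
        | grep -o 'error="[^"]*"' \
        | sort \
        | uniq -c \
        | sort -rn
}
```

Usage would be something like `summarize_grafana_errors < /path/to/grafana.log`, where the path is whatever file your Grafana writes to. Lines without an error="..." field (such as the "Request Completed" line) are simply not counted.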
It seems Grafana's sqlite3 database is locked (and I suspect this is related to the concurrent registration operations above).
I tried restarting the pmm-server Docker container and rebooting the host running it; neither helped.
I searched Google and found a lot of similar cases. I tried backing up the sqlite3 database and replacing the original with the backup, but still no luck:
# First stop the grafana
supervisorctl stop grafana
# backup the database to a new file
cd /srv/grafana
sqlite3 grafana.db '.backup grafana-new.db'
# switch the two files
mv grafana.db grafana-old.db && mv grafana-new.db grafana.db
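Before or after swapping the files as above, it may help to confirm that the database file itself is not corrupted, so that a lock problem is not confused with a damaged file. The sketch below assumes the sqlite3 command-line tool is available (as it is in the steps above); the function name check_sqlite_db is hypothetical.

```shell
# Run SQLite's built-in integrity check against a database file.
# $1: path to the database file, e.g. grafana.db or the backup copy.
# Prints "ok" when SQLite finds no corruption; prints the problems otherwise.
# The function name is hypothetical; only the sqlite3 CLI and the PRAGMA are real.
check_sqlite_db() {
    db="$1"
    if [ ! -f "$db" ]; then
        echo "no such file: $db" >&2
        return 1
    fi
    sqlite3 "$db" 'PRAGMA integrity_check;'
}
```

For example, `check_sqlite_db /srv/grafana/grafana.db` while grafana is stopped. An "ok" result suggests the file is intact and the errors come from lock contention rather than corruption.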
We use both CentOS 6.10 and CentOS 7.9, and before the upgrade to 2.39.0 they worked fine.
update:
The high load seems to be caused by invalid API key (username and password) errors coming from the pmm-client side. I checked the logs on the pmm-client side:
INFO[2023-09-27T01:17:51.297+00:00] 2023-09-27T01:17:51.297Z error VictoriaMetrics/app/vmagent/remotewrite/client.go:422 unexpected status code received after sending a block with size 40436 bytes to "1:secret-url" during retry #21: 401; response body="{\"code\":16,\"error\":\"invalid API key\",\"message\":\"invalid API key\"}\n"; re-sending the block in 60.000 seconds agentID=/agent_id/e19f9ad4-b27d-4fba-a821-1fb7302675a8 component=agent-process type=vm_agent
INFO[2023-09-27T01:17:51.299+00:00] 2023-09-27T01:17:51.299Z error VictoriaMetrics/app/vmagent/remotewrite/client.go:422 unexpected status code received after sending a block with size 40449 bytes to "1:secret-url" during retry #21: 401; response body="{\"code\":16,\"error\":\"invalid API key\",\"message\":\"invalid API key\"}\n"; re-sending the block in 60.000 seconds agentID=/agent_id/e19f9ad4-b27d-4fba-a821-1fb7302675a8 component=agent-process type=vm_agent
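To get a feel for how often a client is being rejected, the vmagent lines like the ones above can be counted and the latest retry number pulled out. This is a rough sketch; the function names are made up, and where pmm-agent writes its log depends on how it is run on your host.

```shell
# Count vmagent remote-write attempts rejected with "invalid API key".
# Reads pmm-agent log text on stdin and prints a single number.
# Both function names are hypothetical helpers, not PMM commands.
count_vmagent_auth_failures() {
    grep 'type=vm_agent' | grep -c 'invalid API key'
}

# Print the most recent "retry #N" value seen in the log on stdin.
last_vmagent_retry() {
    grep -o 'during retry #[0-9]*' | tail -n 1 | grep -o '[0-9]*$'
}
```

Feed each function the pmm-agent log from one client, for example `count_vmagent_auth_failures < pmm-agent.log`. A count that keeps growing, with the retry number climbing, means the client is still resending with credentials the server rejects.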
I do not know why, after upgrading the pmm-client, it cannot communicate with pmm-server correctly.
By the way, the database locked issue disappeared after I added the following option to /etc/grafana/grafana.ini:
[database]
connection_string=file:/srv/grafana/grafana.db?cache=private&mode=rwc&_journal_mode=WAL
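One way to confirm the setting took effect is to ask SQLite which journal mode the file is in. WAL mode is stored in the database file, so once Grafana has opened it with the connection string above, the file should report wal. A minimal sketch, assuming the sqlite3 CLI is installed; the function name show_journal_mode is hypothetical.

```shell
# Print the journal mode of a SQLite database file ("wal", "delete", ...).
# $1: path to the database file.
# The function name is hypothetical; PRAGMA journal_mode is standard SQLite.
show_journal_mode() {
    sqlite3 "$1" 'PRAGMA journal_mode;'
}
```

For example, `show_journal_mode /srv/grafana/grafana.db` after restarting grafana. WAL lets readers proceed while a writer is active, which is probably why the lock errors stopped.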
After I removed all PMM clients and reinstalled them, everything finally returned to normal.
There might be other component updates in 2.39 that caused the problem. We do not test new builds on CentOS 6.
@Luke03011, thanks for the updates. You are currently experiencing an issue with SQLite, but rest assured that we have already found a solution in pmm 2.40, which is planned to be released next week.