After Adding 100 Servers, PMM Load Is Too High

Hi Team,

After adding 100 servers to PMM, we are seeing a high CPU load, and the GUI is not responsive. Can someone please help us resolve this issue?


Hello, could you please share the hardware specs of the PMM server host? What kind of instances have you added to PMM — are they RDS or EC2? And could you please share the PMM server's own graphs from the PMM dashboard? Thanks.

Hi @Mughees_Ahmed

Thanks for the reply.

Installation Type: Docker
CPU: 16
RAM: 32 GB
OS: Rocky Linux 8
MariaDB DB servers: All are on-premise Linux servers
Service Running: All are MariaDB servers.

I am unable to check the PMM dashboards; we are seeing only this page.


I have noticed the following errors in Grafana's log:

  1. error="context canceled" traceID=
  2. error="database is locked" traceID=

logger=context t=2023-08-11T04:34:35.701564308Z level=error msg="invalid API key" error="context canceled" traceID=
logger=context userId=0 orgId=0 uname= t=2023-08-11T04:34:35.701966774Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=5744 duration=5.744224386s size=67 referer=
logger=context userId=0 orgId=1 uname= t=2023-08-11T04:34:35.70263873Z level=error msg= error="context canceled" traceID=
logger=context userId=0 orgId=1 uname= t=2023-08-11T04:34:35.702951532Z level=info msg="Request Completed" method=GET path=/api/auth/key status=403 remote_addr=127.0.0.1 time_ms=5043 duration=5.04309565s size=39 referer=
logger=context userId=0 orgId=1 uname= t=2023-08-11T04:34:35.704109717Z level=error msg= error="database is locked" traceID=
logger=context userId=0 orgId=1 uname= t=2023-08-11T04:34:35.704318126Z level=warn msg="failed to update last use date for api key" id=79
logger=context userId=0 orgId=1 uname= t=2023-08-11T04:34:35.704478596Z level=info msg="Request Completed" method=GET path=/api/auth/key status=403 remote_addr=127.0.0.1 time_ms=6658 duration=6.658097051s size=39 referer=
logger=context userId=0 orgId=1 uname= t=2023-08-11T04:34:35.731380264Z level=warn msg="failed to update last use date for api key" id=61
logger=context t=2023-08-11T04:34:35.749897242Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context userId=0 orgId=0 uname= t=2023-08-11T04:34:35.750253504Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=5005 duration=5.005339383s size=67 referer=
logger=context t=2023-08-11T04:34:35.752378549Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context userId=0 orgId=0 uname= t=2023-08-11T04:34:35.752764502Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=5007 duration=5.007008897s size=67 referer=
logger=context t=2023-08-11T04:34:35.756589534Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context userId=0 orgId=0 uname= t=2023-08-11T04:34:35.756952678Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=5009 duration=5.009489243s size=67 referer=
logger=context t=2023-08-11T04:34:35.761626423Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context userId=0 orgId=0 uname= t=2023-08-11T04:34:35.761950738Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=5016 duration=5.016206214s size=67 referer=
logger=context t=2023-08-11T04:34:35.783436Z level=error msg="invalid API key" error="context canceled" traceID=
logger=context t=2023-08-11T04:34:35.783599192Z level=error msg="invalid API key" error="context canceled" traceID=
logger=context userId=0 orgId=0 uname= t=2023-08-11T04:34:35.783839591Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=4132 duration=4.132619051s size=67 referer=
logger=context userId=0 orgId=0 uname= t=2023-08-11T04:34:35.78394484Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=4462 duration=4.462103618s size=67 referer=
logger=context t=2023-08-11T04:34:35.784610207Z level=error msg="invalid API key" error="context canceled" traceID=
logger=context t=2023-08-11T04:34:35.784765303Z level=error msg="invalid API key" error="context canceled" traceID=
logger=context userId=0 orgId=0 uname= t=2023-08-11T04:34:35.784943966Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=4464 duration=4.464013022s size=67 referer=
logger=context userId=0 orgId=0 uname= t=2023-08-11T04:34:35.785122005Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=4951 duration=4.951649648s size=67

We also encountered the same issue. It seems to be related to the fact that an SQLite database is used for Grafana, and it will probably be fixed in an upcoming release: [PMM-4466] Migrate Grafana from using SQLite to PostgreSQL - Percona JIRA, https://jira.percona.com/projects/PMM/issues/PMM-12173

One workaround that helped us was to enable WAL mode in SQLite, like this:

If running PMM with Docker, then:

docker exec -it CONTAINER_NAME bash
cd into the Grafana data directory and execute this:
sqlite3 grafana.db 'pragma journal_mode=wal;'

After that, restart Grafana (inside the container: supervisorctl restart grafana) or the whole PMM container.
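For reference, the effect of that pragma can be demonstrated on a throwaway SQLite database (the path below is a scratch file for illustration, not the real Grafana database inside the container):

```shell
#!/bin/sh
# Demo: switch an SQLite database to write-ahead logging (WAL).
# /tmp/grafana-demo.db is a scratch file standing in for grafana.db.
db=/tmp/grafana-demo.db
rm -f "$db" "$db-wal" "$db-shm"

# Create a trivial table so the database file exists.
sqlite3 "$db" 'CREATE TABLE demo (id INTEGER);'

# Enable WAL; sqlite3 prints the resulting journal mode ("wal").
sqlite3 "$db" 'PRAGMA journal_mode=wal;'

# WAL mode is stored in the database file itself, so it persists
# across new connections -- this prints "wal" again.
sqlite3 "$db" 'PRAGMA journal_mode;'
```

WAL helps here because it allows readers and a writer to work concurrently, which reduces the chance of "database is locked" errors under the request load that many monitored nodes generate.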

@Lauri Thanks for the details.

How many nodes did you add to the PMM server?

We have added ~500 nodes.

@Lauri I'm not sure I understand — why would SQLite DB locks cause such a high CPU load?

So after executing the command below, did the CPU load come down for you, or is the CPU still high?

sqlite3 grafana.db 'pragma journal_mode=wal;'

I do not know why the load went up because of the SQLite locks, but using WAL definitely helped: no more "database is locked" messages in the logs, and the PMM server CPU load also went down after some time.

@Lauri But for me, the issue still exists. I tried adding the servers to PMM again, and I am facing the same problem. :smiling_face_with_tear:

Do you still see "database is locked" errors in the Grafana log?

And if you upgraded PMM in the meantime, maybe you have to apply the SQLite workaround command again?

Yeah @Lauri, I set it to WAL only after the upgrade.

bash-5.1# sqlite3 grafana.db 'PRAGMA journal_mode;'
wal
bash-5.1# supervisorctl restart grafana
grafana: stopped
grafana: started
bash-5.1#

@Lauri I can still see the database locks in the Grafana log file.

@Lauri Can you please share the below details of your PMM server?

  1. CPU
  2. RAM
  3. Physical server or VM Server
  4. PMM server version
  5. Data retention: 30 days or 60 days?
  6. Slow log enabled?
  7. Table stats enabled?
  8. OS
  9. Any other system resource details you can share?

Hey @Naresh9999! Have you checked this comment: Database is Locked pmm 2.38.1 - #25 by meyerder? Maybe you also have different Grafana databases in the container?
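One way to check for that is to search the container's filesystem for every grafana.db. The sketch below demonstrates the idea against a scratch directory tree (the paths are made up for the demo; inside the real container you would run the find against /):

```shell
#!/bin/sh
# Demo: locate every grafana.db under a directory tree.
# /tmp/pmm-demo stands in for the container filesystem; the two
# subdirectories are hypothetical locations, created just for the demo.
root=/tmp/pmm-demo
rm -rf "$root"
mkdir -p "$root/srv/grafana" "$root/usr/share/grafana/data"
touch "$root/srv/grafana/grafana.db" "$root/usr/share/grafana/data/grafana.db"

# If this prints more than one path, Grafana may be using a different
# database file than the one you patched with the WAL pragma.
find "$root" -name grafana.db
```

If multiple copies turn up, check which one Grafana actually opens (its config/data directory) and apply the journal_mode change to that file.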

Our PMM server is on a VM, version 2.38.1, CPU: 16, RAM: 64 GB, retention 30 days, slow logs enabled, table stats disabled, OS: Ubuntu 20.

@Lauri Not sure, let me check it.

Thanks for all the details.