Description:
Like other users, we are seeing problems with SQLite and Grafana in PMM 2.39 (monitoring MySQL): PMM crashes at random with a 503 Service Unavailable error, followed by traceID errors. Many other users appear to be hitting the same issue. It has not recurred since we changed the SQLite journal_mode to WAL, and we are continuing to monitor.
Below are the links for reference:
Percona issue link: Database is Locked pmm 2.38.1 - #8 by Naresh9999
After Adding 100 Servers PMM Load is Too High
After upgrade to 2.39.0, the grafana takes too much CPU - #5 by Luke03011
Percona Jira link: [PMM-12415] PMM dashboard has a high CPU load and UI is unresponsive after adding 100 Servers. - Percona JIRA
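For anyone hitting the same problem, the journal_mode change mentioned above can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the exact procedure we ran: the helper names enable_wal and current_mode are made up for illustration, the demo runs against a throwaway database file, and the location of Grafana's SQLite file inside the PMM server (assumed here to be /srv/grafana/grafana.db) should be verified on your own installation. Stop Grafana and back up the file before touching the real database.

```python
import os
import sqlite3
import tempfile


def current_mode(db_path: str) -> str:
    """Return the journal mode currently in effect for a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        # PRAGMA journal_mode with no argument only reports the mode.
        return conn.execute("PRAGMA journal_mode;").fetchone()[0]
    finally:
        conn.close()


def enable_wal(db_path: str) -> str:
    """Switch a SQLite file to write-ahead logging and return the resulting mode."""
    # timeout: how long to wait on a locked database before giving up.
    conn = sqlite3.connect(db_path, timeout=30)
    try:
        # The pragma returns one row holding the mode actually applied.
        # WAL is stored in the database file, so it persists across connections.
        return conn.execute("PRAGMA journal_mode=WAL;").fetchone()[0]
    finally:
        conn.close()


if __name__ == "__main__":
    # Demo on a temporary database. For the real thing, the path would be
    # Grafana's SQLite file (assumed: /srv/grafana/grafana.db; verify first).
    with tempfile.TemporaryDirectory() as tmp:
        demo_db = os.path.join(tmp, "demo.db")
        conn = sqlite3.connect(demo_db)
        conn.execute("CREATE TABLE api_key_demo (id INTEGER PRIMARY KEY)")
        conn.commit()
        conn.close()

        print("before:", current_mode(demo_db))
        print("set to:", enable_wal(demo_db))
        print("after: ", current_mode(demo_db))
```

WAL lets readers proceed while a single writer is active, which is why it tends to reduce "database is locked" errors; it does not remove the one-writer-at-a-time limit of SQLite.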
Steps to Reproduce:
We started seeing the issue once the total number of onboarded MySQL servers exceeded 110.
Version:
PMM 2.39
Logs:
logger=context userId=0 orgId=0 uname= t=2023-09-28T10:31:53.778351685Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=8160 duration=8.160297511s size=67 referer=
logger=context t=2023-09-28T10:31:53.778486655Z level=error msg="invalid API key" error="context canceled" traceID=
logger=context userId=0 orgId=0 uname= t=2023-09-28T10:31:53.778665025Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=5031 duration=5.031578726s size=67 referer=
logger=context userId=0 orgId=1 uname= t=2023-09-28T10:31:53.778753786Z level=warn msg="failed to update last use date for api key" id=94
logger=context userId=0 orgId=1 uname= t=2023-09-28T10:31:53.779290502Z level=error msg= error="context canceled" traceID=
logger=context userId=0 orgId=1 uname= t=2023-09-28T10:31:53.779452903Z level=info msg="Request Completed" method=GET path=/api/auth/key status=403 remote_addr=127.0.0.1 time_ms=8177 duration=8.177843464s size=39 referer=
logger=context userId=0 orgId=1 uname= t=2023-09-28T10:31:53.661641609Z level=warn msg="failed to update last use date for api key" id=31
logger=context t=2023-09-28T10:31:53.662593353Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context userId=0 orgId=0 uname= t=2023-09-28T10:31:53.66277235Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=5005 duration=5.005612613s size=67 referer=
Expected Result:
As per the PMM recommendations, PMM should be able to handle 1000+ onboarded servers without crashing.
Our server is generously sized: a 32-core CPU, 128 GB of RAM, and a 6 TB SSD.
Questions:
When will PMM 2.40 be released?
Will PMM 2.40 have an HA feature?
Will the above crash issue be fixed in the PMM 2.40 version?
We already have 100+ clients onboarded to PMM 2.39, which runs in a VM and holds a lot of data. If we upgrade to PMM 2.40 to fix the above issue, how do we migrate the existing data from 2.39 to 2.40?
If we upgrade the PMM server from 2.39 to 2.40, is it mandatory to upgrade all the PMM clients to 2.40 as well, or is PMM server 2.40 compatible with older client versions?