Grafana API keys go invalid intermittently

Description:

Dear Percona community,
I am seeing a strange issue in PMM Server 2.39. Sometimes I see tons of messages like these:

logger=context t=2023-12-04T14:34:40.147436769Z level=error msg="invalid API key" error="invalid API key" traceID=
logger=context userId=0 orgId=0 uname= t=2023-12-04T14:34:40.148321642Z level=info msg="Request Completed" method=GET path=/api/auth/key status=401 remote_addr=127.0.0.1 time_ms=1 duration=1.154743ms size=67 referer=
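To gauge how often this happens, the log lines above can be counted per minute. The following is only a minimal sketch: it assumes the log is in the key=value format shown above, and the function names `parse_logfmt` and `count_auth_failures` are made up for illustration and are not part of PMM or Grafana.

```python
import re
from collections import Counter

# Matches key=value pairs; a value is either double-quoted or a run of
# non-space characters (possibly empty, as in "uname=" above).
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S*)')


def parse_logfmt(line):
    """Split one key=value log line into a dict of strings."""
    return {key: value.strip('"') for key, value in _PAIR.findall(line)}


def count_auth_failures(lines):
    """Count 401 responses on /api/auth/key, grouped by minute."""
    per_minute = Counter()
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("path") == "/api/auth/key" and fields.get("status") == "401":
            # "t" looks like 2023-12-04T14:34:40.148321642Z; keep up to the minute.
            per_minute[fields.get("t", "")[:16]] += 1
    return per_minute
```

Feeding it the lines of the log file (for example via `open(path)`) gives a rough picture of when the failures start and how bursty they are.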

When it happens, Grafana goes out of service because it can’t handle the frequent, recurring requests from the agents. top shows Grafana consuming 100 to 200% CPU. To mitigate the issue, I had to edit the configuration file /usr/local/percona/pmm2/config/pmm-agent.yaml and change .server.password to the plain PMM admin password instead of the base64-encoded API key. I had to do that for each of my roughly 150 agents.
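With this many agents it helps to know which ones still carry an API key in .server.password. The following is a rough sketch only. It assumes that a Grafana API key is a base64-encoded JSON object with the fields "k", "n" and "id", which is how the legacy keys I have seen are built; the helper name `looks_like_grafana_api_key` is hypothetical.

```python
import base64
import binascii
import json


def looks_like_grafana_api_key(value):
    """Guess whether a password value is a base64-encoded Grafana API key.

    Assumption: the key decodes to a JSON object with "k", "n" and "id".
    Anything that fails to decode is treated as a plain password.
    """
    try:
        decoded = base64.b64decode(value, validate=True)
        payload = json.loads(decoded)
    except (binascii.Error, ValueError):
        return False
    return isinstance(payload, dict) and {"k", "n", "id"} <= payload.keys()
```

The value to check would be read from .server.password in each agent's pmm-agent.yaml; how it is read (YAML parser, configuration management tool) is left open here.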

This kind of incident has recurred several times with newly registered agents. I couldn’t find out what caused the API keys to become invalid, but I would like to prevent it from happening again.
So could you please help me answer the following questions?

  • What can cause Grafana API keys to become invalid?
  • Which metric expressions could be used to alert when pmm-agent has an invalid API key and is unable to connect to the server?
  • Should I give up on API keys completely and go back to using the plain admin password in the pmm-agent configuration?

Steps to Reproduce:

  1. Deploy pmm-server into Kubernetes with /srv directory mounted to persistent storage.
  2. Register multiple agents with the command /usr/sbin/pmm-admin config --server-url "https://${PMM_USERNAME}:${PMM_PASSWORD}@${PMM_HOST}:${PMM_PORT}" --force.
  3. Unclear what happens next.

Version:

pmm-server 2.39
pmm-agent 2.39

Hello @Grigoriy_Frolov, please update to the latest PMM. This problem was caused by the use of SQLite in PMM, and we fixed it recently.

  • Which metric expressions could be used for alerting on the cases when pmm-agent has an invalid API key and is unable to connect to the server?
    We have an alert template for pmm-agent being down, so you can use that.
  • Should I completely give up using API keys and go back to using the plain admin password in pmm-agent configuration?
    No, please don’t give up on API keys; use the latest PMM instead.

Thanks, Nurlan. I will update my PMM deployment to 2.40.1 today and keep an eye on it.