Description:
Dear Percona community,
I'm seeing strange issues in PMM Server 2.39. Sometimes I see tons of messages like this:
logger=context t=2023-12-04T14:34:40.147436769Z level=error msg="invalid API key" error="invalid API key" traceID=
logger=context userId=0 orgId=0 uname= t=2023-12-04T14:34:40.148321642Z level=info msg="Request Completed" method=GET path=/api/auth/key status=401 remote_addr=127.0.0.1 time_ms=1 duration=1.154743ms size=67 referer=
When this happens, Grafana is effectively out of service, as it can't keep up with the frequent recurring requests from the agents; top shows the Grafana process consuming 100 to 200 %CPU. To mitigate the issue I had to edit the configuration file /usr/local/percona/pmm2/config/pmm-agent.yaml and change server.password from the base64-encoded API key to the plain PMM admin password. That had to be done on each of the roughly 150 agents I have.
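For reference, the relevant part of the agent config after the change looked roughly like this (field names as I understand them, values are placeholders; only the relevant section is shown):

```yaml
# /usr/local/percona/pmm2/config/pmm-agent.yaml (excerpt, sketch)
server:
  address: pmm.example.com:443       # placeholder for the real server address
  username: admin
  password: <plain admin password>   # was: base64-encoded API key
```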
This kind of incident has repeated several times with newly registered agents. I couldn't find what caused the API keys to become invalid, but I would like to prevent these situations in the future.
So could you please help me answer the following questions?
- What can cause Grafana API tokens to become invalid?
- Which metric expressions could be used to alert on cases where pmm-agent has an invalid API key and is unable to connect to the server?
- Should I give up on API keys entirely and go back to using the plain admin password in the pmm-agent configuration?
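For the second question, this is a rough sketch of the kind of rule I had in mind; the expression uses the standard "up" metric as a proxy for agent connectivity, since I don't know which PMM metric (if any) directly reflects an authentication failure:

```yaml
# Hypothetical alerting rule sketch: fire when an exporter target stops
# being scraped, which is one visible symptom of an agent that can no
# longer authenticate. Labels and thresholds are assumptions.
groups:
  - name: pmm-agent-auth
    rules:
      - alert: AgentTargetDown
        expr: up == 0
        for: 5m
        annotations:
          summary: "Target {{ $labels.instance }} is not being scraped"
```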
Steps to Reproduce:
- Deploy pmm-server into Kubernetes with the /srv directory mounted on persistent storage.
- Register multiple agents with the command:
  /usr/sbin/pmm-admin config --server-url "https://${PMM_USERNAME}:${PMM_PASSWORD}@${PMM_HOST}:${PMM_PORT}" --force
- It is unclear what happens after that to invalidate the keys.
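In case it helps anyone reproduce the 401, this is roughly how I checked whether a stored key was still accepted. The key value below is a made-up placeholder, and the endpoint path is my assumption based on the log lines above (Grafana sits behind the /graph/ prefix in PMM):

```shell
# The agent stores the API key base64-encoded in server.password;
# decode it first. This example value decodes to "pmm-agent-key".
ENCODED="cG1tLWFnZW50LWtleQ=="
API_KEY=$(echo "$ENCODED" | base64 -d)
echo "$API_KEY"

# Then probe Grafana with it; expect 200 if the key is valid, 401 if not.
# (Commented out here because host and path are assumptions.)
# curl -ks -o /dev/null -w '%{http_code}\n' \
#   -H "Authorization: Bearer $API_KEY" \
#   "https://$PMM_HOST/graph/api/auth/key"
```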
Version:
pmm-server 2.39
pmm-agent 2.39