I do see a lot of "invalid API key" errors, and it appears that API key validation is failing when this happens. I started fresh with no containers or images and used the pmm.sh --interactive script. When this happens, Grafana CPU usage spikes, and when I try to pull up the page I get an error message… with the traceID displayed as an empty value.
When it happens, top shows grafana at over 100% CPU (sometimes spiking to 200%, which Grafana should not be doing):
Tasks: 404 total, 2 running, 402 sleeping, 0 stopped, 0 zombie
%Cpu(s): 5.8 us, 1.8 sy, 0.0 ni, 92.1 id, 0.1 wa, 0.0 hi, 0.2 si, 0.0 st
KiB Mem : 32761424 total, 24225252 free, 6363208 used, 2172964 buff/cache
KiB Swap: 2097148 total, 2097148 free, 0 used. 25988564 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
40224 libstor+ 20 0 17.1g 453956 37988 S 143.6 1.4 915:51.86 grafana server --homepath=/usr/share/grafana --config=/etc/grafana/grafana.ini cfg:default.paths.data=/srv/grafana cfg:default.paths.plugins=/srv/grafana/plugins cfg:default.paths.logs=/srv/logs cfg:default.log.mode=console cfg:default.log.console.format=console cfg:default.server.root_url=https://%(domain)s/graph
1332 1000 20 0 7079348 3.0g 323504 S 21.8 9.6 839:43.09 /usr/sbin/victoriametrics --promscrape.config=/etc/victoriametrics-promscrape.yml --retentionPeriod=90d --storageDataPath=/srv/victoriametrics/data --httpListenAddr=127.0.0.1:9090 --search.disableCache=true --search.maxQueryLen=1MB --search.latencyOffset=5s --search.maxUniqueTimeseries=100000000 --search.maxSamples+
3444 root 20 0 1147320 30076 14216 S 5.9 0.1 67:06.06 /opt/OV/bin/oacore oacore /var/opt/OV/conf/oa/PipeDefinitions/oacore.xml
40238 root 20 0 5738552 765164 28920 S 4.0 2.3 112:34.88 /usr/sbin/pmm-managed --victoriametrics-config=/etc/victoriametrics-promscrape.yml --victoriametrics-url=http://127.0.0.1:9090/prometheus --supervisord-config-dir=/etc/supervisord.d
40279 polkitd 20 0 85232 73904 14012 S 4.0 0.2 94:17.28 nginx: worker process
7007a8dd1535 percona/pmm-server:2.38.1 "/opt/entrypoint.sh" 2 days ago Up 2 days (healthy) 80/tcp, 0.0.0.0:443->443/tcp pmm-server
I am monitoring about 80 systems and 35 DBs, and running 5-6 of the template monitoring values.
Various stats (from when it happened; I know the times are not all the same, but I see items like this):
Inside the docker container in the grafana.log file I see this…
logger=context t=2023-08-10T13:54:23.628491498Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:23.63546723Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:23.95790173Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:24.698356168Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:24.698608952Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:24.698903676Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context userId=0 orgId=1 uname= t=2023-08-10T13:54:24.699320221Z level=error msg= error="database is locked" traceID=
logger=context t=2023-08-10T13:54:24.759263952Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:24.959405963Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:25.231586104Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:25.650994276Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:25.672431119Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:25.672727284Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:25.672899664Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:25.680013675Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:25.680299429Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:25.680677485Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:25.759563452Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:25.783932722Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:28.648180339Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:28.981170617Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:28.987501614Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:29.102181337Z level=error msg="invalid API key" error="database is locked" traceID=
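To get a feel for how dense these errors are, the lines can be bucketed per second. Below is a minimal Python sketch, not part of PMM or Grafana; the function name `count_errors_per_second` and the regular expression are my own assumptions based only on the log format shown above.

```python
import re
from collections import Counter

# Matches the t=<timestamp> field and the error="..." field of a Grafana
# console-format log line such as the ones quoted above.
LINE_RE = re.compile(r't=(?P<ts>\S+).*?error="(?P<err>[^"]*)"')


def count_errors_per_second(lines, needle="database is locked"):
    """Return a Counter mapping 'YYYY-MM-DDTHH:MM:SS' to the number of matching errors."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m and m.group("err") == needle:
            # Keep only whole seconds: drop the fractional part and the trailing 'Z'.
            second = m.group("ts").split(".")[0].rstrip("Z")
            counts[second] += 1
    return counts


# Three lines copied from the log excerpt above.
sample = [
    'logger=context t=2023-08-10T13:54:23.628491498Z level=error msg="invalid API key" error="database is locked" traceID=',
    'logger=context t=2023-08-10T13:54:23.63546723Z level=error msg="invalid API key" error="database is locked" traceID=',
    'logger=context t=2023-08-10T13:54:24.698356168Z level=error msg="invalid API key" error="database is locked" traceID=',
]

for second, n in sorted(count_errors_per_second(sample).items()):
    print(second, n)
```

Feeding it the whole grafana.log instead of `sample` would show whether the errors come in short bursts or continuously.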
On the host system I get:
dockerd: time="2023-08-10T09:54:02.540362887-04:00" level=warning msg="Health check for container 7007a8dd1535185db9bfd1c17c8dca68ca7da7d11e337be0b6ea87cd5ebc792e error: context deadline exceeded"
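The two logs use different time zones: grafana.log is in UTC (trailing Z), while dockerd logs local time with a -04:00 offset. To line them up, here is a small Python sketch that converts both to UTC. The helper `to_utc` is a name I made up for illustration; it assumes RFC 3339 style timestamps like the ones shown and truncates nanoseconds to microseconds.

```python
import re
from datetime import datetime, timezone

TS_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?(Z|[+-]\d{2}:\d{2})"
)


def to_utc(ts):
    """Parse an RFC 3339 style timestamp (up to nanosecond precision) into UTC."""
    m = TS_RE.fullmatch(ts)
    if m is None:
        raise ValueError(f"unrecognised timestamp: {ts!r}")
    base, frac, offset = m.groups()
    # datetime only stores microseconds, so truncate or pad the fraction to 6 digits.
    micros = (frac or "0")[:6].ljust(6, "0")
    offset = "+00:00" if offset == "Z" else offset
    parsed = datetime.strptime(f"{base}.{micros}{offset}", "%Y-%m-%dT%H:%M:%S.%f%z")
    return parsed.astimezone(timezone.utc)


# Timestamps copied from the dockerd line and the first grafana.log line quoted.
health = to_utc("2023-08-10T09:54:02.540362887-04:00")
first_error = to_utc("2023-08-10T13:54:23.628491498Z")

print(health.isoformat())
print(first_error.isoformat())
print((first_error - health).total_seconds())
```

With these two values, the health check timeout comes out roughly 21 seconds before the first grafana.log line quoted here. That line is only the first one I pasted, not necessarily the first error in the file.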
Note: the host system is sitting there not doing much until the spike happens.
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
7007a8dd1535 pmm-server 52.77% 4.668GiB / 31.24GiB 14.94% 140GB / 6.17GB 34.4GB / 391GB 1556
Linux 3.10.0-1160.92.1.el7.x86_64 (xxxxx) 08/10/2023 _x86_64_ (12 CPU)
12:00:02 AM CPU %user %nice %system %iowait %steal %idle
12:10:01 AM all 6.40 0.00 1.97 0.19 0.00 91.45
12:20:01 AM all 6.22 0.00 1.92 0.17 0.00 91.69
12:30:01 AM all 5.90 0.00 1.86 0.15 0.00 92.09
12:40:01 AM all 5.81 0.00 1.82 0.14 0.00 92.23
12:50:01 AM all 5.92 0.00 1.83 0.16 0.00 92.09
01:00:01 AM all 5.95 0.00 1.85 0.18 0.00 92.02
01:10:01 AM all 6.58 0.00 2.07 0.23 0.00 91.12
01:20:01 AM all 6.02 0.00 1.91 0.15 0.00 91.92
01:30:01 AM all 5.70 0.00 1.80 0.15 0.00 92.36
01:40:01 AM all 5.76 0.00 1.80 0.15 0.00 92.29
01:50:01 AM all 6.23 0.00 1.88 0.15 0.00 91.74
02:00:02 AM all 5.96 0.00 1.87 0.18 0.00 91.99
02:10:01 AM all 6.29 0.00 1.95 0.22 0.00 91.54
02:20:01 AM all 5.82 0.00 1.83 0.15 0.00 92.19
02:30:01 AM all 5.97 0.00 1.86 0.16 0.00 92.01
02:40:01 AM all 5.93 0.00 1.87 0.20 0.00 92.00
02:50:01 AM all 6.00 0.00 1.88 0.16 0.00 91.95
03:00:01 AM all 6.09 0.00 1.88 0.16 0.00 91.87
03:10:01 AM all 6.55 0.00 2.04 0.24 0.00 91.17
03:20:01 AM all 5.88 0.00 1.90 0.16 0.00 92.05
03:30:01 AM all 5.79 0.00 1.82 0.15 0.00 92.24
03:40:01 AM all 5.79 0.00 1.82 0.15 0.00 92.24
03:50:01 AM all 5.70 0.01 1.81 0.16 0.00 92.32
04:00:01 AM all 6.11 0.00 1.83 0.17 0.00 91.89
04:10:01 AM all 6.13 0.00 1.91 0.22 0.00 91.74
04:20:01 AM all 5.66 0.00 1.79 0.15 0.00 92.40
04:30:01 AM all 5.65 0.00 1.76 0.14 0.00 92.45
04:40:01 AM all 5.71 0.00 1.78 0.19 0.00 92.32
04:50:01 AM all 5.64 0.00 1.76 0.15 0.00 92.46
05:00:01 AM all 5.56 0.00 1.76 0.16 0.00 92.52
05:10:01 AM all 6.15 0.00 1.84 0.20 0.00 91.80