Database is locked (PMM 2.38.1)

I have been doing some reading and cannot seem to figure this out.
I have a fresh Docker image installed, with about 70 systems being monitored and about 35 MySQL databases.

I am seeing log output saying "database is locked". I think it is referring to the SQLite DB being locked, and I see items are being moved to Postgres (from my reading I thought 2.38 moved to PG instead of SQLite, but it appears it has not).

Once the DB goes into a locked state, the only fix I have found so far is to recycle the Docker container.

Has anyone else noticed this or figured out exactly what is causing this? (or suggestions?)

Thank you

Where are you seeing this? i.e. what log are you looking at?

The SQLite to PG migration was pulled from the last release due to some corner-case migration issues.
But the only thing the SQLite DB is used for is Grafana configuration settings… so for that to be locked I’d have to suspect a runaway API process, or perhaps a stale lock that was present at the time of upgrade, wasn’t removed, and now the new process can’t remove it… but those are just wild guesses.
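
If you want to rule out a stale lock, something like this should show it (just a sketch; it assumes the container is named pmm-server and that Grafana’s SQLite file lives in its data directory, /srv/grafana, per the grafana process arguments later in the thread):

docker exec -it pmm-server ls -l /srv/grafana/grafana.db*

A grafana.db-journal (or -wal/-shm) file that never goes away, combined with a busy grafana process, would point at a lock that is not being released.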

Any more info you can share?

I do see that I have a bunch of invalid API keys, and it appears that API key validation is failing when this happens. I started fresh with no containers/images and used the pmm.sh --interactive script. When this happens, Grafana CPU usage spikes, and when I attempt to pull up the page I get an error message with an empty traceID displayed.

When it happens, top shows grafana at > 100% CPU (sometimes spiking to 200%, which Grafana should not be doing).

Tasks: 404 total,   2 running, 402 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.8 us,  1.8 sy,  0.0 ni, 92.1 id,  0.1 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem : 32761424 total, 24225252 free,  6363208 used,  2172964 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used. 25988564 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 40224 libstor+  20   0   17.1g 453956  37988 S  143.6  1.4 915:51.86 grafana server --homepath=/usr/share/grafana --config=/etc/grafana/grafana.ini cfg:default.paths.data=/srv/grafana cfg:default.paths.plugins=/srv/grafana/plugins cfg:default.paths.logs=/srv/logs cfg:default.log.mode=console cfg:default.log.console.format=console cfg:default.server.root_url=https://%(domain)s/graph
  1332 1000      20   0 7079348   3.0g 323504 S  21.8  9.6 839:43.09 /usr/sbin/victoriametrics --promscrape.config=/etc/victoriametrics-promscrape.yml --retentionPeriod=90d --storageDataPath=/srv/victoriametrics/data --httpListenAddr=127.0.0.1:9090 --search.disableCache=true --search.maxQueryLen=1MB --search.latencyOffset=5s --search.maxUniqueTimeseries=100000000 --search.maxSamples+
  3444 root      20   0 1147320  30076  14216 S   5.9  0.1  67:06.06 /opt/OV/bin/oacore oacore /var/opt/OV/conf/oa/PipeDefinitions/oacore.xml
 40238 root      20   0 5738552 765164  28920 S   4.0  2.3 112:34.88 /usr/sbin/pmm-managed --victoriametrics-config=/etc/victoriametrics-promscrape.yml --victoriametrics-url=http://127.0.0.1:9090/prometheus --supervisord-config-dir=/etc/supervisord.d
 40279 polkitd   20   0   85232  73904  14012 S   4.0  0.2  94:17.28 nginx: worker process


7007a8dd1535   percona/pmm-server:2.38.1   "/opt/entrypoint.sh"   2 days ago   Up 2 days (healthy)   80/tcp, 0.0.0.0:443->443/tcp   pmm-server

I am watching about 80 systems and 35 DBs, and running 5-6 of the template monitoring dashboards. Various stats from around when it happened (I know the times are not all the same, but I see items like this):

Inside the Docker container, in the grafana.log file, I see this…

logger=context t=2023-08-10T13:54:23.628491498Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:23.63546723Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:23.95790173Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:24.698356168Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:24.698608952Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:24.698903676Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context userId=0 orgId=1 uname= t=2023-08-10T13:54:24.699320221Z level=error msg= error="database is locked" traceID=
logger=context t=2023-08-10T13:54:24.759263952Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:24.959405963Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:25.231586104Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:25.650994276Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:25.672431119Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:25.672727284Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:25.672899664Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:25.680013675Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:25.680299429Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:25.680677485Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:25.759563452Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:25.783932722Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:28.648180339Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:28.981170617Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:28.987501614Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context t=2023-08-10T13:54:29.102181337Z level=error msg="invalid API key" error="database is locked" traceID=
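
(A rough way to see how bursty these lock errors are, for anyone who wants to line them up with the CPU spikes; this is just a sketch and assumes the grafana.log referenced above sits at /srv/logs/grafana.log inside the container:

docker exec -it pmm-server bash -c "grep 'database is locked' /srv/logs/grafana.log | cut -dT -f2 | cut -d: -f1-2 | sort | uniq -c | tail"

It prints a per-minute count of the lock errors for the most recent minutes.)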

On the host system I get

dockerd: time="2023-08-10T09:54:02.540362887-04:00" level=warning msg="Health check for container 7007a8dd1535185db9bfd1c17c8dca68ca7da7d11e337be0b6ea87cd5ebc792e error: context deadline exceeded"
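
(To see exactly which health-check probe timed out, Docker keeps the last few results on the container; a sketch, with the container name pmm-server taken from the docker ps output above:

docker inspect --format '{{json .State.Health}}' pmm-server | python -m json.tool

Each entry shows the probe's start/end time, exit code, and output.)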

Note: the host system is sitting there not doing much until the spike happens.

CONTAINER ID   NAME         CPU %     MEM USAGE / LIMIT     MEM %     NET I/O          BLOCK I/O        PIDS
7007a8dd1535   pmm-server   52.77%    4.668GiB / 31.24GiB   14.94%    140GB / 6.17GB   34.4GB / 391GB   1556

Linux 3.10.0-1160.92.1.el7.x86_64 (xxxxx)   08/10/2023      _x86_64_        (12 CPU)

12:00:02 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
12:10:01 AM     all      6.40      0.00      1.97      0.19      0.00     91.45
12:20:01 AM     all      6.22      0.00      1.92      0.17      0.00     91.69
12:30:01 AM     all      5.90      0.00      1.86      0.15      0.00     92.09
12:40:01 AM     all      5.81      0.00      1.82      0.14      0.00     92.23
12:50:01 AM     all      5.92      0.00      1.83      0.16      0.00     92.09
01:00:01 AM     all      5.95      0.00      1.85      0.18      0.00     92.02
01:10:01 AM     all      6.58      0.00      2.07      0.23      0.00     91.12
01:20:01 AM     all      6.02      0.00      1.91      0.15      0.00     91.92
01:30:01 AM     all      5.70      0.00      1.80      0.15      0.00     92.36
01:40:01 AM     all      5.76      0.00      1.80      0.15      0.00     92.29
01:50:01 AM     all      6.23      0.00      1.88      0.15      0.00     91.74
02:00:02 AM     all      5.96      0.00      1.87      0.18      0.00     91.99
02:10:01 AM     all      6.29      0.00      1.95      0.22      0.00     91.54
02:20:01 AM     all      5.82      0.00      1.83      0.15      0.00     92.19
02:30:01 AM     all      5.97      0.00      1.86      0.16      0.00     92.01
02:40:01 AM     all      5.93      0.00      1.87      0.20      0.00     92.00
02:50:01 AM     all      6.00      0.00      1.88      0.16      0.00     91.95
03:00:01 AM     all      6.09      0.00      1.88      0.16      0.00     91.87
03:10:01 AM     all      6.55      0.00      2.04      0.24      0.00     91.17
03:20:01 AM     all      5.88      0.00      1.90      0.16      0.00     92.05
03:30:01 AM     all      5.79      0.00      1.82      0.15      0.00     92.24
03:40:01 AM     all      5.79      0.00      1.82      0.15      0.00     92.24
03:50:01 AM     all      5.70      0.01      1.81      0.16      0.00     92.32
04:00:01 AM     all      6.11      0.00      1.83      0.17      0.00     91.89
04:10:01 AM     all      6.13      0.00      1.91      0.22      0.00     91.74
04:20:01 AM     all      5.66      0.00      1.79      0.15      0.00     92.40
04:30:01 AM     all      5.65      0.00      1.76      0.14      0.00     92.45
04:40:01 AM     all      5.71      0.00      1.78      0.19      0.00     92.32
04:50:01 AM     all      5.64      0.00      1.76      0.15      0.00     92.46
05:00:01 AM     all      5.56      0.00      1.76      0.16      0.00     92.52
05:10:01 AM     all      6.15      0.00      1.84      0.20      0.00     91.80

@steve.hoffman I also notice a heavy CPU load after reaching 60 servers, and the GUI breaks.

I am also noticing the same errors.

logger=context t=2023-08-11T04:34:35.701564308Z level=error msg="invalid API key" error="context canceled" traceID=
logger=context userId=0 orgId=0 uname= t=2023-08-11T04:34:35.701966774Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=5744 duration=5.744224386s size=67 referer=
logger=context userId=0 orgId=1 uname= t=2023-08-11T04:34:35.70263873Z level=error msg= error="context canceled" traceID=
logger=context userId=0 orgId=1 uname= t=2023-08-11T04:34:35.702951532Z level=info msg="Request Completed" method=GET path=/api/auth/key status=403 remote_addr=127.0.0.1 time_ms=5043 duration=5.04309565s size=39 referer=
logger=context userId=0 orgId=1 uname= t=2023-08-11T04:34:35.704109717Z level=error msg= error="database is locked" traceID=
logger=context userId=0 orgId=1 uname= t=2023-08-11T04:34:35.704318126Z level=warn msg="failed to update last use date for api key" id=79
logger=context userId=0 orgId=1 uname= t=2023-08-11T04:34:35.704478596Z level=info msg="Request Completed" method=GET path=/api/auth/key status=403 remote_addr=127.0.0.1 time_ms=6658 duration=6.658097051s size=39 referer=
logger=context userId=0 orgId=1 uname= t=2023-08-11T04:34:35.731380264Z level=warn msg="failed to update last use date for api key" id=61
logger=context t=2023-08-11T04:34:35.749897242Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context userId=0 orgId=0 uname= t=2023-08-11T04:34:35.750253504Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=5005 duration=5.005339383s size=67 referer=
logger=context t=2023-08-11T04:34:35.752378549Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context userId=0 orgId=0 uname= t=2023-08-11T04:34:35.752764502Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=5007 duration=5.007008897s size=67 referer=
logger=context t=2023-08-11T04:34:35.756589534Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context userId=0 orgId=0 uname= t=2023-08-11T04:34:35.756952678Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=5009 duration=5.009489243s size=67 referer=
logger=context t=2023-08-11T04:34:35.761626423Z level=error msg="invalid API key" error="database is locked" traceID=
logger=context userId=0 orgId=0 uname= t=2023-08-11T04:34:35.761950738Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=5016 duration=5.016206214s size=67 referer=
logger=context t=2023-08-11T04:34:35.783436Z level=error msg="invalid API key" error="context canceled" traceID=
logger=context t=2023-08-11T04:34:35.783599192Z level=error msg="invalid API key" error="context canceled" traceID=
logger=context userId=0 orgId=0 uname= t=2023-08-11T04:34:35.783839591Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=4132 duration=4.132619051s size=67 referer=
logger=context userId=0 orgId=0 uname= t=2023-08-11T04:34:35.78394484Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=4462 duration=4.462103618s size=67 referer=
logger=context t=2023-08-11T04:34:35.784610207Z level=error msg="invalid API key" error="context canceled" traceID=
logger=context t=2023-08-11T04:34:35.784765303Z level=error msg="invalid API key" error="context canceled" traceID=
logger=context userId=0 orgId=0 uname= t=2023-08-11T04:34:35.784943966Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=4464 duration=4.464013022s size=67 referer=
logger=context userId=0 orgId=0 uname= t=2023-08-11T04:34:35.785122005Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=4951 duration=4.951649648s size=67

Interesting! I see some 500 errors. Is there anything corresponding in the pmm-managed.log file? I’ll raise this in the teams chat, but this seems like there’s a gremlin, so would one of you file a bug in Jira and let me know the ticket?


@steve.hoffman I can see the following errors in the pmm-managed.log file.

WARN[2023-08-11T04:34:47.203+00:00] RPC /server.Server/Readiness done in 97.50645ms with gRPC error: rpc error: code = Internal desc = PMM Server is not ready yet. request=66cf0c13-3800-11ee-9894-0242ac110002

WARN[2023-08-11T04:35:00.339+00:00] RPC /platform.Platform/UserStatus done in 2.306952ms with gRPC error: rpc error: code = Unauthenticated desc = Failed to get user access token. Please sign in using your Percona Portal account. request=6eb1cf8f-3800-11ee-9894-0242ac110002

@steve.hoffman I am not linked to a Portal account, so I think mine might be a bit different… I do see a bunch of unauthenticated calls in the pmm-managed log file… I will admit I was working through an automation script that deployed to hosts. That process, regardless of whether a host was up to date, removed pmm and installed it fresh and set up various items… so I was expecting some invalid entries… I will do some more looking to see if I can find any valid errors in the logs for you, though.

@meyerder I am also not linked to a Portal account.

The portal auth error is a red herring. There’s a check on entitlement for Advisors downloads, and if you’re not authenticated with a Portal account you just get the Advisors bundle for anonymous users.

I think this is a bug, but I can’t for the life of me piece together what would cause it, other than some component from the SQLite to Postgres migration that was accidentally left in. I’ve escalated it to the dev team. If you create a ticket you’ll get automatic updates as we make progress; if not, one of the devs will create one and I’ll try to remember to update this thread so you can follow it manually.


@steve.hoffman I created a Jira bug for the above issue. Could you please follow up with the dev team and fix it in the 2.39 version? We are unable to use PMM because the UI is not responding; otherwise, we will have to remove nodes from the PMM server.

[PMM-12415] The PMM dashboard has a high CPU load, and the UI is unresponsive after adding 100 Servers. - Percona JIRA

NOTE: I removed a few nodes, and now the load has been reduced and the UI is responding fine. I am currently running with 60 nodes for testing purposes, and PMM is doing well. When I added around 40 more servers, the same issue returned.


I can only promise the team will review and prioritize the issue accordingly. If we discover a workaround or config change as a fix, we will get that out right away, but I can’t promise a fix in any particular release version. If you are a Percona customer, please escalate through your CSM and we can possibly issue a custom build until it’s fixed in a public release.


@steve.hoffman Many thanks for your assistance.

@steve.hoffman

Any update on the issue? Have we identified the root cause or a solution?

After upgrading to the 2.39 version, we are seeing some new errors.

logger=context userId=0 orgId=1 uname= t=2023-08-17T04:10:35.128327747Z level=error msg="Failed to list api keys" error="context canceled" remote_addr=127.0.0.1 traceID=
logger=context userId=0 orgId=0 uname= t=2023-08-17T04:10:35.139852137Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=3807 duration=3.807112452s size=67 referer=
logger=context userId=0 orgId=1 uname= t=2023-08-17T04:10:35.139870037Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=3787 duration=3.787860608s size=74 referer= handler=/api/auth/key/
logger=context userId=0 orgId=1 uname= t=2023-08-17T04:10:36.354175758Z level=error msg="Failed to list api keys" error="context canceled" remote_addr=127.0.0.1 traceID=
logger=context userId=0 orgId=1 uname= t=2023-08-17T04:10:36.354500845Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=5025 duration=5.025250716s size=74 referer= handler=/api/auth/key/
logger=context userId=0 orgId=1 uname= t=2023-08-17T04:10:36.397063912Z level=error msg="Failed to list api keys" error="context canceled" remote_addr=127.0.0.1 traceID=
logger=context userId=0 orgId=1 uname= t=2023-08-17T04:10:36.397392477Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=5047 duration=5.047663732s size=74 referer= handler=/api/auth/key/
logger=context userId=0 orgId=1 uname= t=2023-08-17T04:10:36.582042424Z level=error msg="Failed to list api keys" error="context canceled" remote_addr=127.0.0.1 traceID=
logger=context userId=0 orgId=1 uname= t=2023-08-17T04:10:36.582355514Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=5232 duration=5.232576656s size=74 referer= handler=/api/auth/key/
logger=context userId=0 orgId=1 uname= t=2023-08-17T04:10:36.588296253Z level=error msg="Failed to list api keys" error="context canceled" remote_addr=127.0.0.1 traceID=
logger=context userId=0 orgId=1 uname= t=2023-08-17T04:10:36.588585874Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=5238 duration=5.238416784s size=74 referer= handler=/api/auth/key/
logger=context userId=0 orgId=1 uname= t=2023-08-17T04:10:36.904513383Z level=error msg="Failed to list api keys" error="context canceled" remote_addr=127.0.0.1 traceID=
logger=context userId=0 orgId=1 uname= t=2023-08-17T04:10:36.904968563Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=5048 duration=5.048758827s size=74 referer= handler=/api/auth/key/

I haven’t seen anything definitive on the issue, but one of the lead devs suspects we may be overtaxing the Grafana authentication API.
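
For what it’s worth, all of the /api/auth/key requests in the logs above show remote_addr=127.0.0.1, so the load appears to be internal (key validation) rather than coming from browsers. A crude way to gauge how hard that endpoint is being hit (a sketch; it assumes the grafana.log discussed above lives at /srv/logs/grafana.log inside the container):

docker exec -it pmm-server bash -c "grep -c 'path=/api/auth/key' /srv/logs/grafana.log"

If that count climbs sharply as nodes are added, it would fit the "overtaxed authentication API" theory.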

@steve.hoffman Thanks for the update.

One of the PMM users suggested the temporary workaround below.

After applying the workaround, my issue was fixed.

Temporary workaround:

docker exec -it CONTAINER_NAME bash
cd /srv/grafana   (the Grafana data directory in PMM)
and execute this:
sqlite3 grafana.db 'pragma journal_mode=wal;'
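
The same thing can be run non-interactively, and you can confirm it stuck (a sketch, not an official fix; the /srv/grafana/grafana.db path comes from the grafana process arguments earlier in the thread):

docker exec -it pmm-server sqlite3 /srv/grafana/grafana.db 'PRAGMA journal_mode=wal;'
docker exec -it pmm-server sqlite3 /srv/grafana/grafana.db 'PRAGMA journal_mode;'

The second command should print wal. WAL mode is persistent in the database file, so it survives container restarts as long as /srv is kept on a volume.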


Here is a very good thread…

sqlite> pragma journal_mode=wal;


    Unlike the other journaling modes, PRAGMA journal_mode=WAL is persistent. If a process sets WAL mode, then closes and reopens the database, the database will come back in WAL mode. In contrast, if a process sets (for example) PRAGMA journal_mode=TRUNCATE and then closes and reopens the database, the database will come back up in the default rollback mode of DELETE rather than the previous TRUNCATE setting.

    The persistence of WAL mode means that applications can be converted to using SQLite in WAL mode without making any changes to the application itself. One has merely to run "PRAGMA journal_mode=WAL;" on the database file(s) using the command-line shell or other utility, then restart the application.

    The WAL journal mode will be set on all connections to the same database file if it is set on any one connection.

Now, understanding this… and seeing this…

dockerd: time="2023-08-10T09:54:02.540362887-04:00" level=warning msg="Health check for container 7007a8dd1535185db9bfd1c17c8dca68ca7da7d11e337be0b6ea87cd5ebc792e error: context deadline exceeded"

I was thinking, and had started going down the path of checking, whether maybe it wasn’t Grafana but something in containerd (which I am also VERY new to). I looked at the containerd config file and decided that someone other than me needs to figure out what each of those options does… I have not yet looked to see whether SQLite is configured in WAL mode or not…

It looks like the default is DISABLED, and at least from a quick glance at the SQLite docs here
https://sqlite.org/wal.html

it seems like maybe we should move it to true when there are more than XXX nodes. I am willing to try this.

@meyerder Anyhow, PMM is migrating from SQLite to PostgreSQL in the next version, 2.40, so this should only need to be a temporary fix for us.

@Naresh9999 @steve.hoffman Looks like Naresh9999 and I were replying at about the same time…

It looks like that is a setting in the grafana.ini file that might want to be set initially… but we both said the same thing :slight_smile:
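
For reference, a quick way to check whether PMM’s Grafana already has anything set there (a sketch; the grafana.ini path comes from the grafana command line earlier, and wal is Grafana’s own option under [database] for enabling SQLite write-ahead logging):

docker exec -it pmm-server grep -A10 '\[database\]' /etc/grafana/grafana.ini

If nothing relevant comes back, Grafana is running on its defaults (WAL off), and adding wal = true under a [database] section in grafana.ini should make it the default at startup, the same effect as the PRAGMA workaround above.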

@Naresh9999 The only thing (me just being me) is: what changed? I am "guessing" you are on one of the newer versions… If you had 500 before with no issues, why see it now with 100?

@meyerder

If you had 500 before and no issues, why see it now with 100?

No, I have not added 500 servers. If you check the link there, [Lauri] has added 500 servers.

Yes, I had found those same articles, but was hesitant to suggest it since I was not able to reproduce the issue :sweat_smile: