Database is Locked (PMM 2.38.1)

@steve.hoffman I created a Jira bug for the above issue. Could you please follow up with the dev team and get it fixed in the 2.39 release? We are unable to use PMM because the UI is not responding; otherwise, we will have to remove the nodes from the PMM server.

[PMM-12415] The PMM dashboard has a high CPU load, and the UI is unresponsive after adding 100 Servers. - Percona JIRA

NOTE: I removed a few nodes; the load has come down and the UI is responding fine. I am now running 60 nodes for testing purposes, and PMM is doing well. But when I added around 40 more servers, the same issue came back.


I can only promise the team will review and prioritize the issue accordingly. If we discover a workaround or config change as a fix, we will get that out right away, but I can’t promise a fix in any particular release version. If you are a Percona customer, please escalate through your CSM and we can possibly issue a custom build until it’s fixed in a public release.


@steve.hoffman Many thanks for your assistance.

@steve.hoffman

Any update on the issue? Did we identify the root cause or a solution?

After upgrading to version 2.39, we are seeing some new errors.

logger=context userId=0 orgId=1 uname= t=2023-08-17T04:10:35.128327747Z level=error msg="Failed to list api keys" error="context canceled" remote_addr=127.0.0.1 traceID=
logger=context userId=0 orgId=0 uname= t=2023-08-17T04:10:35.139852137Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=3807 duration=3.807112452s size=67 referer=
logger=context userId=0 orgId=1 uname= t=2023-08-17T04:10:35.139870037Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=3787 duration=3.787860608s size=74 referer= handler=/api/auth/key/
logger=context userId=0 orgId=1 uname= t=2023-08-17T04:10:36.354175758Z level=error msg="Failed to list api keys" error="context canceled" remote_addr=127.0.0.1 traceID=
logger=context userId=0 orgId=1 uname= t=2023-08-17T04:10:36.354500845Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=5025 duration=5.025250716s size=74 referer= handler=/api/auth/key/
logger=context userId=0 orgId=1 uname= t=2023-08-17T04:10:36.397063912Z level=error msg="Failed to list api keys" error="context canceled" remote_addr=127.0.0.1 traceID=
logger=context userId=0 orgId=1 uname= t=2023-08-17T04:10:36.397392477Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=5047 duration=5.047663732s size=74 referer= handler=/api/auth/key/
logger=context userId=0 orgId=1 uname= t=2023-08-17T04:10:36.582042424Z level=error msg="Failed to list api keys" error="context canceled" remote_addr=127.0.0.1 traceID=
logger=context userId=0 orgId=1 uname= t=2023-08-17T04:10:36.582355514Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=5232 duration=5.232576656s size=74 referer= handler=/api/auth/key/
logger=context userId=0 orgId=1 uname= t=2023-08-17T04:10:36.588296253Z level=error msg="Failed to list api keys" error="context canceled" remote_addr=127.0.0.1 traceID=
logger=context userId=0 orgId=1 uname= t=2023-08-17T04:10:36.588585874Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=5238 duration=5.238416784s size=74 referer= handler=/api/auth/key/
logger=context userId=0 orgId=1 uname= t=2023-08-17T04:10:36.904513383Z level=error msg="Failed to list api keys" error="context canceled" remote_addr=127.0.0.1 traceID=
logger=context userId=0 orgId=1 uname= t=2023-08-17T04:10:36.904968563Z level=error msg="Request Completed" method=GET path=/api/auth/key status=500 remote_addr=127.0.0.1 time_ms=5048 duration=5.048758827s size=74 referer= handler=/api/auth/key/

I haven’t seen anything definitive on the issue, but one of the lead devs suspects that we may be overtaxing the Grafana authentication API.
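
For anyone wanting to gauge how hard that endpoint is being hit, a rough count of the requests (assuming the Grafana log is at /srv/logs/grafana.log inside the pmm-server container, which is where PMM 2 normally writes it):

grep -c 'path=/api/auth/key' /srv/logs/grafana.log   # number of /api/auth/key requests logged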

@steve.hoffman Thanks for the update.

One of the PMM users suggested the temporary fix below.

After applying this workaround, my issue was fixed.

Temp Workaround:

docker exec -it CONTAINER_NAME bash
cd /srv/grafana   # the directory containing grafana.db (run: find / -name grafana.db if yours differs)
sqlite3 grafana.db 'pragma journal_mode=wal;'
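
To confirm the change took effect, query the mode again; sqlite3 prints the active journal mode:

sqlite3 grafana.db 'pragma journal_mode;'   # should print: wal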


Here is a very good thread…

sqlite> pragma journal_mode=wal;


    Unlike the other journaling modes, PRAGMA journal_mode=WAL is persistent. If a process sets WAL mode, then closes and reopens the database, the database will come back in WAL mode. In contrast, if a process sets (for example) PRAGMA journal_mode=TRUNCATE and then closes and reopens the database, the database will come back up in the default rollback mode of DELETE rather than the previous TRUNCATE setting.

    The persistence of WAL mode means that applications can be converted to using SQLite in WAL mode without making any changes to the application itself. One has merely to run "PRAGMA journal_mode=WAL;" on the database file(s) using the command-line shell or other utility, then restart the application.

    The WAL journal mode will be set on all connections to the same database file if it is set on any one connection.
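
A visible side effect, for what it’s worth: while a connection holds a WAL-mode database open, SQLite keeps two sidecar files next to it, so a quick ls is another way to confirm the mode:

ls grafana.db*   # in WAL mode, expect grafana.db-wal and grafana.db-shm alongside grafana.db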

Now, understanding this… and seeing THIS…

dockerd: time="2023-08-10T09:54:02.540362887-04:00" level=warning msg="Health check for container 7007a8dd1535185db9bfd1c17c8dca68ca7da7d11e337be0b6ea87cd5ebc792e error: context deadline exceeded"

I was thinking, and going down the path of looking at whether maybe it wasn’t Grafana but something in containerd (which I am also VERY new to). I looked at the containerd config file and decided someone other than me needs to figure out what each of those options does. I had not yet looked to see if SQLite is configured in WAL mode or not…

It looks like the default is disabled, and at least from a quick glance at the SQLite docs here
https://sqlite.org/wal.html

it seems like maybe we should move it to true when running more than XXX nodes. I am willing to try this.

@meyerder Anyhow, PMM is migrating from SQLite to PostgreSQL in the next version, 2.40, so this will only be a temporary fix for us.

@Naresh9999 @steve.hoffman Looks like Naresh9999 and I were replying about the same time…

It looks like that is a setting in the grafana.ini file that might want to be set initially… But we both said the same thing :slight_smile:
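
For reference, this appears to be the knob in Grafana’s stock defaults (a sketch; whether PMM’s bundled grafana.ini picks it up the same way is an assumption on my part):

[database]
# for "sqlite3" only: enable the Write-Ahead Logging file (WAL)
wal = true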

@Naresh9999 The only thing (me just being me) is: what changed? I am guessing you are on one of the newer versions… If you had 500 before with no issues, why see it now with 100?

@meyerder

If you had 500 before and no issues, why see it now with 100?

No, I have not added 500 servers. If you check the link there, it was @Lauri who added 500 servers.

Yes, I had found those same articles but was hesitant to suggest it, not being able to reproduce the issue :sweat_smile:

@steve.hoffman @meyerder

But for me, the issue still exists. I am again trying to add the servers to PMM, and again I am facing the same issue. :smiling_face_with_tear:

@steve.hoffman After crossing 100 nodes again, I faced the same issue.

Hi @Naresh9999, can I ask two questions?
Have you migrated to service accounts? Is it possible that one of your old PMM Clients is still configured to the old server, or has an incorrect API key and keeps trying to connect to the PMM Server?
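
If it is the latter, running pmm-admin status on each client host will show which PMM Server that client is configured against:

pmm-admin status   # output includes the PMM Server address the client talks to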

@nurlan I have never migrated to service accounts. This is a new server I created from scratch.
Is there any way I can identify which server keeps trying to connect?

For what it is worth… I did a search for grafana.db inside the Docker container and found two of them… Not quite sure why, or whether I somehow created the second one myself…

But once I ran pragma journal_mode=wal on both files, I have not seen the issue (yet). I did NOT think to look at the other potential settings before changing this.

Additionally, I found that it tends to happen more often when I work quickly with the Alerting side (modifying alerts after starting from a template base that comes with PMM).

@meyerder

Can you please share the locations?
Or can you guide me on how to search for them?

Inside the Docker container:

find / -name grafana.db

Then cd to that directory. Also research a few other tweaks you can do on the file. I am running 80 nodes now with no issues.

@meyerder

I found the below files.

bash-5.1# find / -name grafana.db
/etc/grafana/grafana.db
/opt/grafana.db
/srv/grafana/grafana.db
/usr/share/grafana/grafana.db
bash-5.1#
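
A small sketch to see which of those copies is already in WAL mode (assumes sqlite3 is on the container’s PATH; on a PMM 2 container the live one should be /srv/grafana/grafana.db):

for f in /etc/grafana/grafana.db /opt/grafana.db /srv/grafana/grafana.db /usr/share/grafana/grafana.db; do
  echo "$f: $(sqlite3 "$f" 'pragma journal_mode;')"
done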

@Naresh9999 @Lauri

While I do NOT know if it is recommended, I have not seen the issue since I did the following as well as the above (again on each grafana.db file, as I had two but was not sure WHY I had two).
Do this at your own risk… I only did the first three.

When you issue the commands, you MUST be in the directory that the grafana.db file is in and open it with sqlite3 grafana.db first… else it does nothing.

sqlite> pragma journal_mode=wal;
sqlite> PRAGMA cache_size = 10000;        -- Set the cache size to 10000 pages
sqlite> PRAGMA synchronous = OFF;         -- Disable synchronous writes (less durable, more performance)
sqlite> PRAGMA locking_mode = EXCLUSIVE;  -- Use exclusive locking mode
sqlite> PRAGMA busy_timeout = 1000;       -- Wait up to 1 second on locks
sqlite> PRAGMA temp_store = MEMORY;       -- Use memory for temporary storage (if feasible)
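
Worth noting: of these, only pragma journal_mode=wal actually persists in the database file; the others are per-connection settings that reset when the sqlite3 shell exits, so Grafana’s own connection will not inherit them. A quick check from a fresh shell:

sqlite3 grafana.db 'pragma journal_mode;'   # still prints: wal
sqlite3 grafana.db 'pragma synchronous;'    # back to the default (2 = FULL), not 0 (OFF)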
Also, if you are like me and new to Docker: make sure you commit the changes as well (and then I normally restart the container).
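
For example (a sketch; the container and image names are illustrative, not ones PMM requires):

docker commit pmm-server pmm-server-wal   # snapshot the modified container into a new image
docker restart pmm-server                 # then bounce the running container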