PMM failed while upgrading from 2.41.1 to 2.41.2 through the UI, and now it's unreachable. It keeps throwing an internal server error. How do I recover from this? Kindly assist.
From the UI:
502 Bad Gateway
nginx
# pmm-admin inventory list services
Internal server error… Please check username and password.
logger=sqlstore t=2024-06-18T18:55:16.549539071Z level=info msg="Connecting to DB" dbtype=sqlite3
logger=migrator t=2024-06-18T18:55:16.55010401Z level=error msg="alert migration failure: could not get migration log" error="failed to check table existence: unable to open database file: permission denied"
How can I resolve this, and why did it not occur in previous upgrades?
Hi @ademidoff ,
I haven't set up the PMM server with any customizations on my end; I'm using the Docker installation of pmm-server. I upgraded from 2.41.1 to 2.41.2 and ran into this error. Please advise on how to recover this setup, as it has been in a bad state for a couple of days.
OK, let’s check a couple of things to be sure your configuration is no different from the default one.
Check the contents of your /etc/grafana/grafana.ini file, specifically the [database] and [paths] sections. Do they look the same as below? (The grep sketch after the excerpt is a quick way to dump just those sections.)
[database]
# You can configure the database connection by specifying type, host, name, user and password
# as separate properties or as on string using the url properties.
# Either "mysql", "postgres" or "sqlite3", it's your choice
type = postgres
host = localhost
user = grafana
# If the password contains # or ; you have to wrap it with triple quotes. Ex """#password;"""
password = <redacted>
[paths]
# Directory where grafana will automatically scan and look for plugins
plugins = /srv/grafana/plugins
# Directory where grafana can store logs
logs = /srv/logs
# Path to where grafana can store temp files, sessions, and the sqlite3 db (if that is used)
data = /srv/grafana
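A quick way to print just those two sections for comparison (a rough grep, assuming the default layout shown above; adjust -A if your sections are longer):

grep -A 8 -E '^\[(database|paths)\]' /etc/grafana/grafana.ini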
Does the directory /srv/postgres14 exist? Do you get similar output if you run ls -la /srv/postgres14:
[root@7ee7a9b459b3 opt] # ls -la /srv/postgres14
total 68
drwx------ 19 postgres postgres 4096 Jun 24 16:09 .
drwxr-xr-x 12 root root 197 Jun 24 16:09 ..
drwx------ 7 postgres postgres 67 Jun 24 16:09 base
drwx------ 2 postgres postgres 4096 Jun 24 16:10 global
drwx------ 2 postgres postgres 6 Jun 23 19:44 pg_commit_ts
drwx------ 2 postgres postgres 6 Jun 23 19:44 pg_dynshmem
-rw------- 1 postgres postgres 4789 Jun 23 19:44 pg_hba.conf
-rw------- 1 postgres postgres 1636 Jun 23 19:44 pg_ident.conf
drwx------ 4 postgres postgres 68 Jun 24 16:14 pg_logical
drwx------ 4 postgres postgres 36 Jun 24 16:09 pg_multixact
drwx------ 2 postgres postgres 6 Jun 23 19:44 pg_notify
drwx------ 2 postgres postgres 6 Jun 23 19:44 pg_replslot
drwx------ 2 postgres postgres 6 Jun 23 19:44 pg_serial
drwx------ 2 postgres postgres 6 Jun 23 19:44 pg_snapshots
drwx------ 2 postgres postgres 6 Jun 24 16:09 pg_stat
drwx------ 2 postgres postgres 134 Jun 24 16:18 pg_stat_tmp
drwx------ 2 postgres postgres 18 Jun 24 16:09 pg_subtrans
drwx------ 2 postgres postgres 6 Jun 23 19:44 pg_tblspc
drwx------ 2 postgres postgres 6 Jun 23 19:44 pg_twophase
-rw------- 1 postgres postgres 3 Jun 23 19:44 PG_VERSION
drwx------ 3 postgres postgres 92 Jun 24 16:09 pg_wal
drwx------ 2 postgres postgres 18 Jun 24 16:09 pg_xact
-rw------- 1 postgres postgres 88 Jun 23 19:44 postgresql.auto.conf
-rw------- 1 postgres postgres 28734 Jun 23 19:44 postgresql.conf
-rw------- 1 postgres postgres 237 Jun 24 16:09 postmaster.opts
-rw------- 1 postgres postgres 90 Jun 24 16:09 postmaster.pid
Is your directory /srv/backup empty? If not, what does it contain?
Starting with PMM 2.40.0, PMM uses PostgreSQL to maintain the Grafana settings (previous versions used an SQLite database). That release included an automated migration for users who started with a version of PMM < 2.40.0. While I see your issue came up going from 2.41.1 to 2.41.2, do you know what version of PMM you started with?
To start checking, we need to see whether you actually have the grafana database for settings.
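A minimal check, assuming the default PMM container layout (PostgreSQL on the local socket, reachable as the postgres user; adjust if you are prompted for a password):

# does a 'grafana' database exist?
su - postgres -c "psql -l"
# list the tables in it
su - postgres -c "psql -d grafana -c '\dt'"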
You should get roughly 113 or so rows returned; that at least gives confidence that the DB was created (there may still be an issue with it, but we can get there eventually).
Now take a look at your /etc/grafana/grafana.ini, particularly the [database] section. Do you see the following? (I suspect not, based on the error.)
[database]
type = postgres
host = localhost
user = grafana
password = grafana
That will give a better indication of where to look next.
# cat /etc/grafana/grafana.ini
##################### Grafana Configuration #####################
# Only changed settings. You can find default settings in /usr/share/grafana/conf/defaults.ini
#################################### Database ####################################
[database]
# You can configure the database connection by specifying type, host, name, user and password
# as separate properties or as on string using the url properties.
# Either "mysql", "postgres" or "sqlite3", it's your choice
# If the password contains # or ; you have to wrap it with triple quotes. Ex """#password;"""
[paths]
# Directory where grafana will automatically scan and look for plugins
plugins = /srv/grafana/plugins
# Directory where grafana can store logs
logs = /srv/logs
# Path to where grafana can store temp files, sessions, and the sqlite3 db (if that is used)
data = /srv/grafana
#################################### Logging ##########################
[log]
ls -la /srv/postgres14
total 76
drwx------. 19 postgres postgres 4096 Jun 21 19:14 .
drwxr-xr-x. 14 root root 4096 Jun 21 19:13 ..
drwx------. 7 postgres postgres 67 Jun 21 19:12 base
drwx------. 2 postgres postgres 4096 Jun 24 05:53 global
drwx------. 2 postgres postgres 6 Mar 19 21:15 pg_commit_ts
drwx------. 2 postgres postgres 6 Mar 19 21:15 pg_dynshmem
-rw-------. 1 postgres postgres 4789 Mar 19 21:15 pg_hba.conf
-rw-------. 1 postgres postgres 1636 Mar 19 21:15 pg_ident.conf
drwx------. 4 postgres postgres 68 Jun 25 19:01 pg_logical
drwx------. 4 postgres postgres 36 Mar 27 22:12 pg_multixact
drwx------. 2 postgres postgres 6 Mar 19 21:15 pg_notify
drwx------. 2 postgres postgres 6 Mar 19 21:15 pg_replslot
drwx------. 2 postgres postgres 6 Mar 19 21:15 pg_serial
drwx------. 2 postgres postgres 6 Mar 19 21:15 pg_snapshots
drwx------. 2 postgres postgres 6 Jun 21 19:14 pg_stat
drwx------. 2 postgres postgres 134 Jun 25 19:03 pg_stat_tmp
drwx------. 2 postgres postgres 18 Jun 17 17:29 pg_subtrans
drwx------. 2 postgres postgres 6 Mar 19 21:15 pg_tblspc
drwx------. 2 postgres postgres 6 Mar 19 21:15 pg_twophase
-rw-------. 1 postgres postgres 3 Mar 19 21:15 PG_VERSION
drwx------. 3 postgres postgres 92 Jun 24 15:00 pg_wal
drwx------. 2 postgres postgres 4096 Jun 16 01:03 pg_xact
-rw-------. 1 postgres postgres 88 Mar 19 21:15 postgresql.auto.conf
-rw-------. 1 postgres postgres 28742 Mar 19 21:15 postgresql.conf
-rw-------. 1 postgres postgres 237 Jun 21 19:14 postmaster.opts
-rw-------. 1 postgres postgres 90 Jun 21 19:14 postmaster.pid
# cd /srv/backup
# ls -ltr
total 0
I assume that’s not an all-inclusive list of tables in the grafana schema, but do you see a dashboard table or a permission table? Those would be indicators that the Grafana DB has been prepped to handle all the Grafana-specific settings. (Alert tables may be there as well, but I know we built our own alerting feature before Grafana did, so I don’t want to assume.)
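One direct way to check for those indicator tables (a sketch, assuming local socket access as the postgres user; to_regclass returns NULL for a table that doesn't exist):

su - postgres -c "psql -d grafana -c \"SELECT to_regclass('public.dashboard') AS dashboard, to_regclass('public.permission') AS permission;\""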
It may be as simple as just needing to add
type = postgres
host = localhost
user = grafana
password = grafana
to the [database] section of your /etc/grafana/grafana.ini and restarting Grafana with supervisorctl restart grafana. A non-interactive way to apply that is sketched below.
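This one-liner appends the settings under the existing [database] header (assumptions: GNU sed, as shipped in the PMM container, and a [database] section that is currently empty):

# append the four settings right after the [database] header (GNU sed \n extension)
sed -i '/^\[database\]/a type = postgres\nhost = localhost\nuser = grafana\npassword = grafana' /etc/grafana/grafana.ini
supervisorctl restart grafana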
There is also a world where you simply have a permission issue on the SQLite DB (it would be /srv/grafana/grafana.db, and it should be owned by grafana:grafana). It's possible the wrong permissions prevented both Grafana from starting and the migration from happening (just a guess).
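To check the ownership and fix it if it's off (run inside the container; path per the default [paths] section above):

ls -la /srv/grafana/grafana.db
chown grafana:grafana /srv/grafana/grafana.db
supervisorctl restart grafana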
FYI: Running pmm-admin list or its longer equivalent pmm-admin inventory list services is supposed to yield an error ...Invalid API key, since the agent config located at /usr/local/percona/pmm2/config/pmm-agent.yaml does not contain credentials.
However, if you are able to see the inventory page at https://your-pmm-server/graph/inventory/services, then we should be good.
Also, consider stopping the prometheus job with supervisorctl stop prometheus, as it likely prevents the PMM container from reaching healthy status.
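To verify (the container name pmm-server below is an assumption; substitute yours):

# inside the container: list all supervised jobs and their states
supervisorctl status
# on the host: check the container's reported health
docker inspect --format '{{.State.Health.Status}}' pmm-server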