Unable to recover PMM after a failed upgrade to 2.41.2

Hello,

PMM failed while upgrading from 2.41.1 to 2.41.2 through the UI, and now it’s unreachable. It keeps throwing an internal server error. How can I recover from this error? Kindly assist.

From the UI:
502 Bad Gateway
nginx

# pmm-admin inventory list services
Internal server error… Please check username and password.

DEBUG 2024-06-17 20:42:46.00139246Z: /tmp/go/pkg/mod/github.com/go-openapi/runtime@v0.27.1/client/runtime.go:513 github.com/go-openapi/runtime/client.(*Runtime).Submit() HTTP/1.1 401 Unauthorized
Content-Length: 80
Connection: keep-alive
Content-Type: application/json
Date: Mon, 17 Jun 2024 20:42:46 GMT
Server: nginx
Strict-Transport-Security: max-age=63072000; includeSubdomains;

{"code":13,"error":"Internal server error.","message":"Internal server error."}

DEBUG 2024-06-17 20:42:46.001439572Z: /tmp/go/src/github.com/percona/pmm/admin/cli/cli.go:132 github.com/percona/pmm/admin/cli.printResponse() Result:
DEBUG 2024-06-17 20:42:46.001455405Z: /tmp/go/src/github.com/percona/pmm/admin/cli/cli.go:133 github.com/percona/pmm/admin/cli.printResponse() Error: &services.ListServicesDefault{_statusCode:401, Payload:(*services.ListServicesDefaultBody)(0xc0006c71a0)}
Internal server error… Please check username and password.

Hi,

Can you run the following on the server host to see which service is not running: docker exec -it pmm-server supervisorctl status

Also, please check whether you can find any clues in the server logs located in /srv/logs. That usually helps to find the root cause.
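To make the services that are not running stand out in the status output, a small filter can help. This is only a sketch: `not_running` is a hypothetical helper name, and it assumes the usual `supervisorctl status` layout where the first column is the service name and the second is its state.

```shell
# List supervisor-managed services whose state is anything other than RUNNING.
# Reads `supervisorctl status` output on stdin and prints "<name> <state>" per line.
# (not_running is a hypothetical helper name, not part of PMM or supervisor.)
not_running() {
    awk 'NF >= 2 && $2 != "RUNNING" { print $1, $2 }'
}

# Usage on the Docker host (container name as used in this thread):
#   docker exec pmm-server supervisorctl status | not_running
```

Not every STOPPED or EXITED entry is necessarily a problem; a FATAL state is usually the one worth chasing first in the logs.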

Alex

docker exec -it pmm-server supervisorctl status

alertmanager RUNNING pid 32, uptime 2 days, 3:08:58
clickhouse RUNNING pid 15, uptime 2 days, 3:08:58
dbaas-controller STOPPED Not started
grafana FATAL Exited too quickly (process log may have details)
nginx RUNNING pid 25, uptime 2 days, 3:08:58
pmm-agent RUNNING pid 293, uptime 2 days, 3:08:57
pmm-managed RUNNING pid 41, uptime 2 days, 3:08:58
pmm-update-perform STOPPED Not started
pmm-update-perform-init EXITED Jun 18 06:50 PM
postgresql RUNNING pid 14, uptime 2 days, 3:08:58
prometheus STOPPED Not started
qan-api2 RUNNING pid 653, uptime 2 days, 3:08:54
victoriametrics RUNNING pid 27, uptime 2 days, 3:08:58
vmalert RUNNING pid 28, uptime 2 days, 3:08:58
vmproxy RUNNING pid 33, uptime 2 days, 3:08:58

I can see that Grafana doesn’t start. The best next step is to check the Grafana logs for errors.

You can do so by running: docker exec -it pmm-server tail -f /srv/logs/grafana.log.
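Since the Grafana log is mostly info-level lines, it may be quicker to filter for errors only. The sketch below assumes the log uses `level=error` for error lines; `grafana_errors` is a hypothetical helper name.

```shell
# Print only the error-level lines from a Grafana log read on stdin.
# Assumes the logfmt-style "level=error" marker; other severities are ignored.
# (grafana_errors is a hypothetical helper name.)
grafana_errors() {
    grep 'level=error'
}

# Usage on the Docker host (no -f, so the command returns once the file is read):
#   docker exec pmm-server tail -n 500 /srv/logs/grafana.log | grafana_errors
```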

Alex

docker exec -it pmm-server tail -f /srv/logs/grafana.log

logger=settings t=2024-06-18T18:55:16.549450946Z level=info msg="Config loaded from" file=/usr/share/grafana/conf/defaults.ini
logger=settings t=2024-06-18T18:55:16.549458348Z level=info msg="Config loaded from" file=/etc/grafana/grafana.ini
logger=settings t=2024-06-18T18:55:16.5494619Z level=info msg="Path Home" path=/usr/share/grafana
logger=settings t=2024-06-18T18:55:16.549464933Z level=info msg="Path Data" path=/srv/grafana
logger=settings t=2024-06-18T18:55:16.549468822Z level=info msg="Path Logs" path=/srv/logs
logger=settings t=2024-06-18T18:55:16.549471694Z level=info msg="Path Plugins" path=/srv/grafana/plugins
logger=settings t=2024-06-18T18:55:16.549474442Z level=info msg="Path Provisioning" path=/usr/share/grafana/conf/provisioning
logger=settings t=2024-06-18T18:55:16.549477037Z level=info msg="App mode production"
logger=sqlstore t=2024-06-18T18:55:16.549539071Z level=info msg="Connecting to DB" dbtype=sqlite3
logger=migrator t=2024-06-18T18:55:16.55010401Z level=error msg="alert migration failure: could not get migration log" error="failed to check table existence: unable to open database file: permission denied"

How can I resolve this, and why did it not occur in previous upgrades?

I see Grafana tries to connect to sqlite3 database and fails. What version of PMM Server did you migrate from?

Please note that recent versions of PMM Server use a PostgreSQL database for persistence. Sqlite3 won’t work.

Could it be that you mounted your own grafana.ini config file into the container? If that is the case, please check this thread.

Alex

Hi @ademidoff,
I haven’t set up PMM Server with any customizations on my end. I’m using the Docker installation of pmm-server. I upgraded from 2.41.1 to 2.41.2 and ran into this error. Please advise how to recover this setup, as it has been in a bad state for a couple of days.

Thanks.

OK, let’s check a couple of things to be sure your configuration is no different from the default one.

  1. Please check the contents of your /etc/grafana/grafana.ini file, specifically the two sections [database] and [paths]. Do they look the same as below?
[database]
# You can configure the database connection by specifying type, host, name, user and password
# as separate properties or as on string using the url properties.

# Either "mysql", "postgres" or "sqlite3", it's your choice
type = postgres
host = localhost
user = grafana
# If the password contains # or ; you have to wrap it with triple quotes. Ex """#password;"""
password = <redacted>

[paths]
# Directory where grafana will automatically scan and look for plugins
plugins = /srv/grafana/plugins
# Directory where grafana can store logs
logs = /srv/logs
# Path to where grafana can store temp files, sessions, and the sqlite3 db (if that is used)
data = /srv/grafana
  2. Does the directory /srv/postgres14 exist? Do you get the same output if you run ls -la /srv/postgres14:
[root@7ee7a9b459b3 opt] # ls -la /srv/postgres14
total 68
drwx------ 19 postgres postgres  4096 Jun 24 16:09 .
drwxr-xr-x 12 root     root       197 Jun 24 16:09 ..
drwx------  7 postgres postgres    67 Jun 24 16:09 base
drwx------  2 postgres postgres  4096 Jun 24 16:10 global
drwx------  2 postgres postgres     6 Jun 23 19:44 pg_commit_ts
drwx------  2 postgres postgres     6 Jun 23 19:44 pg_dynshmem
-rw-------  1 postgres postgres  4789 Jun 23 19:44 pg_hba.conf
-rw-------  1 postgres postgres  1636 Jun 23 19:44 pg_ident.conf
drwx------  4 postgres postgres    68 Jun 24 16:14 pg_logical
drwx------  4 postgres postgres    36 Jun 24 16:09 pg_multixact
drwx------  2 postgres postgres     6 Jun 23 19:44 pg_notify
drwx------  2 postgres postgres     6 Jun 23 19:44 pg_replslot
drwx------  2 postgres postgres     6 Jun 23 19:44 pg_serial
drwx------  2 postgres postgres     6 Jun 23 19:44 pg_snapshots
drwx------  2 postgres postgres     6 Jun 24 16:09 pg_stat
drwx------  2 postgres postgres   134 Jun 24 16:18 pg_stat_tmp
drwx------  2 postgres postgres    18 Jun 24 16:09 pg_subtrans
drwx------  2 postgres postgres     6 Jun 23 19:44 pg_tblspc
drwx------  2 postgres postgres     6 Jun 23 19:44 pg_twophase
-rw-------  1 postgres postgres     3 Jun 23 19:44 PG_VERSION
drwx------  3 postgres postgres    92 Jun 24 16:09 pg_wal
drwx------  2 postgres postgres    18 Jun 24 16:09 pg_xact
-rw-------  1 postgres postgres    88 Jun 23 19:44 postgresql.auto.conf
-rw-------  1 postgres postgres 28734 Jun 23 19:44 postgresql.conf
-rw-------  1 postgres postgres   237 Jun 24 16:09 postmaster.opts
-rw-------  1 postgres postgres    90 Jun 24 16:09 postmaster.pid
  3. Is your directory /srv/backup empty? If not, what does it contain?
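For the grafana.ini check, it can help to print only the active settings of a section, leaving out comments and blank lines. The following is a sketch, not an official tool: `ini_section` is a hypothetical helper, and it assumes section headers sit on their own line as in the file shown above.

```shell
# Print the active (non-comment, non-blank) settings of one section of an INI file.
# $1 = section name without brackets, $2 = path to the INI file.
# (ini_section is a hypothetical helper name.)
ini_section() {
    awk -v want="[$1]" '
        /^[[:space:]]*\[/ {                       # a section header line
            h = $0
            gsub(/^[[:space:]]+|[[:space:]]+$/, "", h)
            in_section = (h == want)
            next
        }
        in_section && !/^[[:space:]]*[#;]/ && !/^[[:space:]]*$/ { print }
    ' "$2"
}

# Usage inside the container:
#   ini_section database /etc/grafana/grafana.ini
#   ini_section paths /etc/grafana/grafana.ini
```

If the [database] call prints nothing, no database settings are active in that file, and Grafana would fall back to the values in its defaults.ini, where the database type is sqlite3.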

Alex

Starting with PMM 2.40.0, PMM uses PostgreSQL to store the Grafana settings (previous versions used an SQLite database). That release included an automated migration for users who started with a version of PMM < 2.40.0. While I see your issue came up going from 2.41.1 to 2.41.2, do you know what version of PMM you started with?

To start checking, we need to see whether you actually have the Grafana database for settings:

docker exec -it pmm-server bash
psql -U grafana
\d

You should get roughly 113 or so rows returned. That at least gives confidence that the DB was created (there may be an issue with it, but we can get to that eventually).
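Rather than counting rows by eye, the relation list can be counted from psql's unaligned, tuples-only output. This is a sketch under that assumption; `count_relations` is a hypothetical helper name, and the count includes sequences as well as tables, just like the interactive \d listing.

```shell
# Count the relations listed by psql for the current database.
# Expects unaligned, tuples-only output (psql -At) on stdin, one
# pipe-separated relation per line, e.g. "public|alert|table|grafana".
# (count_relations is a hypothetical helper name.)
count_relations() {
    awk -F'|' 'NF >= 2 { n++ } END { print n + 0 }'
}

# Usage inside the container:
#   psql -U grafana -At -c '\d' | count_relations
```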

Now take a look at your /etc/grafana/grafana.ini, particularly the [database] section. Do you see the following? (I suspect not, based on the error.)

[database]
type = postgres
host = localhost
user = grafana
password = grafana

That will give a better indication of where to look next.

Please check below.

# cat /etc/grafana/grafana.ini
##################### Grafana Configuration #####################
# Only changed settings. You can find default settings in /usr/share/grafana/conf/defaults.ini

#################################### Database ####################################
[database]
# You can configure the database connection by specifying type, host, name, user and password
# as separate properties or as on string using the url properties.

# Either "mysql", "postgres" or "sqlite3", it's your choice
# If the password contains # or ; you have to wrap it with triple quotes. Ex """#password;"""

[paths]
# Directory where grafana will automatically scan and look for plugins
plugins = /srv/grafana/plugins
# Directory where grafana can store logs
logs = /srv/logs
# Path to where grafana can store temp files, sessions, and the sqlite3 db (if that is used)
data = /srv/grafana

#################################### Logging ##########################
[log]


ls -la /srv/postgres14
total 76
drwx------. 19 postgres postgres  4096 Jun 21 19:14 .
drwxr-xr-x. 14 root     root      4096 Jun 21 19:13 ..
drwx------.  7 postgres postgres    67 Jun 21 19:12 base
drwx------.  2 postgres postgres  4096 Jun 24 05:53 global
drwx------.  2 postgres postgres     6 Mar 19 21:15 pg_commit_ts
drwx------.  2 postgres postgres     6 Mar 19 21:15 pg_dynshmem
-rw-------.  1 postgres postgres  4789 Mar 19 21:15 pg_hba.conf
-rw-------.  1 postgres postgres  1636 Mar 19 21:15 pg_ident.conf
drwx------.  4 postgres postgres    68 Jun 25 19:01 pg_logical
drwx------.  4 postgres postgres    36 Mar 27 22:12 pg_multixact
drwx------.  2 postgres postgres     6 Mar 19 21:15 pg_notify
drwx------.  2 postgres postgres     6 Mar 19 21:15 pg_replslot
drwx------.  2 postgres postgres     6 Mar 19 21:15 pg_serial
drwx------.  2 postgres postgres     6 Mar 19 21:15 pg_snapshots
drwx------.  2 postgres postgres     6 Jun 21 19:14 pg_stat
drwx------.  2 postgres postgres   134 Jun 25 19:03 pg_stat_tmp
drwx------.  2 postgres postgres    18 Jun 17 17:29 pg_subtrans
drwx------.  2 postgres postgres     6 Mar 19 21:15 pg_tblspc
drwx------.  2 postgres postgres     6 Mar 19 21:15 pg_twophase
-rw-------.  1 postgres postgres     3 Mar 19 21:15 PG_VERSION
drwx------.  3 postgres postgres    92 Jun 24 15:00 pg_wal
drwx------.  2 postgres postgres  4096 Jun 16 01:03 pg_xact
-rw-------.  1 postgres postgres    88 Mar 19 21:15 postgresql.auto.conf
-rw-------.  1 postgres postgres 28742 Mar 19 21:15 postgresql.conf
-rw-------.  1 postgres postgres   237 Jun 21 19:14 postmaster.opts
-rw-------.  1 postgres postgres    90 Jun 21 19:14 postmaster.pid


cd /srv/backup 
# ls -ltr
total 0

Hi @steve.hoffman

I don’t remember the version I started with; it must have been 2.38.

I see a lot of tables in the database.


# docker exec -it pmm-server bash
grafana=> \l+
                                                                    List of databases
    Name     |  Owner   | Encoding |  Collate   |   Ctype    |   Access privileges   |  Size   | Tablespace |                Description                 
-------------+----------+----------+------------+------------+-----------------------+---------+------------+--------------------------------------------
 grafana     | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =Tc/postgres         +| 15 MB   | pg_default | 
             |          |          |            |            | postgres=CTc/postgres+|         |            | 
             |          |          |            |            | grafana=CTc/postgres  |         |            | 
 pmm-managed | postgres | UTF8     | en_US.utf8 | en_US.utf8 |                       | 9681 kB | pg_default | 
 postgres    | postgres | UTF8     | en_US.utf8 | en_US.utf8 |                       | 8657 kB | pg_default | default administrative connection database
 template0   | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +| 8433 kB | pg_default | unmodifiable empty database
             |          |          |            |            | postgres=CTc/postgres |         |            | 
 template1   | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +| 8433 kB | pg_default | default template for new databases
             |          |          |            |            | postgres=CTc/postgres |         |            | 
(5 rows)



 # psql -U grafana
psql (14.12 - Percona Distribution)
Type "help" for help.

grafana=> \d
                        List of relations
 Schema |               Name                |   Type   |  Owner  
--------+-----------------------------------+----------+---------
 public | alert                             | table    | grafana
 public | alert_configuration               | table    | grafana
 public | alert_configuration_id_seq        | sequence | grafana
 public | alert_id_seq                      | sequence | grafana
 public | alert_image                       | table    | grafana
 public | alert_image_id_seq                | sequence | grafana
 public | alert_instance                    | table    | grafana
 public | alert_notification                | table    | grafana
 public | alert_notification_id_seq         | sequence | grafana
 public | alert_notification_state          | table    | grafana

I assume that’s not an all-inclusive list of tables in the grafana schema, but do you see a dashboard table or a permission table? Those would be indicators that the Grafana DB has been prepped to handle all the Grafana-specific settings. (The alert tables may be indicators too, but I know we built our own alerting feature before Grafana did, so I don’t want to assume.)
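To check for specific tables without scrolling through the full list, the same psql output can be searched by name. A minimal sketch, assuming unaligned, tuples-only output as produced by psql -At; `has_table` is a hypothetical helper name, and the table names are the ones mentioned above.

```shell
# Succeed (exit 0) if a table with the given name appears in the
# pipe-separated `psql -At -c '\d'` output read on stdin.
# $1 = table name to look for. Sequences with the same name do not count.
# (has_table is a hypothetical helper name.)
has_table() {
    awk -F'|' -v name="$1" '$2 == name && $3 == "table" { found = 1 } END { exit !found }'
}

# Usage inside the container:
#   for t in dashboard permission; do
#       psql -U grafana -At -c '\d' | has_table "$t" && echo "$t: present" || echo "$t: missing"
#   done
```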

It may be as simple as just needing to add

type = postgres
host = localhost
user = grafana
password = grafana

to the database section of your /etc/grafana/grafana.ini

and restarting grafana with supervisorctl restart grafana
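The edit and restart could be scripted roughly as follows. This is a sketch, not an official procedure: `with_pg_settings` is a hypothetical helper, the four values are the ones quoted in this thread, and it assumes the [database] header starts at the beginning of its line. It prints the modified file to stdout so the original can be kept as a backup.

```shell
# Print a copy of an INI file with the PostgreSQL settings from this thread
# added directly under [database]. If that section already has an active
# "type =" line, the file is printed unchanged.
# $1 = path to grafana.ini (with_pg_settings is a hypothetical helper name).
with_pg_settings() {
    awk '
        FNR == NR {                       # pass 1: look for an active type= in [database]
            if ($0 ~ /^[[:space:]]*\[/) sec = $0
            if (sec ~ /^\[database\]/ && $0 ~ /^[[:space:]]*type[[:space:]]*=/) has_type = 1
            next
        }
        { print }                         # pass 2: copy every line through
        /^\[database\]/ && !has_type {    # and add the settings right after the header
            print "type = postgres"
            print "host = localhost"
            print "user = grafana"
            print "password = grafana"
        }
    ' "$1" "$1"
}

# Usage inside the container (keeps a backup of the original file):
#   cp /etc/grafana/grafana.ini /etc/grafana/grafana.ini.bak
#   with_pg_settings /etc/grafana/grafana.ini.bak > /etc/grafana/grafana.ini
#   supervisorctl restart grafana
```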

There is also a scenario where you just have a permission issue on the SQLite DB (it would be at /srv/grafana/grafana.db), which should have owner/group grafana/grafana. It’s possible the permissions were wrong, preventing both Grafana from starting and the migration from happening (just a guess).
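A quick way to test the ownership guess is to compare the file's owner and group against the expected pair. The sketch below assumes GNU stat (the `-c` option), which is typical on Linux; `owner_group` and `owned_by` are hypothetical helper names.

```shell
# Print "owner:group" for a file. Uses GNU stat (-c), as found on most Linux systems.
# (owner_group and owned_by are hypothetical helper names.)
owner_group() {
    stat -c '%U:%G' "$1"
}

# Succeed only if the file is owned by the expected "owner:group".
# $1 = path, $2 = expected owner:group (for example grafana:grafana)
owned_by() {
    [ "$(owner_group "$1")" = "$2" ]
}

# Usage inside the container:
#   owned_by /srv/grafana/grafana.db grafana:grafana || owner_group /srv/grafana/grafana.db
```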

Thanks @steve.hoffman for the quick response.

  1. Updated the Grafana config file /etc/grafana/grafana.ini as mentioned above.
  2. Restarted all services using supervisorctl.
  3. This kicked off an upgrade to 2.42.0:
supervisorctl status
alertmanager                     RUNNING   pid 1729054, uptime 0:07:32
clickhouse                       RUNNING   pid 1728551, uptime 0:07:36
dbaas-controller                 RUNNING   pid 1722611, uptime 0:13:15
grafana                          RUNNING   pid 1729865, uptime 0:06:58
nginx                            RUNNING   pid 1729108, uptime 0:07:31
pmm-agent                        RUNNING   pid 1730215, uptime 0:06:46
pmm-managed                      RUNNING   pid 1730169, uptime 0:06:46
pmm-update-perform               EXITED    Jun 25 10:10 PM
pmm-update-perform-init          EXITED    Jun 25 10:10 PM
postgresql                       RUNNING   pid 1722600, uptime 0:13:15
prometheus                       FATAL     Exited too quickly (process log may have details)
qan-api2                         RUNNING   pid 1729241, uptime 0:07:28
victoriametrics                  RUNNING   pid 1722617, uptime 0:13:15
vmalert                          RUNNING   pid 1722618, uptime 0:13:15
vmproxy                          RUNNING   pid 1722628, uptime 0:13:15
 

  4. The dashboard is accessible. However, I can’t run any pmm-admin commands.

PMM Server:
URL : https://<X.X.X.X>
Version: 2.42.0

PMM Client:
Connected : true
Time drift : 84.66µs
Latency : 225.309µs
Connection uptime: 95.55
pmm-admin version: 2.41.2
pmm-agent version: 2.41.2

# pmm-admin inventory list services

invalid API key. Please check username and password.

Also, before that I tried to modify the ownership of /srv/grafana/grafana.db, but it didn’t work. Moreover, this file was deleted after the upgrade.

ls -ltr /srv/grafana/grafana.db
-rw-r-----. 1 grafana render 10928128 Jun 25 21:47 /srv/grafana/grafana.db
 # chown grafana:grafana /srv/grafana/grafana.db
 # 

Wow, now it looks much better :)

FYI: Running pmm-admin list or its longer equivalent pmm-admin inventory list services is expected to yield an error ...Invalid API key, since the agent config located at /usr/local/percona/pmm2/config/pmm-agent.yaml does not contain credentials.
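Whether the agent config carries credentials can be checked with a simple search. This sketch assumes, without confirmation from this thread, that credentials would appear as indented `username:` and `password:` keys with non-empty values in pmm-agent.yaml; `has_server_credentials` is a hypothetical helper name.

```shell
# Succeed if the YAML file contains an indented username: or password: key
# with a non-empty value. The key names are an assumption about the layout
# of pmm-agent.yaml; adjust them if your file differs.
# $1 = path to the config file (has_server_credentials is a hypothetical name).
has_server_credentials() {
    grep -Eq '^[[:space:]]+(username|password):[[:space:]]*[^[:space:]]' "$1"
}

# Usage on the client host:
#   has_server_credentials /usr/local/percona/pmm2/config/pmm-agent.yaml \
#       && echo "credentials present" || echo "no credentials found"
```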

However, if you are able to see the inventory page at https://your-pmm-server/graph/inventory/services, then we should be good.

Also, consider stopping the prometheus job with supervisorctl stop prometheus, as it likely prevents the PMM container from reaching a healthy status.

Alex