PMM Fails to start after upgrade to 2.37.1

I am having trouble getting metrics after the upgrade; the graphs keep showing a 502 error.

Running docker exec f8d2d0043747 supervisorctl status, I see that some services are not starting:

alertmanager                     RUNNING   pid 25, uptime 1:03:15
clickhouse                       RUNNING   pid 15, uptime 1:03:15
dbaas-controller                 STOPPED   Not started
grafana                          RUNNING   pid 19, uptime 1:03:15
nginx                            RUNNING   pid 20, uptime 1:03:15
pmm-agent                        RUNNING   pid 33, uptime 1:03:15
pmm-managed                      RUNNING   pid 29, uptime 1:03:15
pmm-update-perform               STOPPED   Not started
pmm-update-perform-init          EXITED    Jun 15 12:59 PM
postgresql                       RUNNING   pid 14, uptime 1:03:15
prometheus                       STOPPED   Not started
qan-api2                         RUNNING   pid 279, uptime 1:03:08
victoriametrics                  FATAL     Exited too quickly (process log may have details)
vmalert                          RUNNING   pid 22, uptime 1:03:15
vmproxy                          RUNNING   pid 27, uptime 1:03:15

If I run the same command a second time, I get:

Error: error creating OCI runtime exit file path /var/lib/containers/storage/overlay-containers/f8d2d00437477197ab14a38fe3dafd2f42a046b30efcbe46fb286565d877a28e/userdata/38a46efc2e9bd7f9d9eed3c00f3f67304ff741559a22742c1ab4360a1d31002c/exit: mkdir /var/lib/containers/storage/overlay-containers/f8d2d00437477197ab14a38fe3dafd2f42a046b30efcbe46fb286565d877a28e/userdata/38a46efc2e9bd7f9d9eed3c00f3f67304ff741559a22742c1ab4360a1d31002c/exit: structure needs cleaning

Any idea what is going on here and how to resolve it?

Regarding the victoriametrics service status, FATAL Exited too quickly (process log may have details):

Please check which error appears in the victoriametrics log under /srv/logs/ inside the pmm-server container.

Also, from which PMM version did you upgrade to 2.37.1?
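
When the log is long, it can help to print only the entries that stopped the process. The helper below is a minimal sketch, not an official PMM tool: fatal_lines is a hypothetical name, and it assumes the second whitespace-separated field of each log line is the level (info, fatal, panic). The log file name in the usage comment is also an assumption.

```shell
# Print only the panic/fatal entries from a log read on stdin.
# Assumption: field 2 of each line is the log level.
fatal_lines() {
    awk '$2 == "panic" || $2 == "fatal"'
}

# Hypothetical usage from the Docker host (file name is an assumption):
#   docker exec f8d2d0043747 cat /srv/logs/victoriametrics.log | fatal_lines
```

The lines are printed unchanged, so the full error text and timestamp are preserved.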

Hi,

Did you check the free space left on the machine where PMM is running? If there is enough disk space, could you try to start the failed process with docker exec f8d2d0043747 supervisorctl start victoriametrics?
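
For the free-space check, something along these lines could be used. It is only a sketch: disk_used_pct is a hypothetical helper name, and the paths in the usage comments are assumptions that should be replaced with wherever the PMM data actually lives.

```shell
# Print the percentage of space used on the filesystem that holds a path.
# Uses POSIX `df -P` so each filesystem is reported on a single line.
disk_used_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical usage (paths are assumptions):
#   on the host:           disk_used_pct /var/lib/containers/storage
#   inside the container:  disk_used_pct /srv
```

A value close to 100 would point to a full disk rather than a problem with the service itself.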

I gave this a try and all I get is

victoriametrics: ERROR (spawn error)

Hello Lalit,

The log shows this error:

$
2023-06-16T09:08:10.029Z        panic   /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.89.1/app/vmselect/netstorage/tmp_blocks_file.go:26  FATAL: cannot create "/srv/victoria$
panic: FATAL: cannot create "/srv/victoriametrics/data/tmp/searchResults": mkdir /srv/victoriametrics/data/tmp/searchResults: structure needs cleaning

Any help appreciated.

Here is more of the log, in case it helps:

2023-06-19T16:11:29.135Z        info    /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.89.1/lib/mergeset/table.go:404      inmemory parts have been successfully flushed to files in 0.000 seconds at "/srv/victoriametrics/data/indexdb/17393B3636185011"
2023-06-19T16:11:29.135Z        info    /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.89.1/lib/mergeset/table.go:406      waiting for flush callback worker to stop on "/srv/victoriametrics/data/indexdb/17393B3636185011"...
2023-06-19T16:11:29.135Z        info    /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.89.1/lib/mergeset/table.go:409      flush callback worker stopped in 0.000 seconds on "/srv/victoriametrics/data/indexdb/17393B3636185011"
2023-06-19T16:11:29.135Z        info    /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.89.1/lib/mergeset/table.go:396      waiting for background workers to stop on "/srv/victoriametrics/data/indexdb/17393B3636185010"...
2023-06-19T16:11:29.135Z        info    /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.89.1/lib/mergeset/table.go:399      background workers stopped in 0.000 seconds on "/srv/victoriametrics/data/indexdb/17393B3636185010"
2023-06-19T16:11:29.135Z        info    /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.89.1/lib/mergeset/table.go:401      flushing inmemory parts to files on "/srv/victoriametrics/data/indexdb/17393B3636185010"...
2023-06-19T16:11:29.136Z        info    /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.89.1/lib/mergeset/table.go:404      inmemory parts have been successfully flushed to files in 0.000 seconds at "/srv/victoriametrics/data/indexdb/17393B3636185010"
2023-06-19T16:11:29.136Z        info    /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.89.1/lib/mergeset/table.go:406      waiting for flush callback worker to stop on "/srv/victoriametrics/data/indexdb/17393B3636185010"...
2023-06-19T16:11:29.136Z        info    /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.89.1/lib/mergeset/table.go:409      flush callback worker stopped in 0.000 seconds on "/srv/victoriametrics/data/indexdb/17393B3636185010"
2023-06-19T16:11:29.137Z        fatal   /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.89.1/app/vmstorage/main.go:113      cannot open a storage at /srv/victoriametrics/data with -retentionPeriod=14d: cannot open table at "/srv/victoriametrics/data/data": cannot open partitions in the table "/srv/victoriametrics/data/data": cannot open partition "2023_05": cannot open big parts from "/srv/victoriametrics/data/data/big/2023_05": cannot create directories for partition "/srv/victoriametrics/data/data/big/2023_05": cannot create tmp directory "/srv/victoriametrics/data/data/big/2023_05/tmp": mkdir /srv/victoriametrics/data/data/big/2023_05/tmp: structure needs cleaning

The last line of the log is what jumps out at me.

The error cannot open table at "/srv/victoriametrics/data/data" (and the subsequent ones) seems to hint at some sort of data corruption.

The first thing I’d do is take a look at that directory to see whether the structure exists:
docker exec -it <pmm-server-name> bash
ls -al /srv/victoriametrics/data/data
(take note of the owner and group; both should be pmm)
Assuming the big/small/lock files are present, I’d start by looking at what’s in ./big/2023_05; there should be a tmp and a txn folder in there. Again, check the permissions on each directory, as I suspect that something changed the ownership or permissions (all directories should be 755 and files 644, with pmm as owner and group).
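
Rather than reading the ls output entry by entry, the ownership and mode check could be scripted roughly as follows. This is a sketch under assumptions: check_tree is a hypothetical helper, it relies on GNU find (for -printf) being available inside the container, and it takes the expected values (755, 644, owner pmm) from the description above.

```shell
# List entries under a directory whose mode or owner differs from what is expected.
# $1 = directory to scan, $2 = expected owner (name or numeric UID, default: pmm).
check_tree() {
    dir="$1"
    owner="${2:-pmm}"
    # Directories whose mode is not exactly 755
    find "$dir" -type d ! -perm 755 -printf 'dir-mode %m %p\n'
    # Regular files whose mode is not exactly 644
    find "$dir" -type f ! -perm 644 -printf 'file-mode %m %p\n'
    # Anything not owned by the expected user
    find "$dir" ! -user "$owner" -printf 'owner %u %p\n'
}

# Hypothetical usage inside the pmm-server container:
#   check_tree /srv/victoriametrics/data/data
```

No output means everything under the directory matches the expected modes and owner. Errors printed by find itself while walking the tree are worth noting too, since they would suggest a problem below the permission level.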

It may be possible to manually create the missing directories and restart PMM, but we need to understand why they are missing first.