Percona PMM query analytics is broken

Description:

We are using PMM to monitor our databases. Was using version PMM 3.2.0 & even upgraded to latest version 3.3.1. We see that we are getting below error when we try to look for QAN. And this is persistent across old version and new version of PMM.

Steps to Reproduce:

[Step-by-step instructions on how to reproduce the issue, including any specific settings or configurations]

Version:

3.3.1

To be clear, this error we got directly even clicking on PMM QAN. We are not selecting anything.

More information -

In the qan-api2.log logs, we see below error

time=“2025-08-26T06:12:22.254+00:00” level=info msg=“Log level: info.”
time=“2025-08-26T06:12:22.254+00:00” level=info msg=“DSN: clickhouse://127.0.0.1:9000?database=pmm&block_size=10000&pool_size=2” component=main
stdlog: dsn: clickhouse://127.0.0.1:9000?database=pmm&block_size=10000&pool_size=2
stdlog: Migrations: no migration found for version 20: read down for version 20 migrations/sql: file does not exist
stdlog: qan-api2 v2.44.0.
time=“2025-08-26T06:12:41.620+00:00” level=info msg=“Log level: info.”
time=“2025-08-26T06:12:41.620+00:00” level=info msg=“DSN: clickhouse://127.0.0.1:9000?database=pmm&block_size=10000&pool_size=2” component=main
stdlog: dsn: clickhouse://127.0.0.1:9000?database=pmm&block_size=10000&pool_size=2
stdlog: Migrations: no migration found for version 20: read down for version 20 migrations/sql: file does not exist
stdlog: qan-api2 v2.44.0.
time=“2025-08-26T06:13:01.669+00:00” level=info msg=“Log level: info.”
time=“2025-08-26T06:13:01.669+00:00” level=info msg=“DSN: clickhouse://127.0.0.1:9000?database=pmm&block_size=10000&pool_size=2” component=main
stdlog: dsn: clickhouse://127.0.0.1:9000?database=pmm&block_size=10000&pool_size=2
stdlog: Migrations: no migration found for version 20: read down for version 20 migrations/sql: file does not exist
stdlog: qan-api2 v2.44.0.
time=“2025-08-26T06:13:22.805+00:00” level=info msg=“Log level: info.”
time=“2025-08-26T06:13:22.805+00:00” level=info msg=“DSN: clickhouse://127.0.0.1:9000?database=pmm&block_size=10000&pool_size=2” component=main
stdlog: dsn: clickhouse://127.0.0.1:9000?database=pmm&block_size=10000&pool_size=2
stdlog: Migrations: no migration found for version 20: read down for version 20 migrations/sql: file does not exist
stdlog: qan-api2 v2.44.0.
time=“2025-08-26T06:13:44.968+00:00” level=info msg=“Log level: info.”
time=“2025-08-26T06:13:44.968+00:00” level=info msg=“DSN: clickhouse://127.0.0.1:9000?database=pmm&block_size=10000&pool_size=2” component=main
stdlog: dsn: clickhouse://127.0.0.1:9000?database=pmm&block_size=10000&pool_size=2
stdlog: Migrations: no migration found for version 20: read down for version 20 migrations/sql: file does not exist
stdlog: qan-api2 v2.44.0.
time=“2025-08-26T06:14:08.017+00:00” level=info msg=“Log level: info.”
time=“2025-08-26T06:14:08.017+00:00” level=info msg=“DSN: clickhouse://127.0.0.1:9000?database=pmm&block_size=10000&pool_size=2” component=main
stdlog: dsn: clickhouse://127.0.0.1:9000?database=pmm&block_size=10000&pool_size=2
stdlog: Migrations: no migration found for version 20: read down for version 20 migrations/sql: file does not exist
stdlog: qan-api2 v2.44.0.
time=“2025-08-26T06:14:32.247+00:00” level=info msg=“Log level: info.”
time=“2025-08-26T06:14:32.247+00:00” level=info msg=“DSN: clickhouse://127.0.0.1:9000?database=pmm&block_size=10000&pool_size=2” component=main
stdlog: dsn: clickhouse://127.0.0.1:9000?database=pmm&block_size=10000&pool_size=2
stdlog: Migrations: no migration found for version 20: read down for version 20 migrations/sql: file does not exist
stdlog: qan-api2 v2.44.0.
time=“2025-08-26T06:14:57.363+00:00” level=info msg=“Log level: info.”
time=“2025-08-26T06:14:57.363+00:00” level=info msg=“DSN: clickhouse://127.0.0.1:9000?database=pmm&block_size=10000&pool_size=2” component=main
stdlog: dsn: clickhouse://127.0.0.1:9000?database=pmm&block_size=10000&pool_size=2
stdlog: Migrations: no migration found for version 20: read down for version 20 migrations/sql: file does not exist
stdlog: qan-api2 v3.2.0.
time=“2025-08-26T06:15:36.466+00:00” level=info msg=“Log level: info.”
time=“2025-08-26T06:15:36.466+00:00” level=info msg=“DSN: clickhouse://default:xxxxx@127.0.0.1:9000/pmm” component=main
stdlog: Connection: dial tcp 127.0.0.1:9000: connect: connection refused
stdlog: qan-api2 v3.2.0.
time=“2025-08-26T06:15:37.718+00:00” level=info msg=“Log level: info.”
time=“2025-08-26T06:15:37.718+00:00” level=info msg=“DSN: clickhouse://default:xxxxx@127.0.0.1:9000/pmm” component=main
stdlog: Connection: dial tcp 127.0.0.1:9000: connect: connection refused
stdlog: qan-api2 v3.2.0.
time=“2025-08-26T06:15:39.760+00:00” level=info msg=“Log level: info.”
time=“2025-08-26T06:15:39.760+00:00” level=info msg=“DSN: clickhouse://default:xxxxx@127.0.0.1:9000/pmm” component=main
stdlog: Connection: dial tcp 127.0.0.1:9000: connect: connection refused
stdlog: qan-api2 v3.2.0.
time=“2025-08-26T06:15:42.925+00:00” level=info msg=“Log level: info.”
time=“2025-08-26T06:15:42.925+00:00” level=info msg=“DSN: clickhouse://default:xxxxx@127.0.0.1:9000/pmm” component=main
stdlog: Connection: dial tcp 127.0.0.1:9000: connect: connection refused
stdlog: qan-api2 v3.2.0.
time=“2025-08-26T06:15:46.971+00:00” level=info msg=“Log level: info.”
time=“2025-08-26T06:15:46.971+00:00” level=info msg=“DSN: clickhouse://default:xxxxx@127.0.0.1:9000/pmm” component=main
stdlog: Connection: dial tcp 127.0.0.1:9000: connect: connection refused
stdlog: qan-api2 v3.2.0.
time=“2025-08-26T06:15:52.002+00:00” level=info msg=“Log level: info.”
time=“2025-08-26T06:15:52.002+00:00” level=info msg=“DSN: clickhouse://default:xxxxx@127.0.0.1:9000/pmm” component=main
stdlog: Connection: dial tcp 127.0.0.1:9000: connect: connection refused
stdlog: qan-api2 v3.2.0.
time=“2025-08-26T06:15:58.116+00:00” level=info msg=“Log level: info.”
time=“2025-08-26T06:15:58.116+00:00” level=info msg=“DSN: clickhouse://default:xxxxx@127.0.0.1:9000/pmm” component=main
stdlog: Connection: dial tcp 127.0.0.1:9000: connect: connection refused
stdlog: qan-api2 v3.2.0.

Update :

During the investigation, we saw that the click house was not running inside docker.

and upon checking logs -

2025.08.26 09:22:43.008963 [ 70154 ] {}  Application: Caught exception while loading metadata: Code: 76. DB::ErrnoException: Cannot open file /srv/clickhouse/store/3c8/3c856d25-599f-4ecb-a263-2aedc01ec43e/asynchronous_metric_log.sql, errno: 13, strerror: Permission denied: Cannot parse definition from metadata file /srv/clickhouse/store/3c8/3c856d25-599f-4ecb-a263-2aedc01ec43e/asynchronous_metric_log.sql. (CANNOT_OPEN_FILE), Stack trace (when copying this message, always include the lines below):2025.08.26 09:22:43.924935 [ 70154 ] {}  Application: Code: 76. DB::ErrnoException: Cannot open file /srv/clickhouse/store/3c8/3c856d25-599f-4ecb-a263-2aedc01ec43e/asynchronous_metric_log.sql, errno: 13, strerror: Permission denied: Cannot parse definition from metadata file /srv/clickhouse/store/3c8/3c856d25-599f-4ecb-a263-2aedc01ec43e/asynchronous_metric_log.sql. (CANNOT_OPEN_FILE), Stack trace (when copying this message, always include the lines below):2025.08.26 09:22:43.925044 [ 70154 ] {}  Application: DB::ErrnoException: Cannot open file /srv/clickhouse/store/3c8/3c856d25-599f-4ecb-a263-2aedc01ec43e/asynchronous_metric_log.sql, errno: 13, strerror: Permission denied: Cannot parse definition from metadata file /srv/clickhouse/store/3c8/3c856d25-599f-4ecb-a263-2aedc01ec43e/asynchronous_metric_log.sql

and fix we applied -

chown -R pmm:pmm /srv/clickhouse/

can someone from percona help us on how this has had happened?

Hi

What is the output of supervisorctl status? Is ClickHouse running after the fix you applied?

If not, can you try to bring it up via supervisorctl start clickhouse command?

Cheers

Hi Alex,

Thank you for reply.

After applying the fix, the clickhouse was running again and I was able to see QAN.

Current output -

[root@eb6e746ea3ba opt] # supervisorctl status
clickhouse RUNNING pid 70889, uptime 3:36:23
grafana RUNNING pid 429, uptime 6:36:15
nginx RUNNING pid 16, uptime 6:36:26
nomad-server STOPPED Not started
pmm-agent RUNNING pid 594, uptime 6:36:01
pmm-init EXITED Aug 26 06:23 AM
pmm-managed RUNNING pid 25, uptime 6:36:26
postgresql RUNNING pid 13, uptime 6:36:26
qan-api2 RUNNING pid 70764, uptime 3:36:49
victoriametrics RUNNING pid 581, uptime 6:36:02
vmalert RUNNING pid 18, uptime 6:36:26
vmproxy RUNNING pid 19, uptime 6:36:26

But the problem was, we had an incident this morning and wanted to look in to QAN when we realized that QAN is no more working.

No one has modified or touch this docker since last upgrade. The only that we changed after we upgrade last time was, using our own SSL certificates.

I have logs from PMM that i can share with you but can do that privately over email or link. (due to obvious reasons)

Glad to hear the issue was resolved!

I suggest you check the container health or run supervisorctl status after each upgrade.

No, we will never ask to share the logs privately :slight_smile:

thank you for the answer.

But can this not be done by upgrade process itself? or documented anywhere ? (sorry if its done already) .

We used upgrade from UI itself and if it says success, probably lot of people will trust it.

I do not rule out there is a bug. However, we don’t capture anything like that in our automated tests.

And I don’t even have a good clue as to what could have happened :frowning: