QAN API is down after upgrading the PMM server from 2.44.0 to 3.2.0

Hi Team,

After upgrading the PMM server from 2.44.0 to 3.2.0, the QAN dashboard shows no data, and the PMM Health Check dashboard shows the QAN API in a down state.

NOTE: The upgrade itself completed without errors and ended with the message below.
PMM Server has been successfully setup on this system!

Please find the error details below.

time=“2025-06-18T04:26:27.904+00:00” level=info msg=“RPC /qan.v1beta1.Collector/Collect done in 99.412µs.” request=66eb9ac8-4bfc-11f0-a0c4-0242ac110002
time=“2025-06-18T04:26:28.428+00:00” level=info msg=“Saved 619 buckets in 523.668778ms.” component=data_ingestion
stdlog: Got SIGTERM, shutting down…
time=“2025-06-18T04:26:32.183+00:00” level=info msg=“Server stopped.” component=debug
time=“2025-06-18T04:26:32.183+00:00” level=info msg=“Server stopped.” component=JSON
time=“2025-06-18T04:26:32.185+00:00” level=warning msg=“Closing requests channel.” component=data_ingestion
time=“2025-06-18T04:26:32.185+00:00” level=warning msg=“Requests channel closed, nothing to store.” component=data_ingestion
time=“2025-06-18T04:26:32.185+00:00” level=warning msg=“Requests channel closed, nothing to store.” component=data_ingestion
time=“2025-06-18T04:26:32.185+00:00” level=info msg=“Server stopped.” component=gRPC
time=“2025-06-18T04:26:32.185+00:00” level=info msg=Done. component=main
stdlog: qan-api2 v2.44.0.
time=“2025-06-18T04:27:36.811+00:00” level=info msg=“Log level: info.”
time=“2025-06-18T04:27:36.811+00:00” level=info msg=“DSN: clickhouse://127.0.0.1:9000?database=pmm&block_size=10000&pool_size=2” component=main
stdlog: Connection: dial tcp 127.0.0.1:9000: connect: connection refused
stdlog: qan-api2 v2.44.0.
time=“2025-06-18T04:27:37.844+00:00” level=info msg=“Log level: info.”
time=“2025-06-18T04:27:37.844+00:00” level=info msg=“DSN: clickhouse://127.0.0.1:9000?database=pmm&block_size=10000&pool_size=2” component=main
stdlog: Connection: dial tcp 127.0.0.1:9000: connect: connection refused
stdlog: qan-api2 v2.44.0.
time=“2025-06-18T04:27:39.914+00:00” level=info msg=“Log level: info.”
time=“2025-06-18T04:27:39.914+00:00” level=info msg=“DSN: clickhouse://127.0.0.1:9000?database=pmm&block_size=10000&pool_size=2” component=main
stdlog: Connection: dial tcp 127.0.0.1:9000: connect: connection refused
stdlog: qan-api2 v3.2.0.
time=“2025-06-18T04:27:56.456+00:00” level=info msg=“Log level: info.”
time=“2025-06-18T04:27:56.456+00:00” level=info msg=“DSN: clickhouse://default:xxxxx@127.0.0.1:9000/pmm” component=main
stdlog: Connection: dial tcp 127.0.0.1:9000: connect: connection refused
stdlog: qan-api2 v3.2.0.
time=“2025-06-18T04:27:57.474+00:00” level=info msg=“Log level: info.”
time=“2025-06-18T04:27:57.474+00:00” level=info msg=“DSN: clickhouse://default:xxxxx@127.0.0.1:9000/pmm” component=main
stdlog: Connection: dial tcp 127.0.0.1:9000: connect: connection refused
stdlog: qan-api2 v3.2.0.
time=“2025-06-18T04:27:59.560+00:00” level=info msg=“Log level: info.”
time=“2025-06-18T04:27:59.560+00:00” level=info msg=“DSN: clickhouse://default:xxxxx@127.0.0.1:9000/pmm” component=main
stdlog: qan-api2 v3.2.0.
time=“2025-06-18T04:28:01.552+00:00” level=info msg=“Log level: info.”
time=“2025-06-18T04:28:01.552+00:00” level=info msg=“DSN: clickhouse://default:xxxxx@127.0.0.1:9000/pmm” component=main
:
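The repeated "connection refused" lines above mean nothing was accepting connections on 127.0.0.1:9000 when qan-api2 started. A quick way to check whether that port is reachable now is a plain TCP connect. This is only a minimal sketch: the helper name `port_is_open` is made up for illustration, and it has to run in the same network namespace as ClickHouse (for example inside the PMM Server container, assuming Python is available there).

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 127.0.0.1:9000 is the ClickHouse address taken from the DSN in the log.
    print("ClickHouse port reachable:", port_is_open("127.0.0.1", 9000))
```

A reachable port only shows that ClickHouse is listening. It says nothing about whether the credentials in the DSN are accepted, so the ClickHouse log is still needed.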

Hi, can you share the ClickHouse logs as well?

Hi @nurlan

I see the errors below in the ClickHouse log file.

  1. ? @ 0x00007f40043050fa in ?
  2. ? @ 0x00007f40043894c4 in ?
    (version 23.8.2.7 (official build))
    2025.06.18 07:21:48.301352 [ 142649 ] {} ServerErrorHandler: Code: 516. DB::Exception: default: Authentication failed: password is incorrect, or there
    is no user with such name.

If you have installed ClickHouse and forgot password you can reset it in the configuration file.
The password for default user is typically located at /etc/clickhouse-server/users.d/default-password.xml
and deleting this file will reset the password.
See also /etc/clickhouse-server/users.xml on the server where ClickHouse is installed.

. (AUTHENTICATION_FAILED), Stack trace (when copying this message, always include the lines below):

  1. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000c604bf7 in /usr/bin/clickhouse

  2. DB::Exception::Exception(PreformattedMessage&&, int) @ 0x000000000713cbf1 in /usr/bin/clickhouse

  3. DB::AccessControl::authenticate(DB::Credentials const&, Poco::Net::IPAddress const&) const @ 0x0000000010e3775f in /usr/bin/clickhouse

  4. DB::Session::authenticate(DB::Credentials const&, Poco::Net::SocketAddress const&) @ 0x000000001207f7ed in /usr/bin/clickhouse

  5. DB::TCPHandler::runImpl() @ 0x000000001310afee in /usr/bin/clickhouse

  6. DB::TCPHandler::run() @ 0x000000001311e839 in /usr/bin/clickhouse

  7. Poco::Net::TCPServerConnection::start() @ 0x0000000015b104d4 in /usr/bin/clickhouse

  8. Poco::Net::TCPServerDispatcher::run() @ 0x0000000015b116d1 in /usr/bin/clickhouse

  9. Poco::PooledThread::run() @ 0x0000000015c47f07 in /usr/bin/clickhouse

  10. Poco::ThreadImpl::runnableEntry(void*) @ 0x0000000015c461dc in /usr/bin/clickhouse

  11. ? @ 0x00007f40043050fa in ?

  12. ? @ 0x00007f40043894c4 in ?
    (version 23.8.2.7 (official build))
    2025.06.18 07:21:51.547830 [ 142649 ] {} Access(user directories): from: 127.0.0.1, user: default: Authentication failed: Code: 193. DB::Exception: Invalid credentials. (WRONG_PASSWORD), Stack trace (when copying this message, always include the lines below):

  13. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000c604bf7 in /usr/bin/clickhouse

  14. DB::Exception::Exception<char const (&) [20]>(int, char const (&) [20]) @ 0x0000000007480880 in /usr/bin/clickhouse

  15. DB::IAccessStorage::throwInvalidCredentials() @ 0x0000000010ec3fa3 in /usr/bin/clickhouse

  16. DB::IAccessStorage::authenticateImpl(DB::Credentials const&, Poco::Net::IPAddress const&, DB::ExternalAuthenticators const&, bool, bool, bool) const @ 0x0000000010ec3c8e in /usr/bin/clickhouse
    :
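To see how often these authentication errors occur without reading every stack trace, the log text can be scanned for the two exception codes shown above (516 AUTHENTICATION_FAILED and 193 WRONG_PASSWORD). The following is a rough sketch, not a ClickHouse tool: the function name is hypothetical, and it takes the log text as a string because the log file location is not given in this thread.

```python
import re
from collections import Counter

# Matches "Code: 516. DB::Exception" and captures the numeric code.
CODE_RE = re.compile(r"Code: (\d+)\. DB::Exception")

# Only the two authentication-related codes that appear in the excerpt above.
AUTH_CODES = {"516": "AUTHENTICATION_FAILED", "193": "WRONG_PASSWORD"}


def count_auth_failures(log_text: str) -> Counter:
    """Count authentication-related exception codes in ClickHouse log text."""
    counts = Counter()
    for code in CODE_RE.findall(log_text):
        if code in AUTH_CODES:
            counts[AUTH_CODES[code]] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "ServerErrorHandler: Code: 516. DB::Exception: default: "
        "Authentication failed: password is incorrect\n"
        "Authentication failed: Code: 193. DB::Exception: Invalid credentials."
    )
    print(count_auth_failures(sample))
```

If the counts keep growing while qan-api2 restarts, that would suggest the service is reaching ClickHouse but being rejected at login.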

I even restored the 2.44.0 backup and tried upgrading it to PMM 3.2.0 again, but the issue persists.

I also tested on a new temporary server: I installed a fresh, empty 2.44 server and upgraded it to 3.2.0. That upgrade succeeded, and the ClickHouse API came up fine.

I am not sure why the QAN API fails to start after the 3.2.0 upgrade on a server that was already running with existing data.

Has something changed in the ClickHouse QAN API or ClickHouseDB plugins between 3.1.0 and 3.2.0?

Can someone help me with this issue?

We introduced a password for the ClickHouse user in 3.2.0. The upgrade is supposed to set that password.

In my case, it’s failing to update the password.
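The change is visible in the qan-api2 log from the first post: the 2.44.0 DSN has no credentials, while the 3.2.0 DSN logs in as `default` with a password. A small sketch using only the Python standard library makes the difference explicit. The function name `describe_dsn` is invented for illustration, and the two DSN strings are copied from the log, with the password already masked as `xxxxx` there.

```python
from urllib.parse import parse_qs, urlsplit


def describe_dsn(dsn: str) -> dict:
    """Split a clickhouse:// DSN into the parts relevant for authentication."""
    parts = urlsplit(dsn)
    query = parse_qs(parts.query)
    # The database is either the URL path (3.2.0 style) or a query parameter
    # (2.44.0 style).
    database = parts.path.lstrip("/") or query.get("database", [""])[0]
    return {
        "user": parts.username,
        "has_password": parts.password is not None,
        "host": parts.hostname,
        "port": parts.port,
        "database": database,
    }


if __name__ == "__main__":
    old = "clickhouse://127.0.0.1:9000?database=pmm&block_size=10000&pool_size=2"
    new = "clickhouse://default:xxxxx@127.0.0.1:9000/pmm"
    print("2.44.0:", describe_dsn(old))
    print("3.2.0: ", describe_dsn(new))
```

If ClickHouse does not have the matching password set for `default`, the 3.2.0 DSN would be rejected, which fits the AUTHENTICATION_FAILED errors in the ClickHouse log.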

@nurlan, I have resolved the issue by following the steps outlined below.

Note: Prior to attempting the PMM 3.2.0 upgrade, I had already taken a backup of PMM 2.44.0 from the production server.

  • Production Server: prdpmm101
  • Temporary Server: tmppmm101

After the PMM 3.2.0 upgrade on the production server failed, I decided to restore the PMM 2.44.0 backup to the temporary server, perform the upgrade there, and then migrate the working setup back to the production environment.

Steps taken to resolve the PMM 3.2.0 upgrade issue:

  1. Restored the PMM 2.44.0 backup on the temporary server tmppmm101.
  2. Upgraded the temporary server to PMM 3.2.0.
  3. Verified the status of all services using the PMM Health Check dashboard.
  4. This time, the Query Analytics (QAN) functionality came up without issues.
  5. After confirming the upgrade was successful, I took a backup of PMM 3.2.0 from the temporary server.
  6. Copied the PMM 3.2.0 backup from tmppmm101 to the production server prdpmm101.
  7. Performed a fresh installation of PMM 3.2.0 on the production server.
  8. Restored the PMM 3.2.0 backup to the production server.
  9. Post-restore, I observed that the internal PostgreSQL monitoring was showing issues — specifically, the agent status was displayed as Down on the Service Summary dashboard.
  10. I reported this issue in the Percona forum:
    PMM3 | PostgreSQL | Inside container FATAL: role does not exist
  11. To address the PostgreSQL role issue, it is essential to follow these steps:
    • After the fresh installation of PMM 3.2.0 but before restoring the backup, note down the usernames and passwords from the agents table for both postgres_exporter and qan-postgresql-pgstatements-agent.
  12. Once the backup restoration is complete, update the PostgreSQL agent usernames and passwords on the production server using the values noted in step 11.

UPDATE agents SET username = 'XXXXXXX', password = 'XXXXXXX', updated_at = NOW() WHERE agent_id = 'XXXXXXX' AND agent_type = 'postgres_exporter';

UPDATE agents SET username = 'XXXXXXX', password = 'XXXXXXX', updated_at = NOW() WHERE agent_id = 'XXXXXXX' AND agent_type = 'qan-postgresql-pgstatements-agent';

  13. Verify the PostgreSQL agent status in the PMM Service Summary dashboard.
  14. If all services, including the PostgreSQL monitoring, show as Up, the backup restoration and upgrade process on the production server can be considered successful.
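For anyone repeating the credential fix, the two UPDATE statements can be generated from the values noted before the restore instead of being typed by hand. This is only a sketch under assumptions: the `AgentCredentials` type and `build_updates` function are hypothetical helpers, the table and column names come from the statements above, and the `%s` placeholders assume a PostgreSQL driver that uses that style (psycopg2 does). It builds the statements and parameters but does not connect to any database.

```python
from typing import List, NamedTuple, Tuple


class AgentCredentials(NamedTuple):
    """Values noted from the agents table before restoring the backup."""
    agent_id: str
    agent_type: str
    username: str
    password: str


# Same statement as above, with placeholders instead of inline values.
UPDATE_SQL = (
    "UPDATE agents SET username = %s, password = %s, updated_at = NOW() "
    "WHERE agent_id = %s AND agent_type = %s"
)


def build_updates(noted: List[AgentCredentials]) -> List[Tuple[str, tuple]]:
    """Return one (sql, params) pair per agent, ready for cursor.execute()."""
    return [
        (UPDATE_SQL, (c.username, c.password, c.agent_id, c.agent_type))
        for c in noted
    ]


if __name__ == "__main__":
    # Placeholder values only; use the real ones noted in step 11.
    noted = [
        AgentCredentials("XXXXXXX", "postgres_exporter", "XXXXXXX", "XXXXXXX"),
        AgentCredentials(
            "XXXXXXX", "qan-postgresql-pgstatements-agent", "XXXXXXX", "XXXXXXX"
        ),
    ]
    for sql, params in build_updates(noted):
        print(sql, params)
```

Passing the values as parameters rather than pasting them into the SQL text avoids quoting problems if a password contains a single quote.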