Documentation gap: PMM_CLICKHOUSE_IS_CLUSTER / PMM_CLICKHOUSE_CLUSTER_NAME for HA + external ClickHouse

Hi team!

I’d like to flag a documentation gap that cost us several hours of debugging in a PMM 3.6.0 HA setup with an external 3-node ClickHouse cluster (one shard, three replicas, ZooKeeper present, macros configured).

Symptom: PostgreSQL QAN data appeared on only one ClickHouse node — the one matching PMM_CLICKHOUSE_ADDR of the currently-active pmm-managed instance. After a PMM HA leader change, fresh data started landing on a different CH node, while Grafana (pinned to a fixed CH datasource) silently showed “No data”.

Root cause: qan-api2 creates pmm.metrics as ReplicatedMergeTree only when both PMM_CLICKHOUSE_IS_CLUSTER=true and PMM_CLICKHOUSE_CLUSTER_NAME (set to the ClickHouse cluster name) are present. Without them, the migration produces a plain MergeTree table, which is local to whichever CH node ran the migration first, even if the underlying CH cluster is fully cluster-aware. References: qan-api2/main.go (kingpin flags), qan-api2/db.go (NewDB / IsClickhouseClusterReady), qan-api2/migrations/migrations.go (GetEngine(isCluster)).
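To make the rule concrete, here is a minimal sketch of the decision as I understand it. It is written in Python purely for illustration (qan-api2 itself is Go), and the names select_engine, needs_rebuild and ENGINE_CHECK_SQL are mine, not anything from the PMM code base. The query uses ClickHouse's standard system.tables table and can be run on each node to see which engine pmm.metrics actually got.

```python
# Illustrative only: mirrors the behaviour described above, not the actual qan-api2 code.

# Run this on every ClickHouse node to see which engine the table was created with.
ENGINE_CHECK_SQL = (
    "SELECT engine FROM system.tables "
    "WHERE database = 'pmm' AND name = 'metrics'"
)


def select_engine(is_cluster: bool, cluster_name: str) -> str:
    """Return the engine family the migration would pick under the rule above.

    Both settings must be present; either one missing falls back to MergeTree.
    """
    if is_cluster and cluster_name:
        return "ReplicatedMergeTree"
    return "MergeTree"


def needs_rebuild(actual_engine: str, is_cluster: bool, cluster_name: str) -> bool:
    """True if the existing table's engine does not match the current settings."""
    actual_is_replicated = actual_engine.startswith("Replicated")
    wanted_is_replicated = select_engine(is_cluster, cluster_name) == "ReplicatedMergeTree"
    return actual_is_replicated != wanted_is_replicated


if __name__ == "__main__":
    # Our situation: the table was created as MergeTree, then the flags were set.
    print(needs_rebuild("MergeTree", True, "pmm_cluster"))
```

In our case the query returned MergeTree on the one node that had the table, which is what pointed us at the missing variables.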

Where the docs fall short:

The “Use external ClickHouse with PMM” page does not mention either variable, so a user setting up an external CH cluster from this page has no way to know cluster mode exists.

The “Install PMM HA Cluster” page mentions PMM_CLICKHOUSE_IS_CLUSTER: "1" once, in a YAML snippet under “Pre-configured HA variables” with the comment # Enable ClickHouse clustering. It never explains what “clustering” means here, and it never mentions PMM_CLICKHOUSE_CLUSTER_NAME at all, even though that variable is equally required.

The two pages aren’t cross-linked, so a user combining HA PMM with an external CH cluster is unlikely to discover the second variable.

Suggested improvement:

List both variables together on the external-ClickHouse reference page with a short note that they control whether qan-api2 migrations create ReplicatedMergeTree vs plain MergeTree.
State that this is a one-shot decision: changing the flags after tables already exist requires dropping the pmm database and re-running migrations (golang-migrate doesn’t retro-convert engines).
Cross-link the external-ClickHouse and HA-cluster pages.
Optional: have qan-api2 log a WARN at startup if it sees a multi-replica cluster in system.clusters but PMM_CLICKHOUSE_IS_CLUSTER is not set. That would catch the misconfiguration without any docs change.
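For the optional startup check, the logic could be as small as the sketch below. Again this is Python for illustration only, not a patch against qan-api2, and the function names and log message are hypothetical. It assumes the caller has already fetched (cluster, shard_num, replica_num) rows from system.clusters, which are real columns of that table. A real implementation would probably need to ignore any sample clusters left in the server config to avoid false positives.

```python
import logging
from collections import defaultdict
from typing import Iterable, List, Tuple

log = logging.getLogger("qan-cluster-check")  # hypothetical logger name

# One row per replica: (cluster, shard_num, replica_num), e.g. the result of
#   SELECT cluster, shard_num, replica_num FROM system.clusters
ClusterRow = Tuple[str, int, int]


def multi_replica_clusters(rows: Iterable[ClusterRow]) -> List[str]:
    """Return the names of clusters that have a shard with more than one replica."""
    replicas = defaultdict(set)
    for cluster, shard_num, replica_num in rows:
        replicas[(cluster, shard_num)].add(replica_num)
    return sorted({cluster for (cluster, _), reps in replicas.items() if len(reps) > 1})


def warn_if_misconfigured(rows: Iterable[ClusterRow], is_cluster: bool) -> bool:
    """Log a warning and return True when replicas exist but cluster mode is off."""
    clusters = multi_replica_clusters(rows)
    if clusters and not is_cluster:
        log.warning(
            "ClickHouse reports multi-replica cluster(s) %s but "
            "PMM_CLICKHOUSE_IS_CLUSTER is not set; QAN tables will be node-local",
            ", ".join(clusters),
        )
        return True
    return False


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    # Our topology: one shard, three replicas ("pmm_cluster" is a made-up name).
    rows = [("pmm_cluster", 1, 1), ("pmm_cluster", 1, 2), ("pmm_cluster", 1, 3)]
    warn_if_misconfigured(rows, is_cluster=False)
```

With a check like this, our setup would have produced a warning on the first start instead of silently creating a node-local table.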

Thanks!