Hi team,
I’ve been running a 3-node PMM Server HA cluster (PMM 3.6.0, Docker, deployed per the official HA guide with shared external PostgreSQL and ClickHouse) and ran into something that looks like a regression compared to PMM 2.x — we operated essentially the same topology there for months without ever seeing it.
After a routine reconnect of pmm-agents, QAN data and postgres_exporter metrics started disappearing for several PostgreSQL services. In the agent logs we saw:
pq: password authentication failed for user "AQ+rKT/93psPSlwWLR8Qb0zsQqLDdhfXYNB9EBYk+507mw1Y"
The “username” here is actually a base64-encoded Tink AEAD ciphertext — i.e. the encrypted value from the agents table being shipped to pmm-agent without being decrypted. On the follower nodes, pmm-managed.log was full of:
level=warning msg="decryption: aead_factory: decryption failed"
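For anyone trying to confirm the same symptom: as far as I understand the Tink wire format, ciphertexts produced with the default output prefix start with a 0x01 byte followed by a 4-byte big-endian key ID. The sketch below is only a heuristic built on that assumption (the function name is mine, not anything from PMM), and a real username could in principle decode the same way by coincidence.

```python
import base64
import binascii
from typing import Optional

TINK_PREFIX_BYTE = 0x01  # assumed: Tink's default "TINK" output prefix
TINK_PREFIX_LEN = 5      # 1 prefix byte + 4-byte big-endian key ID


def tink_key_id(value: str) -> Optional[int]:
    """Return the key ID if `value` looks like base64 Tink ciphertext, else None."""
    try:
        raw = base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        return None  # not valid base64, so almost certainly a plain username
    if len(raw) <= TINK_PREFIX_LEN or raw[0] != TINK_PREFIX_BYTE:
        return None
    return int.from_bytes(raw[1:TINK_PREFIX_LEN], "big")
```

If the "username" in the pq error yields a key ID here, the agent was most likely handed ciphertext rather than a real credential.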
After a bit of digging, it looks like the root cause is that each pmm-server node generates its own /srv/pmm-encryption.key on first start, but in HA mode all three nodes share one Postgres DB. So rows in the agents table get encrypted with the active node’s key, and the followers, having a different local key, cannot decrypt them. The failure only surfaces later, when an agent reconnects to a follower and receives still-encrypted credentials in SetState; until then the only sign is a warning log line on the follower.
The fix in our case was straightforward: copy /srv/pmm-encryption.key from the active node to the two followers and restart their containers. QAN data started flowing again within a minute.
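To double-check that the copy worked, something like the following can compare the key files byte-for-byte via their SHA-256 digests. It is a minimal sketch that assumes you have already pulled each node's /srv/pmm-encryption.key to local paths of your choosing; the function names and paths are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def key_fingerprint(path: str) -> str:
    """SHA-256 hex digest of a key file, safe to paste into a ticket."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def keys_match(paths: Iterable[str]) -> bool:
    """True only if at least one path is given and all files are identical."""
    return len({key_fingerprint(p) for p in paths}) == 1
```

Calling keys_match() with the three local copies should return True once every node carries the active node's key. Comparing digests rather than the raw files avoids printing the key itself anywhere.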
A few small suggestions, if this matches what others may be hitting:
- It seems the HA setup guide doesn’t currently mention /srv/pmm-encryption.key at all. It would probably help future operators if the doc had a short note like “In HA deployments, all pmm-server nodes must share the same /srv/pmm-encryption.key. Generate it once and place it on every node before starting their containers.”
- The good news is that the building block for this already exists: PMM-14429 / PR #4683 added a --generate-key flag to pmm-encryption-rotation that prints a fresh key to stdout without touching the database. That looks like exactly what an HA bootstrap step would want; it just doesn’t seem to be referenced from the HA guide today. A short example in the doc using that flag would probably close the loop.
- It might also be worth considering a startup-time sanity check in pmm-managed that tries to decrypt one row from agents and fails loudly if the key doesn’t match. Right now the failure is silent (only a warning) and the cluster keeps shipping ciphertext to agents in place of usernames, which made this pretty hard to track down.
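To make the last suggestion concrete, here is roughly the logic I have in mind. This is an illustration only, not how pmm-managed is structured: the decrypt callable, the exception type, and the function name are all hypothetical stand-ins for whatever the real code uses.

```python
from typing import Callable, Optional


class EncryptionKeyMismatch(RuntimeError):
    """Raised when the local key cannot decrypt data already in the shared DB."""


def check_key_at_startup(
    sample_ciphertext: Optional[str],
    decrypt: Callable[[str], str],
) -> None:
    """Abort startup if one stored encrypted value cannot be decrypted."""
    if sample_ciphertext is None:
        return  # no encrypted rows yet (fresh install), nothing to verify
    try:
        decrypt(sample_ciphertext)
    except Exception as exc:
        # Fail loudly instead of logging a warning and carrying on.
        raise EncryptionKeyMismatch(
            "local encryption key cannot decrypt existing agent credentials; "
            "in HA, every node must use the same key file"
        ) from exc
```

The point is simply that a mismatch should stop the node from serving agents, so a follower never forwards ciphertext as if it were a credential.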
We didn’t see this in PMM 2.x, presumably because at-rest encryption of agent credentials wasn’t in place there yet, so the same HA topology “just worked”. Just flagging in case others run into it.
Thanks!