Possible HA issue with /srv/pmm-encryption.key not being shared between nodes

Hi team,

I’ve been running a 3-node PMM Server HA cluster (PMM 3.6.0, Docker, deployed per the official HA guide with shared external PostgreSQL and ClickHouse) and ran into something that looks like a regression compared to PMM 2.x — we operated essentially the same topology there for months without ever seeing it.

After a routine reconnect of pmm-agents, QAN data and postgres_exporter metrics started disappearing for several PostgreSQL services. In the agent logs we saw:

pq: password authentication failed for user "AQ+rKT/93psPSlwWLR8Qb0zsQqLDdhfXYNB9EBYk+507mw1Y"

The “username” here is actually a base64-encoded Tink AEAD ciphertext — i.e. the encrypted value from the agents table being shipped to pmm-agent without being decrypted. On the follower nodes, pmm-managed.log was full of:

level=warning msg="decryption: aead_factory: decryption failed"
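
For anyone who wants to check whether a strange "username" in their own agent logs is the same thing, here is a minimal Python sketch. It relies on the standard Tink output prefix (one byte 0x01 followed by a 4-byte big-endian key ID) and is only a heuristic: the function name `tink_key_id` is made up for this post, and a real username could in principle decode the same way.

```python
import base64
import binascii

TINK_PREFIX = 0x01       # first byte of Tink's standard "TINK" output prefix
TINK_PREFIX_LEN = 5      # 1 prefix byte + 4-byte big-endian key ID


def tink_key_id(value: str):
    """Return the Tink key ID if value looks like base64 Tink ciphertext, else None."""
    try:
        raw = base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        return None  # not valid base64, so most likely a real username
    if len(raw) <= TINK_PREFIX_LEN or raw[0] != TINK_PREFIX:
        return None  # valid base64, but no Tink prefix
    return int.from_bytes(raw[1:TINK_PREFIX_LEN], "big")


# The "username" from the agent log above.
username = "AQ+rKT/93psPSlwWLR8Qb0zsQqLDdhfXYNB9EBYk+507mw1Y"
key_id = tink_key_id(username)
print("not Tink ciphertext" if key_id is None else f"Tink key ID: {key_id:#010x}")
```

For the value from our log this reports key ID 0x0fab293f, while ordinary names such as "pmm" or "postgres" return None.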

After a bit of digging, it looks like the root cause is that each pmm-server node generates its own /srv/pmm-encryption.key on first start, but in HA mode all three nodes share one Postgres DB. So rows in the agents table get encrypted with the active node’s key, and the followers, each holding a different local key, cannot decrypt them. The symptom only shows up later, when an agent reconnects to a follower and receives still-encrypted credentials in SetState. Until then the failure is silent apart from a warning line in the follower’s log.

The fix in our case was straightforward: copy /srv/pmm-encryption.key from the active node to the two followers and restart their containers. QAN data started flowing again within a minute.
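
Before copying anything, it can help to confirm that the keys really differ. The sketch below compares SHA-256 fingerprints of key files that have already been fetched from each node, so the key material itself never gets printed. The function names and the local file paths in the comment are hypothetical; only /srv/pmm-encryption.key comes from our setup.

```python
import hashlib
from pathlib import Path


def key_fingerprint(path) -> str:
    """SHA-256 hex digest of a key file, so keys can be compared without printing them."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def mismatched_nodes(key_files: dict, reference: str) -> list:
    """Return the nodes whose key file differs from the reference node's key file."""
    expected = key_fingerprint(key_files[reference])
    return sorted(
        node
        for node, path in key_files.items()
        if key_fingerprint(path) != expected
    )


# Example call (hypothetical local copies of each node's /srv/pmm-encryption.key):
#   mismatched_nodes(
#       {"pmm-1": "keys/pmm-1.key", "pmm-2": "keys/pmm-2.key", "pmm-3": "keys/pmm-3.key"},
#       reference="pmm-1",
#   )
```

An empty list means every node holds the same key as the reference node. Any node names returned are the ones that need the key copied over and their container restarted.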

A few small suggestions, if this matches what others may be hitting:

  1. It seems the HA setup guide doesn’t currently mention /srv/pmm-encryption.key at all. It would probably help future operators if the doc had a short note like “In HA deployments, all pmm-server nodes must share the same /srv/pmm-encryption.key. Generate it once and place it on every node before starting their containers.”

  2. The good news is that the building block for this already exists: PMM-14429 / PR #4683 added a --generate-key flag to pmm-encryption-rotation that prints a fresh key to stdout without touching the database. That looks like exactly what an HA bootstrap step would want — it just doesn’t seem to be referenced from the HA guide today. A short example in the doc using that flag would probably close the loop.

  3. It might also be worth considering a startup-time sanity check in pmm-managed that tries to decrypt one row from agents and fails loudly if the key doesn’t match. Right now the failure is silent (only a warning) and the cluster keeps shipping ciphertext to agents in place of usernames, which made this pretty hard to track down.
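
To make suggestion 3 a bit more concrete, here is a rough sketch of the logic in Python. It is not how pmm-managed is implemented: `check_encryption_key`, `KeyMismatchError`, and the `decrypt` callable are all hypothetical stand-ins for whatever the real startup code and AEAD call look like.

```python
class KeyMismatchError(RuntimeError):
    """Raised when the local key cannot decrypt data already stored in the shared DB."""


def check_encryption_key(sample_ciphertext, decrypt):
    """Fail loudly at startup if the local key cannot decrypt a stored value.

    sample_ciphertext: one encrypted value read from the agents table,
                       or None if nothing has been encrypted yet.
    decrypt: callable that returns plaintext or raises on failure
             (a stand-in for the real AEAD decryption call).
    """
    if sample_ciphertext is None:
        return  # nothing to verify, e.g. a brand-new cluster
    try:
        decrypt(sample_ciphertext)
    except Exception as exc:
        # Stop here instead of logging a warning and shipping ciphertext to agents.
        raise KeyMismatchError(
            "local /srv/pmm-encryption.key cannot decrypt existing rows in agents; "
            "refusing to start with a mismatched key"
        ) from exc
```

The point of raising instead of warning is that a follower with the wrong key would never come up and start serving agents, so the mismatch would surface at deploy time and not weeks later on a reconnect.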

We didn’t see this in PMM 2.x, presumably because at-rest encryption of agent credentials wasn’t in place there yet, so the same HA topology “just worked”. Just flagging in case others run into it.

Thanks!

Hi @ddkozyreva,

I agree it’s definitely a miss. We did implement key sharing in the other Helm chart, which we call “Full HA”, but somehow failed to do the same in the old one.

You can take a look at the code:

We will definitely work on a fix for this Helm chart, thanks for highlighting it!