Hi! I have two problems with HA mode on large installations. Both are probably caused by this commit, which added `&& !HA.Enabled` to the condition that previously skipped config validation for external VM setups. I’d like to know why these conditions were added and whether this is a mistake. The two problems are:
- This appears to have accidentally re-enabled a local dry-run validation of the full scrape config on every update. The dry-run spawns a local `victoriametrics` process — which seems unnecessary when using an external VM, since the actual scrape config consumer is `vmagent`, a different binary. On large installations (3000+ agents) this takes several minutes, exceeds the context timeout, and crashes the leader repeatedly, destabilizing Raft quorum.
Proposed fix: skip `validateConfig` when ExternalVM is configured, keeping the `reload` call intact.
- Same commit, likely same root cause. Two conditions in `populateConfig` both have `&& !HA.Enabled` guards. When ExternalVM and HA are both enabled, both conditions evaluate to false, so neither scrape config block gets added. The generated `victoriametrics-promscrape.yml` ends up with only internal jobs and no agent targets, and metrics collection stops silently. Proposed fix: remove the `!HA.Enabled` constraint from both conditions, as HA state doesn’t seem to affect which scrape configs should be generated.
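
To make the two proposed fixes concrete, here is a minimal Go sketch of the guard logic as I understand it. This is not the actual PMM code: the `Settings` type, its fields, the function names, and the job group labels are all hypothetical, and the exact shape of the two `populateConfig` conditions is an assumption based on the behavior described above.

```go
package main

// Settings is a hypothetical stand-in for the relevant server settings.
type Settings struct {
	ExternalVM bool // an external VictoriaMetrics is configured
	HAEnabled  bool // HA mode is on
}

// currentRunsDryRun models the guard after the commit: the local dry-run
// is skipped only when ExternalVM is set AND HA is disabled.
func currentRunsDryRun(s Settings) bool {
	skip := s.ExternalVM && !s.HAEnabled
	return !skip
}

// proposedRunsDryRun models the proposed fix: skip the local dry-run
// whenever ExternalVM is set, regardless of HA state.
func proposedRunsDryRun(s Settings) bool {
	return !s.ExternalVM
}

// currentJobGroups models two populateConfig branches that both carry
// an `&& !HA.Enabled` guard (assumed shape, labels are illustrative).
func currentJobGroups(s Settings) []string {
	groups := []string{"internal"}
	if !s.ExternalVM && !s.HAEnabled {
		groups = append(groups, "agents-local-vm")
	}
	if s.ExternalVM && !s.HAEnabled {
		groups = append(groups, "agents-external-vm")
	}
	return groups
}

// proposedJobGroups drops the HA constraint, so exactly one agent
// group is always generated.
func proposedJobGroups(s Settings) []string {
	groups := []string{"internal"}
	if s.ExternalVM {
		groups = append(groups, "agents-external-vm")
	} else {
		groups = append(groups, "agents-local-vm")
	}
	return groups
}
```

With `Settings{ExternalVM: true, HAEnabled: true}`, the "current" variants run the dry-run and produce only the internal group, while the "proposed" variants skip the dry-run and keep the agent group. That matches both symptoms I see.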
Tested on PMM 3.6.0, Docker HA cluster, 3 nodes, ~6900 agents, external PostgreSQL, external VictoriaMetrics.