The PMM Server runs as a single pod in Kubernetes (since HA is not currently available).
We experience metrics gaps during PMM Server restarts or outages.
Could you please confirm the following:
Is this client-side caching available and active in PMM Version 3.2.0?
If not enabled by default, how can we configure it?
Is there any way to observe the cache/queue status or dropped metrics on the client side?
If there’s no robust client-side solution to avoid data loss, what’s the recommended way to run PMM Server in HA mode in Kubernetes—such as running multiple PMM Server pods behind a service?
Alternatively, what are the best practices to minimize or control metrics loss during PMM Server unavailability?
We’re looking for a reliable approach to ensure we don’t lose metrics during brief outages, whether through client-side or server-side resilience.
Thanks again for your guidance and for the great product!
You can check the vmagent_remotewrite_pending_data_bytes metric; it shows how much data the client-side vmagent is still holding because it could not yet be sent to PMM Server.
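If it helps, here is a minimal sketch of a Prometheus-style alerting rule that would surface client-side buffering; the group name, alert name, threshold, and the node_name grouping label are assumptions you would adapt to your own labels and tooling (only the metric name comes from the reply above):

```yaml
groups:
  - name: pmm-client-buffering            # hypothetical rule group name
    rules:
      - alert: PMMClientBufferingMetrics  # hypothetical alert name
        # vmagent_remotewrite_pending_data_bytes > 0 means the client-side
        # vmagent is holding data it has not been able to push to PMM Server yet.
        expr: max by (node_name) (vmagent_remotewrite_pending_data_bytes) > 0
        for: 5m                           # ignore short blips during normal operation
        labels:
          severity: warning
        annotations:
          summary: "PMM client on {{ $labels.node_name }} is buffering metrics locally"
```

A rule like this firing during an outage (and clearing afterwards) would be a good indication that the client-side buffer is actually doing its job.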
We are working on an HA solution for PMM in Kubernetes. I don’t have an ETA yet, but it is in progress.
As a workaround, you can run an external VictoriaMetrics server, point PMM to it with the PMM_VM_URL environment variable, and configure that VictoriaMetrics installation for HA. In that case no metrics will be lost, even during longer outages.
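To illustrate, here is a minimal sketch of what that could look like in a PMM Server manifest; the container name, image tag, service URL, and namespace are placeholders, and only the PMM_VM_URL variable itself comes from the suggestion above:

```yaml
# Pod spec fragment from a PMM Server Deployment/StatefulSet
# (names, image tag, and URL are placeholders).
containers:
  - name: pmm-server
    image: percona/pmm-server:3.2.0
    env:
      # Point PMM at an external VictoriaMetrics endpoint instead of the
      # bundled single-node instance. The URL is just an example of wherever
      # your HA VictoriaMetrics setup is reachable inside the cluster.
      - name: PMM_VM_URL
        value: "http://victoriametrics.monitoring.svc.cluster.local:8428"
```

How the VictoriaMetrics side achieves HA (cluster mode or replicated instances) is a separate design decision outside this fragment.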
@nurlan How can they diagnose why local caching isn’t working? Is there a way for them to verify that writes are being cached locally but not pushed/replicated to the PMM VM instance once it comes back online?
Thanks for your swift responses and valuable insights.
On June 6th we encountered an issue in which our PMM Server pod was OOMKilled and was down for approximately 7 to 8 minutes. We observed a clear gap in our metrics during this downtime, and although our PMM agents are on v3.1.0 (where buffering is enabled by default), they did not appear to push the buffered historical data once the server came back online.
What’s particularly puzzling is that when we examined the vmagent_remotewrite_pending_data_bytes metric across all our database instances during the incident, it consistently showed 0. This was quite unexpected, as I was anticipating some pending data if local caching was actively occurring.
Our pmm-agent logs from that period also showed repeated connection errors, such as: time="2025-06-06T08:38:21.525-07:00" level=error msg="Failed to connect to pmm-server:443: timeout." component=client
Taken together, the vmagent_remotewrite_pending_data_bytes metric staying at 0 and the agent logs showing connection timeouts suggest that the local caching mechanism wasn’t actually buffering data during the server outage.
We’re trying to understand if we might be missing a specific configuration. Any thoughts on why the buffered metrics weren’t sent once the PMM server was back online would be greatly appreciated as we troubleshoot this further.