PMM Client-Side Caching in 3.2.0 & HA Setup for PMM Server in Kubernetes

Hi Percona Team,

We’re currently using PMM Server and Clients v3.1.0, and planning to upgrade to v3.2.0.

I read that PMM 2.33 introduced client-side caching using the embedded vmagent to reduce metrics loss during short PMM Server downtimes.
Ref: PMM V2.33: Offline Metric Collection, Guided Alerting Tour, Security Fixes, and More!

Our current setup is as follows:

  • PMM Agents are deployed on EC2 instances.
  • The PMM Server runs as a single pod in Kubernetes (since HA is not available).
  • We experience metrics gaps during PMM Server restarts or outages.

Could you please confirm the following:

  1. Is this client-side caching available and active in PMM 3.2.0?
  2. If it is not enabled by default, how can we configure it?
  3. Is there a way to observe the cache/queue status or dropped metrics on the client side?
  4. If there is no robust client-side solution to avoid data loss, what is the recommended way to run PMM Server in HA mode in Kubernetes, such as running multiple PMM Server pods behind a service?
  5. Alternatively, what are the best practices to minimize or control metrics loss during PMM Server unavailability?

We’re looking for a reliable approach to ensure we don’t lose metrics during brief outages, whether through client-side or server-side resilience.

Thanks again for your guidance and for the great product!

  1. Yes, and it should already be active in 3.1.0.
  2. It is enabled by default.
  3. You can check the vmagent_remotewrite_pending_data_bytes metric (see the query example after this list).
  4. We are working on an HA solution for Kubernetes; there is no ETA yet, but the work is in progress.
  5. As an alternative, you can run an external VictoriaMetrics server, point PMM at it with the PMM_VM_URL environment variable, and configure that VictoriaMetrics for HA (see the sketch after this list). In that case no metrics are lost, even during longer outages.
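A minimal way to inspect that metric, assuming your PMM Server proxies the VictoriaMetrics query API under /prometheus (the host, credentials, and proxy path below are placeholders; adjust them to your deployment, or run the same PromQL in Grafana Explore instead):

  # Placeholder host and credentials; -k skips TLS verification for self-signed certs
  curl -sk -u "admin:$PMM_PASSWORD" \
    'https://pmm-server.example.com/prometheus/api/v1/query' \
    --data-urlencode 'query=max by (instance) (vmagent_remotewrite_pending_data_bytes)'

A non-zero value for an instance means its local vmagent is holding buffered samples it has not yet been able to push.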
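If you go the external VictoriaMetrics route, a rough sketch of pointing PMM Server at it in Kubernetes could look like the following; the namespace, resource name/kind, and service URL are hypothetical, and a single-node VictoriaMetrics listens on port 8428 by default (the URL format differs for a clustered setup, so check the PMM and VictoriaMetrics docs for your topology):

  # Hypothetical names; use statefulset/... instead if that is how PMM Server is deployed
  kubectl -n monitoring set env deployment/pmm-server \
    PMM_VM_URL=http://victoria-metrics.monitoring.svc:8428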

@nurlan How can they diagnose why local caching isn’t working? Is there a way for them to verify writes are happening locally but not being pushed/replicated to the PMM VM instance when it returns?

The PMM Client has a limited cache size.
How long was the outage?
It would probably be good to check the pmm-agent logs.
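For example, on one of the EC2 instances you could look at the agent around the incident window; this assumes the package-based install where pmm-agent runs as a systemd service:

  # Agent logs for the outage window (adjust the timestamps)
  sudo journalctl -u pmm-agent --since "2025-06-06 08:30" --until "2025-06-06 09:00"
  # Overall client status, including the list of running sub-agents such as vmagent
  sudo pmm-admin status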

Hey @nurlan,

Thanks for your swift responses and the valuable insights.

On June 6th, our PMM Server pod was OOMKilled and was down for approximately 7 to 8 minutes. We observed a clear gap in our metrics during this downtime, and despite our PMM agents being on v3.1.0 (where buffering is enabled by default), they didn’t appear to push the historical data back once the server came online.

What’s particularly puzzling is that when we examined the vmagent_remotewrite_pending_data_bytes metric across all our database instances during the incident, it consistently showed 0. This was quite unexpected, as I was anticipating some pending data if local caching was actively occurring.

Our pmm-agent logs from that period also showed repeated connection errors, such as:
time="2025-06-06T08:38:21.525-07:00" level=error msg="Failed to connect to pmm-server:443: timeout." component=client

Considering both the vmagent_remotewrite_pending_data_bytes remaining at 0 and the agent logs indicating connection timeouts, it suggests that the local caching mechanism wasn’t actively buffering the data during the server outage.

We’re trying to understand if we might be missing a specific configuration. Any thoughts on why the buffered metrics weren’t sent once the PMM server was back online would be greatly appreciated as we troubleshoot this further.

Thanks for your continued support!