PMM databases showed intermittent "down" status due to scrape timeouts being too short

Is there a better way to resolve this issue permanently?

Problem

PMM databases showed intermittent “down” status due to scrape timeouts being too short:

  • HR (high-resolution) jobs: 9s (too short)

  • MR (medium-resolution) jobs: 13.5s (too short)

  • LR (low-resolution) jobs: 54s (too short)

  • Global: 54s (overridden by the job-level settings anyway)

Solution Applied

Temporarily increased scrape timeouts in /etc/victoriametrics-promscrape.yml:

  • HR jobs: 9s → 60s

  • MR jobs: 13.5s → 60s

  • LR jobs: 54s → 120s

  • Global: 54s → 120s
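The three substitutions can also be applied in a single sed pass. A minimal sketch, run here on a throwaway temp file rather than the real config (in the pod the target is /etc/victoriametrics-promscrape.yml):

```shell
# Demo on a temp file; in the pod, point this at /etc/victoriametrics-promscrape.yml
f=$(mktemp)
printf 'scrape_timeout: 9s\nscrape_timeout: 13500ms\nscrape_timeout: 54s\n' > "$f"
# HR 9s -> 60s, MR 13500ms -> 60s, LR/global 54s -> 120s
sed -i \
  -e 's/scrape_timeout: 9s$/scrape_timeout: 60s/' \
  -e 's/scrape_timeout: 13500ms$/scrape_timeout: 60s/' \
  -e 's/scrape_timeout: 54s$/scrape_timeout: 120s/' "$f"
cat "$f"
rm "$f"
```

The same three expressions appear individually in Step 5 below.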

Step-by-Step Commands & Outputs

Step 1: Bash into the pod

kubectl exec -it pmm-server-0 -n pmm -- bash

Output:

[pmm@pmm-server-0 opt] #

Step 2: Create backup

cp /etc/victoriametrics-promscrape.yml /tmp/victoriametrics-promscrape.yml.backup.$(date +%Y%m%d_%H%M%S)
ls -lh /tmp/victoriametrics-promscrape.yml.backup.*

Output:

-rw-r--r-- 1 pmm pmm 122K Feb 10 14:15 /tmp/victoriametrics-promscrape.yml.backup.20260210_141530
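Should the edit go wrong, the timestamped backup can simply be copied back over the live file. A sketch of the round trip, demonstrated on temp files rather than the real pod paths:

```shell
# Backup/restore round trip, demoed on temp files; in the pod you would
# cp the /tmp/victoriametrics-promscrape.yml.backup.* file back over /etc/victoriametrics-promscrape.yml
cfg=$(mktemp)
bak="$cfg.backup.$(date +%Y%m%d_%H%M%S)"
echo 'scrape_timeout: 9s' > "$cfg"
cp "$cfg" "$bak"              # Step 2: create the timestamped backup
sed -i 's/9s$/60s/' "$cfg"    # the edit we want to be able to undo
cp "$bak" "$cfg"              # restore: copy the backup back over the live file
cat "$cfg"                    # back to the original content
rm "$cfg" "$bak"
```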

Step 3: Check current timeout values

grep "scrape_timeout:" /etc/victoriametrics-promscrape.yml | sort | uniq -c

Output:

47 scrape_timeout: 13500ms
43 scrape_timeout: 54s
1 scrape_timeout: 54s
45 scrape_timeout: 9s

Step 4: Check file permissions (discovered permission issue)

ls -la /etc/victoriametrics-promscrape.yml
touch /etc/test-write 2>&1

Output:

-rw-rw-r-- 1 pmm root 124237 Feb 10 14:30 /etc/victoriametrics-promscrape.yml
touch: cannot touch '/etc/test-write': Permission denied

Finding: We cannot create new files in the /etc directory, but we can overwrite the existing config file since the pmm user owns it.

Step 5: Apply changes to real file (via /tmp)

# Copy to /tmp for editing
cp /etc/victoriametrics-promscrape.yml /tmp/victoriametrics-promscrape.yml.edit
# Apply all three replacements
sed -i 's/scrape_timeout: 54s$/scrape_timeout: 120s/' /tmp/victoriametrics-promscrape.yml.edit
sed -i 's/scrape_timeout: 9s$/scrape_timeout: 60s/' /tmp/victoriametrics-promscrape.yml.edit
sed -i 's/scrape_timeout: 13500ms$/scrape_timeout: 60s/' /tmp/victoriametrics-promscrape.yml.edit
# Verify edited file
grep "scrape_timeout:" /tmp/victoriametrics-promscrape.yml.edit | sort | uniq -c

Output:

43 scrape_timeout: 120s
1 scrape_timeout: 120s
92 scrape_timeout: 60s

Step 6: Copy edited file back to /etc

cp /tmp/victoriametrics-promscrape.yml.edit /etc/victoriametrics-promscrape.yml
# Verify real file
grep "scrape_timeout:" /etc/victoriametrics-promscrape.yml | sort | uniq -c

Output:

43 scrape_timeout: 120s
1 scrape_timeout: 120s
92 scrape_timeout: 60s
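As an extra sanity check, diffing the Step 2 backup against the edited file should show only scrape_timeout lines changing. A sketch, demonstrated on temp copies (in the pod, diff the /tmp backup against /etc/victoriametrics-promscrape.yml):

```shell
# Verify an edit touched only scrape_timeout lines, demoed on temp copies
old=$(mktemp); new=$(mktemp)
printf 'scrape_interval: 1m\nscrape_timeout: 54s\n' > "$old"
sed 's/scrape_timeout: 54s/scrape_timeout: 120s/' "$old" > "$new"
# Any changed line (< or >) not mentioning scrape_timeout would be an unintended edit
if diff "$old" "$new" | grep '^[<>]' | grep -qv scrape_timeout; then
  echo "unexpected changes beyond scrape_timeout"
else
  echo "only scrape_timeout lines changed"
fi
rm "$old" "$new"
```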

Step 7: Verify file structure

grep -A 2 "^global:" /etc/victoriametrics-promscrape.yml
grep -A 10 "postgres_exporter.*_hr" /etc/victoriametrics-promscrape.yml | grep -A 2 "scrape_timeout:" | head -3

Output:

global:
  scrape_interval: 1m
  scrape_timeout: 120s

scrape_timeout: 60s

Step 8: Monitor config reload

tail -f /srv/logs/victoriametrics.log | grep -i "SIGHUP\|reloading"

Output:

2026-02-10T14:40:02.358Z info SIGHUP received; reloading Prometheus configs from "/etc/victoriametrics-promscrape.yml"
2026-02-10T14:41:01.200Z info SIGHUP received; reloading Prometheus configs from "/etc/victoriametrics-promscrape.yml"
...
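Since tail -f blocks the terminal, a one-shot grep for the most recent reload event works as well. A sketch, using a temp stand-in for /srv/logs/victoriametrics.log:

```shell
# One-shot alternative to tail -f: print the latest reload event and exit
log=$(mktemp)
printf '%s\n' \
  'info SIGHUP received; reloading Prometheus configs' \
  'info unrelated log line' > "$log"
grep -i 'SIGHUP\|reloading' "$log" | tail -1
rm "$log"
```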

Step 9: Verify no errors and service status

tail -100 /srv/logs/victoriametrics.log | grep -i "error" | grep -v "warn\|cannot scrape" | tail -10
supervisorctl status victoriametrics
tail -20 /srv/logs/victoriametrics.log | grep -i "reloading\|nothing changed"

Output:

(no errors found)
victoriametrics RUNNING pid 3064549, uptime 0:05:46
2026-02-10T14:43:01.243Z info nothing changed in "/etc/victoriametrics-promscrape.yml"

Hi @zoyaaboujaish,

If I’m understanding correctly, you were having issues that were solved by increasing the timeouts. The timeouts are tied to the resolution values, so you need to increase the metric resolution time for each problematic level. This means you will collect fewer data points, of course, but it’s the only supported way to increase the timeouts.

You can easily change the metric resolution times in the PMM settings; this should also increase the timeouts.

If you are consistently having issues you may need to check the network between PMM server and its clients, or check that PMM server has enough resources to deal with the amount of clients connected to it.

Hi @Agustin_G ,

Thank you for your response. Increasing the timeouts did resolve the issue, but these values get reset with every restart or upgrade of the pmm-server.

Changing this via the UI also does not appear to be supported with our current Helm deployment on Kubernetes. From what I understood, the values changed under Metrics resolution - Percona Monitoring and Management are the scrape_interval values, but the issue actually comes from the scrape_timeout values. The goal is to increase the timeout value without increasing the scrape_interval.

I was unable to find what other users have done in this scenario; I’m hoping the forum can offer some insight on how best to manage these values with a self-hosted PMM installed manually via Helm in Kubernetes.

Additionally, the networking was validated, and the issue seems to be restricted to a select few databases that are sometimes too busy to respond before the timeout hits. PMM server resources were also increased substantially to rule out other potential causes.

PMM version: 3.5.0

Hi @zoyaaboujaish,

Increasing the timeouts did resolve the issue but these values are getting reset with every restart of the pmm-server or upgrade.

Yes, this is because it’s currently not possible to increase just the timeouts.

The only way to increase the timeouts is to increase the metric resolution itself (or, as you referred to them, the scrape_intervals). You can do so on the settings page I mentioned.

Additionally, the networking was validated and the issue seems to be restricted to only a very select few databases which are sometimes too occupied to respond before the timeout hits. PMM server resources were also increased heavily to eliminate any other potential issues.

Perfect, thanks for the update on it.