PMM databases showed intermittent "down" status due to scrape timeouts being too short

Is there a better way to resolve this issue permanently?

Problem

PMM databases showed intermittent “down” status due to scrape timeouts being too short:

  • HR (high-resolution) jobs: 9s (too short)

  • MR (medium-resolution) jobs: 13.5s (too short)

  • LR (low-resolution) jobs: 54s (too short)

  • Global: 54s (overridden by the job-level settings anyway)

Solution Applied

Temporarily increased scrape timeouts in /etc/victoriametrics-promscrape.yml:

  • HR jobs: 9s → 60s

  • MR jobs: 13.5s → 60s

  • LR jobs: 54s → 120s

  • Global: 54s → 120s
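The three substitutions can also be applied in a single sed pass. A minimal sketch, run here on a throwaway temp file rather than the real config (in the pod the target is /etc/victoriametrics-promscrape.yml):

```shell
# Demo on a temp file; in the pod, point this at /etc/victoriametrics-promscrape.yml
f=$(mktemp)
printf 'scrape_timeout: 9s\nscrape_timeout: 13500ms\nscrape_timeout: 54s\n' > "$f"
# HR 9s -> 60s, MR 13500ms -> 60s, LR/global 54s -> 120s
sed -i \
  -e 's/scrape_timeout: 9s$/scrape_timeout: 60s/' \
  -e 's/scrape_timeout: 13500ms$/scrape_timeout: 60s/' \
  -e 's/scrape_timeout: 54s$/scrape_timeout: 120s/' "$f"
cat "$f"
rm "$f"
```

The same three expressions appear individually in Step 5 below.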

Step-by-Step Commands & Outputs

Step 1: Bash into the pod

kubectl exec -it pmm-server-0 -n pmm -- bash

Output:

[pmm@pmm-server-0 opt] #

Step 2: Create backup

cp /etc/victoriametrics-promscrape.yml /tmp/victoriametrics-promscrape.yml.backup.$(date +%Y%m%d_%H%M%S)
ls -lh /tmp/victoriametrics-promscrape.yml.backup.*

Output:

-rw-r--r-- 1 pmm pmm 122K Feb 10 14:15 /tmp/victoriametrics-promscrape.yml.backup.20260210_141530
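Should the edit go wrong, the timestamped backup can simply be copied back over the live file. A sketch of the round trip, demonstrated on temp files rather than the real pod paths:

```shell
# Backup/restore round trip, demoed on temp files; in the pod you would
# cp the /tmp/victoriametrics-promscrape.yml.backup.* file back over /etc/victoriametrics-promscrape.yml
cfg=$(mktemp)
bak="$cfg.backup.$(date +%Y%m%d_%H%M%S)"
echo 'scrape_timeout: 9s' > "$cfg"
cp "$cfg" "$bak"              # Step 2: create the timestamped backup
sed -i 's/9s$/60s/' "$cfg"    # the edit we want to be able to undo
cp "$bak" "$cfg"              # restore: copy the backup back over the live file
cat "$cfg"                    # back to the original content
rm "$cfg" "$bak"
```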

Step 3: Check current timeout values

grep "scrape_timeout:" /etc/victoriametrics-promscrape.yml | sort | uniq -c

Output:

47 scrape_timeout: 13500ms
43 scrape_timeout: 54s
1 scrape_timeout: 54s
45 scrape_timeout: 9s

Step 4: Check file permissions (discovered permission issue)

ls -la /etc/victoriametrics-promscrape.yml
touch /etc/test-write 2>&1

Output:

-rw-rw-r-- 1 pmm root 124237 Feb 10 14:30 /etc/victoriametrics-promscrape.yml
touch: cannot touch '/etc/test-write': Permission denied

Finding: We cannot create new files in the /etc directory, but we can overwrite the existing config file since the pmm user owns it.

Step 5: Apply changes to real file (via /tmp)

# Copy to /tmp for editing
cp /etc/victoriametrics-promscrape.yml /tmp/victoriametrics-promscrape.yml.edit
# Apply all three replacements
sed -i 's/scrape_timeout: 54s$/scrape_timeout: 120s/' /tmp/victoriametrics-promscrape.yml.edit
sed -i 's/scrape_timeout: 9s$/scrape_timeout: 60s/' /tmp/victoriametrics-promscrape.yml.edit
sed -i 's/scrape_timeout: 13500ms$/scrape_timeout: 60s/' /tmp/victoriametrics-promscrape.yml.edit
# Verify edited file
grep "scrape_timeout:" /tmp/victoriametrics-promscrape.yml.edit | sort | uniq -c

Output:

43 scrape_timeout: 120s
1 scrape_timeout: 120s
92 scrape_timeout: 60s

Step 6: Copy edited file back to /etc

cp /tmp/victoriametrics-promscrape.yml.edit /etc/victoriametrics-promscrape.yml
# Verify real file
grep "scrape_timeout:" /etc/victoriametrics-promscrape.yml | sort | uniq -c

Output:

43 scrape_timeout: 120s
1 scrape_timeout: 120s
92 scrape_timeout: 60s
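As an extra sanity check, diffing the Step 2 backup against the edited file should show only scrape_timeout lines changing. A sketch, demonstrated on temp copies (in the pod, diff the /tmp backup against /etc/victoriametrics-promscrape.yml):

```shell
# Verify an edit touched only scrape_timeout lines, demoed on temp copies
old=$(mktemp); new=$(mktemp)
printf 'scrape_interval: 1m\nscrape_timeout: 54s\n' > "$old"
sed 's/scrape_timeout: 54s/scrape_timeout: 120s/' "$old" > "$new"
# Any changed line (< or >) not mentioning scrape_timeout would be an unintended edit
if diff "$old" "$new" | grep '^[<>]' | grep -qv scrape_timeout; then
  echo "unexpected changes beyond scrape_timeout"
else
  echo "only scrape_timeout lines changed"
fi
rm "$old" "$new"
```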

Step 7: Verify file structure

grep -A 2 "^global:" /etc/victoriametrics-promscrape.yml
grep -A 10 "postgres_exporter.*_hr" /etc/victoriametrics-promscrape.yml | grep -A 2 "scrape_timeout:" | head -3

Output:

global:
  scrape_interval: 1m
  scrape_timeout: 120s

scrape_timeout: 60s

Step 8: Monitor config reload

tail -f /srv/logs/victoriametrics.log | grep -i "SIGHUP\|reloading"

Output:

2026-02-10T14:40:02.358Z info SIGHUP received; reloading Prometheus configs from "/etc/victoriametrics-promscrape.yml"
2026-02-10T14:41:01.200Z info SIGHUP received; reloading Prometheus configs from "/etc/victoriametrics-promscrape.yml"
...
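Since tail -f blocks the terminal, a one-shot grep for the most recent reload event works as well. A sketch, using a temp stand-in for /srv/logs/victoriametrics.log:

```shell
# One-shot alternative to tail -f: print the latest reload event and exit
log=$(mktemp)
printf '%s\n' \
  'info SIGHUP received; reloading Prometheus configs' \
  'info unrelated log line' > "$log"
grep -i 'SIGHUP\|reloading' "$log" | tail -1
rm "$log"
```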

Step 9: Verify no errors and service status

tail -100 /srv/logs/victoriametrics.log | grep -i "error" | grep -v "warn\|cannot scrape" | tail -10
supervisorctl status victoriametrics
tail -20 /srv/logs/victoriametrics.log | grep -i "reloading\|nothing changed"

Output:

(no errors found)
victoriametrics RUNNING pid 3064549, uptime 0:05:46
2026-02-10T14:43:01.243Z info nothing changed in "/etc/victoriametrics-promscrape.yml"

Hi @zoyaaboujaish,

If I’m understanding correctly, you were having issues that were solved by increasing the timeouts. The timeouts are tied to the resolution values, so you need to increase the metric resolution time for each problematic level. This means you will collect fewer data points, of course, but it’s the only supported way to increase the timeouts.

You can easily change the metric resolution times in the PMM settings; this should also increase the timeouts.

If you are consistently having issues you may need to check the network between PMM server and its clients, or check that PMM server has enough resources to deal with the amount of clients connected to it.

Hi @Agustin_G ,

Thank you for your response. Increasing the timeouts did resolve the issue, but these values get reset with every restart or upgrade of the pmm-server.

Changing this via the UI also does not appear to be supported with our current Helm deployment on Kubernetes. From what I understood, the values changed under Metrics resolution - Percona Monitoring and Management are the scrape_interval values, but the issue actually comes from the scrape_timeout values. The goal is to increase the timeout value without increasing the scrape_interval.

I was unable to find what other users have done in this scenario; I’m hoping the forum can offer some insight on how best to manage these values with a self-hosted PMM installed manually via Helm in Kubernetes.

Additionally, the networking was validated, and the issue seems to be restricted to a select few databases that are sometimes too busy to respond before the timeout hits. PMM server resources were also increased substantially to rule out other potential causes.

PMM version: 3.5.0

Hi @zoyaaboujaish,

Increasing the timeouts did resolve the issue but these values are getting reset with every restart of the pmm-server or upgrade.

Yes, this is because it’s currently not possible to increase just the timeouts.

The only way to increase the timeouts is to increase the metric resolution itself (or, as you referred to them, the scrape_intervals). You can do so on the settings page I mentioned.

Additionally, the networking was validated and the issue seems to be restricted to only a very select few databases which are sometimes too occupied to respond before the timeout hits. PMM server resources were also increased heavily to eliminate any other potential issues.

Perfect, thanks for the update on it.