Basic alerting expression

Hello, I am trying to monitor when a host goes down. I am using PMM2 and pmm-client on the remote host.

Prometheus documentation indicates the following expression:

up == 0
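
For reference, the full rule around that expression in the Prometheus documentation looks roughly like this (the alert name, the 5m duration, and the labels are the docs' example, not my configuration):

    groups:
    - name: example
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"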

But in PMM2, when a host goes down, the missing metrics don't get a 0 value, they become "NULL" (I don't know if this differs because it's VictoriaMetrics rather than Prometheus). With that in mind, I have tried the following queries without success:

absent_over_time(up{agent_type="node_exporter"}[1m])
absent(up{agent_type="node_exporter"})

I have also been testing this trick, without success:

(up{agent_type="node_exporter"} or on() vector(0)) == 0

I am testing it by running the query in the PMM2 Grafana Explore section while completely shutting down the test server.

It's very strange that something as basic as detecting a host going down generates this kind of problem and requires complex queries. I am sure I am doing something wrong and that the solution is much simpler and more elegant.

Has anybody successfully monitored a host going down when the pmm-client metrics don't exist at that time?

I am using the following PMM and client versions.

PMM2:
2.24.0-64.2111181433.7a11d94.el7

pmm2-client:
2.25.0-6.focal

Best regards.

Same problem here.

There should be an easy way to monitor when a host is unreachable.

I have the same problem. I have been testing PromQL queries to detect unreachable hosts for some days, but I can't get them to work.

If someone knows a query that achieves unreachable-host monitoring, please post it; I would very much appreciate your help.

Hmmm…

I wonder if it's a byproduct of "no info" vs. "no value"? Thinking out loud: you're likely using push metrics (i.e. the pmm client scrapes the exporters locally and then pushes the results to the server), but when your host is down the metrics simply stop showing up, as opposed to the server attempting to scrape a target and recording that there is no value. At the same time, Alertmanager is looking at the VictoriaMetrics datastore rather than proactively pinging the individual nodes…

So the or on() vector(0) is supposed to "fill in the blanks": when the key comes back without a value, it gives it a value of 0. I think that works on the assumption that the label exists and is blank… in this case I think there isn't even an undefined label to evaluate… there's just nothing. Have you tried "absent"?
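
Back on the vector(0) point for a second: the way I've usually seen that fallback written is per target, since vector(0) carries no labels to group by. Something like the line below, where the node_name value is just a placeholder of mine, though given what you're seeing I'm not certain it behaves the same way in VictoriaMetrics:

(up{agent_type="node_exporter", node_name="test-server"} or on() vector(0)) == 0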

Again, just thinking out loud, but if this is the case it's an unintended consequence of push vs. pull metrics that we'll need to account for!

Hello, thank you for answering.

Yes, I have tested:
absent(up{agent_type="node_exporter"})

Returning:
0 series returned

I have also tried:
absent_over_time(up{agent_type="node_exporter"}[1m])

This returns 1 during the first 1m; once the 1m has passed, it returns 0. I configured Alertmanager with the following configuration:

    expr: sum (absent_over_time(up{agent_type="node_exporter"}[1m])) by (node_name) > 0
    for: 1m
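
For context, a fuller rule sketch around that expression would look something like this (the alert name, severity label, and annotation text are placeholders I made up, not PMM defaults):

    - alert: NodeExporterMetricsAbsent
      expr: sum(absent_over_time(up{agent_type="node_exporter"}[1m])) by (node_name) > 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "node_exporter metrics have been absent for more than 1 minute"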

I am getting false recovery notifications once the 1m has passed.

Any idea or tip on how to solve this?

Best regards.

Does this mean we have to assume we won't be able to detect unreachable hosts anymore?
Or are you working on a solution for newer PMM2 versions?

Best regards.

Well, if that is the case, it means we have an issue to solve, but I'm going to ping someone internal who's more of an expert than I am to take a look at this thread…

We verified this internally and confirmed it is an issue. I’ve created a bug in our tracker you can watch for status updates.

@b4buFr1k you could try rules such as:

MySQL

absent_over_time(up{job=~"^mysqld_exporter_agent_id.*"}[<insert interval>])

Node

absent_over_time(up{job=~"^node_exporter_agent_id.*"}[<insert interval>])

If needed, the regex could have a suffix to make sure that only one of the resolution levels is used; the <insert interval> would be whatever threshold the look-back should cover.
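
As a concrete sketch, a rule built around the Node expression might look like this (the _hr suffix, the 5m window, the alert name, and the annotation are placeholders to adapt, not something PMM ships):

    - alert: NodeMetricsAbsent
      expr: absent_over_time(up{job=~"^node_exporter_agent_id.*_hr$"}[5m])
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "No node_exporter metrics received in the last 5 minutes"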

Any updates? PMM-9544 is still open. From the comments, I don't understand what causes this problem. Is it related to Prometheus?

Hi @Dillonlu, I suggest you keep working through the Jira issue rather than following updates to this thread on the forums, since the fix is going to come from Engineering. If you have a fix, we welcome PRs. Thanks,
