Basic alerting expression

Hello, I am trying to monitor when a host goes down. I am using PMM2 and pmm-client on the remote host.

Prometheus documentation indicates the following expression:

up == 0
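
For reference, the full rule around that expression in the Prometheus documentation looks roughly like this (the alert name, the 5m duration, and the labels are the docs' example, not my configuration):

    groups:
    - name: example
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"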

But in PMM2, when a host goes down, the missing metrics don't get a 0 value, they become "NULL" (I don't know if this differs because it's VictoriaMetrics rather than Prometheus). With that in mind, I have tried the following queries without success:

absent_over_time(up{agent_type="node_exporter"}[1m])
absent(up{agent_type="node_exporter"})

I have also been testing this trick, without success:

(up{agent_type="node_exporter"} or on() vector(0)) == 0

I am testing it by running the query in the PMM2 Grafana Explore section while completely shutting down the test server.

It's very strange that something as basic as detecting a host going down generates this kind of problem and requires complex queries. I am sure I am doing something wrong and that the solution is much simpler and more elegant.

Has anybody successfully monitored a host going down when the pmm-client metrics don't exist at that time?

I am using the following PMM and client versions.

PMM2:
2.24.0-64.2111181433.7a11d94.el7

pmm2-client:
2.25.0-6.focal

Best regards.

Same problem here.

There should be an easy way to monitor when a host is unreachable.

I have the same problem. I have been testing PromQL queries to detect unreachable hosts for some days, but I can't get them to work.

If someone knows a query that achieves unreachable-host monitoring, please post it; I would very much appreciate your help.

Hmmm…

I wonder if it's a byproduct of "no info" vs. "no value"? Thinking out loud: you're likely using push metrics (i.e. the pmm client scrapes the exporters locally and then pushes the results to the server), but when your host is down the metrics simply stop showing up, as opposed to the server attempting to scrape a target and recording that there is no value. At the same time, Alertmanager is looking at the VictoriaMetrics datastore rather than proactively pinging the individual nodes…

So the or on() vector(0) is supposed to "fill in the blanks": when the key comes back without a value, it gives it a value of 0. I think that works on the assumption that the label exists and is blank… in this case I think there isn't even an undefined label to evaluate… there's just nothing. Have you tried "absent"?
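
Back on the vector(0) point for a second: the way I've usually seen that fallback written is per target, since vector(0) carries no labels to group by. Something like the line below, where the node_name value is just a placeholder of mine, though given what you're seeing I'm not certain it behaves the same way in VictoriaMetrics:

(up{agent_type="node_exporter", node_name="test-server"} or on() vector(0)) == 0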

Again, just thinking out loud, but if this is the case it's an unintended consequence of push vs. pull metrics that we'll need to account for!

Hello, thank you for answering.

Yes, I have tested:
absent(up{agent_type="node_exporter"})

Returning:
0 series returned

I have also tried:
absent_over_time(up{agent_type="node_exporter"}[1m])

This returns 1 during the first 1m; once the 1m has passed, it returns 0. I configured Alertmanager with the following configuration:

    expr: sum (absent_over_time(up{agent_type="node_exporter"}[1m])) by (node_name) > 0
    for: 1m
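
For context, a fuller rule sketch around that expression would look something like this (the alert name, severity label, and annotation text are placeholders I made up, not PMM defaults):

    - alert: NodeExporterMetricsAbsent
      expr: sum(absent_over_time(up{agent_type="node_exporter"}[1m])) by (node_name) > 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "node_exporter metrics have been absent for more than 1 minute"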

I am getting false recovery notifications once the 1m has passed.

Any idea or tip on how to solve this?

Best regards.

Does this mean we have to assume we won't be able to detect unreachable hosts anymore?
Or are you working on a solution for newer PMM2 versions?

Best regards.

Well, if that is the case, it means we have an issue to solve, but I'm going to ping someone internal who's more of an expert than I am to take a look at this thread…

We verified this internally and confirmed it is an issue. I’ve created a bug in our tracker you can watch for status updates.

@b4buFr1k you could try rules such as:

MySQL

absent_over_time(up{job=~"^mysqld_exporter_agent_id.*"}[<insert interval>])

Node

absent_over_time(up{job=~"^node_exporter_agent_id.*"}[<insert interval>])

If needed, the regex could have a suffix to make sure that only one of the resolution levels is used; the <insert interval> would be whatever threshold the look-back should cover.
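
As a concrete sketch, a rule built around the Node expression might look like this (the _hr suffix, the 5m window, the alert name, and the annotation are placeholders to adapt, not something PMM ships):

    - alert: NodeMetricsAbsent
      expr: absent_over_time(up{job=~"^node_exporter_agent_id.*_hr$"}[5m])
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "No node_exporter metrics received in the last 5 minutes"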

Any updates? PMM-9544 is still open. From the comments, I don't understand what causes this problem. Is it related to Prometheus?

Hi @Dillonlu, I suggest you keep working through the Jira issue rather than following updates to this thread on the forums, since the fix is going to come from Engineering. If you have a fix, we welcome PRs. Thanks,
