Hello, I am trying to monitor when a host goes down. I am using PMM2, with pmm-client on the remote host.
The Prometheus documentation suggests the following expression:
up == 0
But in PMM2, when a host goes down, the missing metrics don't get a value of 0; they get “NULL” (I don't know if this differs because it's VictoriaMetrics rather than Prometheus). Because of that, I have tried the following queries without success:
I have also been testing this trick, without success:
(up{agent_type="node_exporter"} or on() vector(0)) == 0
I am testing it by running the query in the Explore section of PMM2's Grafana and completely shutting down a test server.
It's very strange that basic monitoring like detecting a host going down generates all these kinds of problems and complex queries. I am sure that I am doing something wrong and that the solution is much simpler and more elegant.
Has anybody successfully monitored when a host goes down and the pmm-client metrics don't exist at that time?
I wonder if it’s a byproduct of “no info” vs. “no value”? Thinking out loud: you’re likely using push metrics (i.e. the PMM client scrapes the exporters locally and then pushes the results to the server), so when your host is down the metrics simply stop showing up, as opposed to the server attempting to scrape a metric and getting no value back. At the same time, Alertmanager is looking in the VictoriaMetrics datastore rather than proactively pinging the individual nodes…
So the on() vector(0) is supposed to “fill in blanks”: when the key comes back but without a value, it gives it a value of 0. I think that relies on the assumption that the label exists and is blank… In this case I think there isn’t even an undefined label to evaluate… there’s just nothing. Have you tried “absent”?
Again, just thinking out loud, but if this is the case it’s an unintended consequence of push vs. pull metrics that we’ll need to account for!
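As a minimal sketch of the “absent” suggestion above (not a tested PMM2 recipe): absent() returns a single series with value 1 when its selector matches nothing, so it can still produce a result after the host's series have disappeared. The node_name label and the value "test-server" are assumptions for illustration; substitute whatever label identifies the host in your setup.

```promql
# Returns 1 while no matching "up" series exists for this one host.
# node_name="test-server" is a hypothetical label and value.
absent(up{agent_type="node_exporter", node_name="test-server"})
```

Note that absent() needs one expression per host, because it cannot know which hosts are expected to exist. This may also be why the or on() vector(0) form did not help: with on(), the vector(0) element is only added when the left-hand side returns no series at all, so a single other host that is still reporting is enough to suppress it.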
Does that mean we have to assume that we won't be able to detect unreachable hosts anymore?
Or that you are working on a solution for newer PMM2 versions?
Well, if that is the case, it means we have an issue to solve, but I’m going to ping someone internal who’s more of an expert than I am to take a look at this thread…
The regex could have a suffix to make sure that only one of the resolution levels is used, if needed. The <insert interval> would be whatever threshold the look-back should occur over.
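The query this reply refers to does not appear in the thread, so the following is only an illustration of the two ideas it mentions, not a reconstruction of it: a job regex that selects a single resolution level, and a look-back window. The "_hr" pattern in the job name, the node_name label and its value, and the 5m window are all assumptions.

```promql
# Hypothetical sketch: returns 1 if no "up" sample was stored for this
# host's (assumed) high-resolution job during the last 5 minutes.
absent_over_time(
  up{agent_type="node_exporter", node_name="test-server", job=~".*_hr.*"}[5m]
)
```

A longer window makes the check more tolerant of short network interruptions, at the cost of detecting a dead host later.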
Hi @Dillonlu I suggest you keep working through the Jira issue rather than updates to this thread on the forums, since the fix is going to come from Engineering. If you have a fix we welcome PRs. Thanks,