Hi, I’m trying to integrate Prometheus Alertmanager with PMM 2.14. Now that Prometheus has been replaced with VictoriaMetrics in 2.14, do I need to install Prometheus separately on my PMM 2.14 instance? If it’s already installed, how can I start or stop Prometheus? I’m following the link below to set up Alertmanager.
I’m on mobile, so forgive any typos, but our tech preview of Integrated Alerting doesn’t require any additional software installation. VictoriaMetrics is compatible, so you’ll be able to create alerts as if it were Prometheus. The basic steps are:
1. Enable alerting under settings
2. Set up one or more communication channels
3. Create an alert template or clone one of our existing ones
4. Create an alert rule with the appropriate criteria and choose the appropriate template
Some info is here, and there’s more coming, but it’s not yet published and I can’t seem to find it on mobile. See if this gets you started, and post any follow-up here and we’ll see what we can do to get you alerting!
Whoops, I just reread your post: if you are connecting to an existing Alertmanager instance, you still don’t need Prometheus. Just point your PMM instance to your external Alertmanager in the settings and paste in your complete alert rule file.
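For anyone unsure what a "complete alert rule file" looks like, here is a minimal sketch in the standard Prometheus rule-file layout. The query is the MySQL-down check that appears in the logs later in this thread; the group name, alert name, `for` duration, severity, and summary wording are all made up for illustration, so adjust them to your setup.

```yaml
groups:
  - name: example-mysql-alerts      # hypothetical group name
    rules:
      - alert: MySQLDown            # hypothetical alert name
        # Query taken from the vmalert log quoted later in this thread
        expr: sum by (service_name, node_name) (mysql_up) == 0
        for: 1m                     # assumed duration, not from the thread
        labels:
          severity: critical        # assumed severity
        annotations:
          summary: "MySQL is down on {{ $labels.node_name }} ({{ $labels.service_name }})"
```

Because the expression aggregates by `service_name` and `node_name`, those are the only labels available to the annotation.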
Got it! I’ll try that out! Thanks for the clarification.
On your other note about Integrated Alerting: I’ve spent a lot of time trying to get it to work, but for some reason I only receive MySQL-down alerts and not alerts on OS-related metrics such as high CPU or memory. That’s why I’m exploring other options. Is there anything else I should be looking at? I believe Enhanced Monitoring is turned on for the instances, and rds_exporter seems to be pulling data fine with cloudmetrics, since I see all those details on the home dashboard. It’s just the alerting that’s not functioning. Please suggest what else I need to look at to fix this. If I can get this working, I may not need another Alertmanager instance. I’ve also looked at the Postgres DB and the ia_rules table, which shows the params column as
---
templates:
  - name: CPU Exceeds threshold
    version: 1
    summary: CPU Exceeds threshold
    expr: |-
      100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100)
      > [[ .threshold ]]
    params:
      - name: threshold
        summary: A percentage from configured maximum
        unit: '%'
        type: float
        range: [0, 100]
        value: 80
    for: 2m
    severity: warning
    labels:
      foo: bar
    annotations:
      description: |-
        CPU exceeds [[ .threshold ]]% on {{ $labels.instance }}
        VALUE = {{ $value }}
        LABELS: {{ $labels }}
      summary: CPU exceeds threshold on (instance {{ $labels.instance }})
You can then create an alert rule based on that template and set your threshold to whatever % you want to be alerted at, along with the duration the criteria must hold true before alerting (the default from this template is 2 minutes, but you can drop it down to 30 seconds or whatever you want).
OK, thanks much. I will try it out. But so far I have been working with the built-in templates. Is anything wrong with the built-in templates, or is it something with my filters/labels? All the errors I’ve listed above are for the built-in templates.
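To make the template/rule relationship concrete, here is roughly what the template above would resolve to once `[[ .threshold ]]` is filled in with the default of 80, written as a plain Prometheus-style rule. This is only a sketch: the group and alert names are hypothetical, and the file PMM actually generates may be laid out differently.

```yaml
groups:
  - name: example-cpu-rules         # hypothetical group name
    rules:
      - alert: CPUExceedsThreshold  # hypothetical alert name
        # [[ .threshold ]] replaced by the template's default value, 80
        expr: |-
          100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100)
          > 80
        for: 2m                     # the template's default duration
        labels:
          severity: warning
          foo: bar
        annotations:
          description: |-
            CPU exceeds 80% on {{ $labels.instance }}
            VALUE = {{ $value }}
            LABELS: {{ $labels }}
          summary: CPU exceeds threshold on (instance {{ $labels.instance }})
```

Note that the `{{ ... }}` parts are left alone: those are evaluated at alert time, while the `[[ ... ]]` parts are filled in when the rule is created from the template.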
There’s a bug that will be fixed in 2.15.0 (I think): some templates are converting the ‘<’ to &lt;. You’d see that in /srv/logs/vmalert.log as the reason a rule couldn’t be parsed.
The query error there makes it seem like VictoriaMetrics isn’t running (“error getting response from”). I’ve not seen VictoriaMetrics crash, but I have seen vmalert crash with my poorly formed rules.
As for the inconsistency of the output, we’ll fix that! You can manually edit the files in /etc/ia/rules/ to change “instance” to “node_name” in the output to give a more useful value. I had to ask about this as well, so you’re not alone!
Got it. Yes, I do see several errors in the log files:
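As a rough illustration of that manual edit, using the CPU rule from earlier in the thread (the threshold of 80 and the exact layout of the generated file are assumptions): a label is only available to the annotations if the expression keeps it, so when the expression aggregates with `by(instance)`, the aggregation needs to switch to `node_name` as well.

```yaml
# Fragment of a rule after the edit; the original used "instance" in both places.
expr: |-
  100 - (avg by(node_name) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100)
  > 80
annotations:
  summary: CPU exceeds threshold on (node {{ $labels.node_name }})
```

If only the annotation is changed and the expression still aggregates by `instance`, `{{ $labels.node_name }}` would come out empty.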
2021-02-12T07:29:34.026Z error /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.50.2/app/vmalert/main.go:95 error while reloading rules: cannot parse configuration file: errors(2): invalid group "PMM Integrated Alerting" in file "/etc/ia/rules/03a35104-bd29-4e6a-9cfc-7d75da41b27e.yml": invalid expression for rule "PMM Integrated Alerting"."/rule_id/03a35104-bd29-4e6a-9cfc-7d75da41b27e": cannot recognize "&lt; 1"; unparsed data: "mysql_global_status_uptime&lt; 1"
2021-02-12T07:29:44.337Z error /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.50.2/app/vmalert/main.go:95 error while reloading rules: cannot parse configuration file: errors(2): invalid group "PMM Integrated Alerting" in file "/etc/ia/rules/03a35104-bd29-4e6a-9cfc-7d75da41b27e.yml": invalid expression for rule "PMM Integrated Alerting"."/rule_id/03a35104-bd29-4e6a-9cfc-7d75da41b27e": cannot recognize "&lt; 1"; unparsed data: "mysql_global_status_uptime&lt; 1"
invalid group "PMM Integrated Alerting" in file "/etc/ia/rules/660fda01-27b9-42be-8892-095c2b1880dc.yml": invalid expression for rule "PMM Integrated Alerting"."/rule_id/660fda01-27b9-42be-8892-095c2b1880dc": cannot recognize "&lt; 50"; unparsed data: "100&lt; 50"
2021-02-12T19:12:44.787Z error /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.50.2/app/vmalert/group.go:245 group "PMM Integrated Alerting": rule "/rule_id/a1af9575-f6df-4700-ae76-93203d297644": failed to execute: failed to execute query "sum by (service_name, node_name) (mysql_up) == 0": error getting response from http://127.0.0.1:9090/prometheus/api/v1/query?query=sum+by+%28service_name%2C+node_name%29+%28mysql_up%29+%3D%3D+0: Post "http://127.0.0.1:9090/prometheus/api/v1/query?query=sum+by+%28service_name%2C+node_name%29+%28mysql_up%29+%3D%3D+0": EOF
2021-02-12T22:02:45.953Z error /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.50.2/app/vmalert/remotewrite/remotewrite.go:213 attempt 1 to send request failed: error while sending request to http://127.0.0.1:9090/prometheus/api/v1/write: Post "http://127.0.0.1:9090/prometheus/api/v1/write": EOF; Data len 325(325)
Can I replace “&lt;” with < and see if that works, or will I break IA? I see such errors for the mysql_status and out-of-memory templates, but at least the CPU alerts should have been working. I don’t see any errors related to them, but no alerts either…
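For reference, this is the kind of change being asked about, sketched for the uptime rule named in the log. It assumes the rule file stores the expression under a standard `expr:` key, and it says nothing about whether PMM later regenerates the file and overwrites a manual fix.

```yaml
# As the bug leaves it (HTML-escaped operator, which vmalert cannot parse):
#   expr: mysql_global_status_uptime&lt; 1

# Presumably intended form:
expr: mysql_global_status_uptime < 1
```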