PMM 2.14 - Prometheus Alertmanager

Hi, I’m trying to integrate Prometheus Alertmanager with PMM 2.14. Now that Prometheus has been replaced with VictoriaMetrics in 2.14, do I need to install Prometheus separately on my PMM 2.14 instance? If it’s already installed, how can I start or stop Prometheus? I’m following the link below to set up Alertmanager.

Please help !

I’m on mobile, so forgive any typos, but our tech preview of Integrated Alerting doesn’t require any additional software installation. VictoriaMetrics is Prometheus-compatible, so you’ll be able to create alerts as if it were Prometheus. The basic steps are:

  • Enable alerting under settings
  • Set up one or more communication channels
  • Create an alert template or clone one of our existing ones
  • Create an alert rule with the appropriate criteria and choose the appropriate template

Some info is here, and there’s more coming that isn’t published yet, but I can’t seem to find it on mobile. See if this gets you started, post any follow-up questions here, and we’ll see what we can do to get you alerting!

Edit: found the FAQ that will be published soon.

Whoops, I just reread your post: if you are connecting to an existing Alertmanager instance, you still don’t need Prometheus. Just point your PMM instance to your external Alertmanager in the settings and paste in your complete alert rule file.

Got it! I’ll try that out! Thanks for the clarification.

On your other note about Integrated Alerting: I’ve spent a lot of time trying to get it to work, but for some reason I only receive MySQL-down alerts, not alerts on OS-related metrics such as high CPU or memory usage. That’s why I’m exploring other options. Is there anything else I should be looking at? I believe enhanced monitoring is turned on for the instances, and rds_exporter seems to be pulling data well with cloudmetrics, since I see all those details in the home dashboard. It’s just the alerting that isn’t working. Please suggest what else I need to look at to fix this. If I can get this working, I may not need another Alertmanager instance. I’ve also looked at the Postgres DB, and the ia_rules table shows the params column as:

[{"bool": false, "name": "threshold", "type": "float", "float": 50, "string": ""}]

Does that mean it’s not taking any threshold values?
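Nobody in this thread confirms how the params column is encoded, so the following is only a sketch of one plausible reading: each record carries one slot per type ("bool", "float", "string"), and the "type" field names the slot that holds the real value. The function name `effective_params` is made up for illustration. Under that assumption the stored threshold is 50, and the false and empty-string entries are just unused slots.

```python
import json

# Value copied from the ia_rules.params column quoted above (quotes normalized).
PARAMS_JSON = (
    '[{"bool": false, "name": "threshold", "type": "float", '
    '"float": 50, "string": ""}]'
)


def effective_params(raw: str) -> dict:
    """Map each parameter name to the value in the slot named by its "type".

    Assumption (not confirmed in this thread): "type" selects which of the
    typed slots ("bool", "float", "string") carries the actual value.
    """
    return {p["name"]: p[p["type"]] for p in json.loads(raw)}


print(effective_params(PARAMS_JSON))
```

If this reading is right, the row does hold a threshold, and the missing CPU alerts have a different cause.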

Here are the errors from the error log

2021-02-06T01:21:44.786Z error /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.50.2/app/vmalert/group.go:245 group "PMM Integrated Alerting": rule "/rule_id/a1af9575-f6df-4700-ae76-93203d297644": failed to execute: failed to execute query "sum by (service_name, node_name) (mysql_up) == 0": error getting response from http://127.0.0.1:9090/prometheus/api/v1/query?query=sum+by+(service_name%2C+node_name)+(mysql_up)+%3D%3D+0: Post "http://127.0.0.1:9090/prometheus/api/v1/query?query=sum+by+(service_name%2C+node_name)+(mysql_up)+%3D%3D+0": EOF

Also, some of the alerts display the hostname while others don’t, for example:

[FIRING:1] (/rule_id/a1af9575-f6df-4700-ae76-93203d297644 PMM Integrated Alerting 1 aw52-ls-aurora-z02-rds-20 /rule_id/a1af9575-f6df-4700-ae76-93203d297644 aw2-p-aurora-c01-rds-01 critical pmm_mysql_down)

[FIRING:2] (/rule_id/a1af9575-f6df-4700-ae76-93203d297644 PMM Integrated Alerting 1 /rule_id/a1af9575-f6df-4700-ae76-93203d297644 critical pmm_mysql_down)

I also can’t open the alerts when I click on them, and in any case no alerts are displayed in the Alerts tab.

Sorry for all the details… I’ve been struggling to get this working for some time now. :slight_smile:

Here’s a template you can work from:

---
templates:
    - name: CPU Exceeds threshold
      version: 1
      summary: CPU Exceeds threshold
      expr: |-
        100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) 
        > [[ .threshold ]]
      params:
        - name: threshold
          summary: A percentage from configured maximum
          unit: '%'
          type: float
          range: [0, 100]
          value: 80
      for: 2m
      severity: warning
      labels:
        foo: bar
      annotations:
        description: |-
            CPU exceeds [[ .threshold ]]% on {{ $labels.instance }}
            VALUE = {{ $value }}
            LABELS: {{ $labels }}
        summary: CPU exceeds threshold on (instance {{ $labels.instance }})

You can then create an alert rule based on that template, set the threshold to whatever percentage you want to be alerted at, and set how long the condition must hold before the alert fires (the default from this template is 2 minutes, but you can drop it to 30 seconds or whatever you want).
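To make the role of the threshold concrete, here is a minimal sketch of how the `[[ .threshold ]]` placeholder in the template's expr turns into a plain PromQL comparison once a rule supplies a value. It is only an illustration: PMM performs this substitution itself, and the regex-based `render` helper below is hypothetical, not PMM code.

```python
import re

# Expression copied from the template above.
EXPR_TEMPLATE = (
    '100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100)\n'
    "> [[ .threshold ]]"
)

# Matches placeholders of the form [[ .name ]].
_PLACEHOLDER = re.compile(r"\[\[\s*\.(\w+)\s*\]\]")


def render(template: str, params: dict) -> str:
    """Replace every [[ .name ]] placeholder with str(params["name"])."""
    return _PLACEHOLDER.sub(lambda m: str(params[m.group(1)]), template)


print(render(EXPR_TEMPLATE, {"threshold": 80}))
```

The `{{ $labels.instance }}` and `{{ $value }}` markers in the annotations use different delimiters and are left untouched by this step; they are filled in later, when the alert fires.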

OK, thanks much, I will try it out. But so far I have been working with the built-in templates. Is anything wrong with the built-in templates, or is it something with my filters/labels? All the errors I’ve listed above are for the built-in templates.

There’s a bug that will be fixed in 2.15.0 (I think): some templates are converting ‘<’ to ‘&lt;’, and you’d see that in /srv/logs/vmalert.log as the reason a rule couldn’t be parsed.

The query error there makes it seem like VictoriaMetrics isn’t running (“error getting response from”). I’ve not seen VictoriaMetrics crash, but I have seen vmalert crash with my poorly formed rules.
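One way to tell whether VictoriaMetrics is answering at all is to send it the same query that vmalert logged. The sketch below assumes it runs inside the PMM Server container, where the 127.0.0.1:9090 address from the log is reachable; the helper names `build_query_url` and `probe` are made up for this example.

```python
import http.client
import json
import urllib.parse
import urllib.request

# Base URL taken from the vmalert log line quoted earlier in the thread.
BASE_URL = "http://127.0.0.1:9090/prometheus"


def build_query_url(expr: str, base_url: str = BASE_URL) -> str:
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def probe(url: str, timeout: float = 5.0) -> str:
    """Return a short description of whether the endpoint answered."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = json.load(resp)
        return f"ok: status={body.get('status')}"
    except (OSError, http.client.HTTPException, ValueError) as exc:
        # Covers refused or dropped connections, timeouts, HTTP errors,
        # and responses that are not valid JSON.
        return f"failed: {exc!r}"
```

Calling `probe(build_query_url("sum by (service_name, node_name) (mysql_up) == 0"))` returns a string starting with "ok" when the query endpoint responds, and one starting with "failed" when the connection is refused or dropped, as with the EOF in the log.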

As for the inconsistency of the output… we’ll fix that! You can manually edit the files in /etc/ia/rules/ to change “instance” to “node_name” in the output to get a more useful value. I had to ask about this as well, so you’re not alone!

Got it. Yes, I do see several errors in the log files:

2021-02-12T07:29:34.026Z        error   /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.50.2/app/vmalert/main.go:95 error while reloading rules: cannot parse configuration file: errors(2): invalid group "PMM Integrated Alerting" in file "/etc/ia/rules/03a35104-bd29-4e6a-9cfc-7d75da41b27e.yml": invalid expression for rule "PMM Integrated Alerting"."/rule_id/03a35104-bd29-4e6a-9cfc-7d75da41b27e": cannot recognize "&lt; 1"; unparsed data: "mysql_global_status_uptime&lt; 1"
2021-02-12T07:29:44.337Z        error   /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.50.2/app/vmalert/main.go:95 error while reloading rules: cannot parse configuration file: errors(2): invalid group "PMM Integrated Alerting" in file "/etc/ia/rules/03a35104-bd29-4e6a-9cfc-7d75da41b27e.yml": invalid expression for rule "PMM Integrated Alerting"."/rule_id/03a35104-bd29-4e6a-9cfc-7d75da41b27e": cannot recognize "&lt; 1"; unparsed data: "mysql_global_status_uptime&lt; 1"
invalid group "PMM Integrated Alerting" in file "/etc/ia/rules/660fda01-27b9-42be-8892-095c2b1880dc.yml": invalid expression for rule "PMM Integrated Alerting"."/rule_id/660fda01-27b9-42be-8892-095c2b1880dc": cannot recognize "&lt; 50"; unparsed data: "100&lt; 50"
2021-02-12T19:12:44.787Z        error   /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.50.2/app/vmalert/group.go:245       group "PMM Integrated Alerting": rule "/rule_id/a1af9575-f6df-4700-ae76-93203d297644": failed to execute: failed to execute query "sum by (service_name, node_name) (mysql_up) == 0": error getting response from http://127.0.0.1:9090/prometheus/api/v1/query?query=sum+by+%28service_name%2C+node_name%29+%28mysql_up%29+%3D%3D+0: Post "http://127.0.0.1:9090/prometheus/api/v1/query?query=sum+by+%28service_name%2C+node_name%29+%28mysql_up%29+%3D%3D+0": EOF
2021-02-12T22:02:45.953Z        error   /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.50.2/app/vmalert/remotewrite/remotewrite.go:213     attempt 1 to send request failed: error while sending request to http://127.0.0.1:9090/prometheus/api/v1/write: Post "http://127.0.0.1:9090/prometheus/api/v1/write": EOF; Data len 325(325)

Can I replace “&lt;” with “<” and see if that works, or will I break IA? I see such errors for the mysql_status and out-of-memory templates, but at least the CPU alerts should have been working. I don’t see any errors related to them, but no alerts either… :confused:

Hi,

Yes, rules can be edited manually. The vmalert service has to be restarted afterwards.
Rules are located in the folder /etc/ia/rules.
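The manual edit can also be scripted. This is a minimal sketch, assuming the only damage is the HTML-escaped operators seen in the vmalert errors above (for example "&lt; 1") and that the rule files are the *.yml files in /etc/ia/rules. The function names are hypothetical, and it defaults to a dry run that only reports which files would change.

```python
from pathlib import Path

# HTML entities mapped back to the characters PromQL expects.
# "&amp;" is listed last so that "&amp;lt;" is unescaped only one level.
ENTITIES = {"&lt;": "<", "&gt;": ">", "&amp;": "&"}


def unescape_operators(text: str) -> str:
    """Replace the HTML entities above with their literal characters."""
    for entity, char in ENTITIES.items():
        text = text.replace(entity, char)
    return text


def fix_rules_dir(rules_dir: str = "/etc/ia/rules", dry_run: bool = True) -> list:
    """Return names of rule files containing escaped operators.

    The files are rewritten in place only when dry_run is False.
    """
    changed = []
    for path in sorted(Path(rules_dir).glob("*.yml")):
        original = path.read_text(encoding="utf-8")
        fixed = unescape_operators(original)
        if fixed != original:
            changed.append(path.name)
            if not dry_run:
                path.write_text(fixed, encoding="utf-8")
    return changed
```

Running `fix_rules_dir()` first lists the affected files; `fix_rules_dir(dry_run=False)` then rewrites them. Back up the directory beforehand, and restart vmalert afterwards as noted above. The thread does not say how to restart it, so that step depends on how services are managed on your PMM Server.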