Alert Escalation - Alert Routing

superchunk2000 · 18 March 2019 15:43

Hi All,

We are running LibreNMS to monitor a network with a lot of sites containing switches and routers. The WAN that connects to the routers can be flaky, so we get quite a few false positives: a device goes down, alerts are sent, then the device recovers and a recovery is issued quite soon after.

We mostly use email as the transport and we receive a lot of device down and recovery emails. This can lead to mistakes, as the amount of noise means an operator doesn’t notice when a device didn’t issue a recovery email.

We would like to implement something where, if a device has neither recovered nor been acknowledged, a higher-priority alert is issued so the problem can be escalated. My thinking was to transport it to something like Alerta so that it can be correlated and the priority raised if the state has continued.
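A minimal Python sketch of that escalation check, for illustration only. It is not LibreNMS or Alerta code: the `Alert` class, its fields, and the 15-minute threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Alert:
    """Hypothetical record of one open alert (not a LibreNMS structure)."""
    device: str
    raised_at: datetime
    recovered_at: Optional[datetime] = None
    acknowledged: bool = False


def should_escalate(alert: Alert, now: datetime,
                    after: timedelta = timedelta(minutes=15)) -> bool:
    """Escalate only if the alert is still open, unacknowledged and old enough."""
    if alert.recovered_at is not None or alert.acknowledged:
        return False
    return now - alert.raised_at >= after
```

An alert that recovers quickly or is acknowledged never escalates, so only the devices that stay down unnoticed would produce the higher-priority notification.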

I found the following when researching whether this has already been implemented.

github.com/librenms/librenms

Next Major Alerting-Update

opened 03:05PM - 20 Jun 15 UTC

closed 09:02PM - 07 May 17 UTC

f0o

Alerting

I'm working on resolving these issues: #897 #995 #1084

The end of the journey will be system that can provide basic routing of the notifications. To clarify, at first I do not intend to route receivers of notifications. The user will be able to change or discard the transport of the notification based on metadata like `interation count` or `notification age` and other metrics (Feedback wished) or other rules.

Examples:
- if rule `device-status` fails, dont bother sending alerts for the rest of the rules applied to the device
- if the notification has been sent every minute for the past 5 minutes, increase the interval to X
- if the notification hasn't been ack'd after 15 minutes from the incident, use the SMS-transport to escalate

As usual I will push all beta code in my repo at https://github.com/f0o/glowing-tyrion I will make an autobuild on our CI, FYI. I cannot give an ETA, this all happens in my `free-time` between job and uni. Feedback & Comments appreciated. Feel free to use this issue as wishlist.

**In Tests:**
- Message Routing
- Add `elseif` control into Templates
- Select arbitrary data from within a Template

**In Progress:**
- Message Routing
- Overall Query and Control Language Remake
- ~~HTML & Subject definitions within Templates (Perhaps with logic)~~

**Wishlist:**
- Last RRD-Field data alert

github.com

f0o/glowing-tyrion/blob/master/DEVELOPMENT.md

Table of Content:
- [Database](#db)
  - [`alerts`](#db-alerts)
  - [`alert_log`](#db-alert_log)
  - [`alert_rules`](#db-alert_rules)
  - [`alert_map`](#db-alert_map)
  - [`alert_schedule`](#db-alert_schedule)
  - [`alert_schedule_items`](#db-alert_schedule_items)
  - [`alert_templates`](#db-alert_templates)
  - [`alert_templates_map`](#db-alert_templates_map)
- [Files](#files)
  - [`alerts.php`](#files-alerts.php)
  - [`alerts.inc.php`](#files-alerts.inc.php)

# <a name="db">Database</a>

## <a name="db-alerts">Table: `alerts`</a>

Holds an overview of all current states per rule per device.

This file has been truncated.

Specifically, the routing and the monitoring of the alerts themselves to see whether they have been acknowledged. The GitHub issue shows that some of this functionality was being tested.

I have also seen that laf merged the code into alerts.inc.php.

The function that looks interesting is RunFollowUp:

/**
 * Run Follow-Up alerts
 * @return void
 */
function RunFollowUp()

This appears to keep track of alerts and work out whether they have got better or worse, which we could then use for escalation purposes.

However, I can’t see this function used anywhere else, so I have a question.

This appears to be in the code, but I’m not sure how we go about invoking it.
Can this be done from the Alert Rule builder, a template, or some sort of macro? Or do we need to take this further with PHP?

Thanks
Duncan

Chas · 18 March 2019 15:59

If your end device responds to ICMP, then you could change the alert rule to act on that instead of SNMP for device down. See the thread “False positive device down - LibreNMS 1.38-48-gfbbc257”.

Or, you could increase the alert “delay” to 6m, to cover 2 x SNMP polls, which should greatly decrease false positives.

Not sure on the other question, but I know OpsGenie and other transports have smart features, such as “If not acknowledged in 5 minutes, and after 5pm, ring my mobile”, which I think tie into the LibreNMS acknowledgments.

superchunk2000 · 18 March 2019 16:27

Hi Chas,

Thanks for your comments. We already use delay and have tested with ping only. This still leads to a lot of false positives.

I am looking at other transport methods like Alerta, but I need to supply metadata to it to get it to escalate. I am trying to pull this from Libre and get information about the state of the alerts themselves, but this doesn’t appear to be tracked; alerts are very much either triggered or not.

Duncan

Kevin_Krumm · 18 March 2019 17:02

If you’re getting false alerts with ping then you may need to adjust your ping timeouts.

superchunk2000 · 18 March 2019 18:50

Hi Kevin,

We have also tried adjusting the ping timeout, but as this is a global value, increasing it means we don’t complete the whole polling run in one cycle, which leads to other issues.

I appreciate you trying to help, but I think alert escalation would be very useful and I would like to understand where this functionality currently stands.

Duncan

Kevin_Krumm · 18 March 2019 19:23

We don’t have that functionality; we need somebody to code it.

Yusuf-1978 · 11 August 2019 09:02

I’m not sure about the coding, but I see that when a device goes down there is a downtime counter,
so it could possibly look something like:

if macros.device_down = 1 && downtime < 60 mins
    then alert 1
else if macros.device_down = 1 && downtime >= 60 mins
    then alert 2
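The same tiering can be sketched in Python to pin down the boundary at exactly 60 minutes. This is only an illustration of the logic, not something the LibreNMS rule builder accepts; the function name `alert_level`, its parameters, and the numeric levels are hypothetical.

```python
def alert_level(device_down: bool, downtime_minutes: float,
                escalate_after: float = 60) -> int:
    """Return 0 (no alert), 1 (normal alert) or 2 (escalated alert)."""
    if not device_down:
        return 0
    # Strictly less than the threshold stays at level 1, so the two
    # tiers never overlap when downtime is exactly 60 minutes.
    return 1 if downtime_minutes < escalate_after else 2
```

Each level could then be mapped to a different transport, for example email for level 1 and something more intrusive for level 2.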