Alert Escalation - Alert Routing

Hi All,

We are running LibreNMS to monitor a network with a lot of sites containing switches and routers. The WAN that connects the routers can be flaky, so we get quite a few false positives: a device goes down, an alert is sent, and then the device recovers and a recovery is issued soon after.

We mostly use email as the transport and we receive a lot of device down and recovery emails. This can lead to mistakes, as the amount of noise means an operator doesn’t notice when a device didn’t issue a recovery email.

We would like to implement something where, if the device hasn’t recovered and the alert hasn’t been acknowledged, a higher priority alert is issued so the problem can be escalated. My thinking was to transport it to something like Alerta so that it can be correlated and the priority raised if the state has continued.
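To make the idea concrete, here is a rough sketch of the escalation decision I have in mind. It is not LibreNMS or Alerta code: the `Alert` fields, the `needs_escalation` name and the 30 minute threshold are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    """Hypothetical alert record; the field names are assumptions."""
    device: str
    raised_at: datetime
    recovered: bool = False
    acknowledged: bool = False


def needs_escalation(alert: Alert, now: datetime,
                     threshold: timedelta = timedelta(minutes=30)) -> bool:
    """Escalate only if the alert is still open, unacknowledged and old enough."""
    if alert.recovered or alert.acknowledged:
        return False
    return now - alert.raised_at >= threshold
```

A short flap that recovers, or an alert an operator has already acknowledged, would never reach the higher priority; only alerts that have been ignored past the threshold would.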

I found the following when researching whether this has already been implemented.

Specifically the Routes and the monitoring of the alerts themselves to see whether they have been acknowledged. The GitHub issue shows that some of this functionality was being tested.

I have also seen that laf merged the code into alerts.inc.php.

The function that looks interesting is RunFollowUp

/**
 * Run Follow-Up alerts
 * @return void
 */
function RunFollowUp()

This appears to keep track of alerts and work out whether they have got better or worse, which we could then use for escalation purposes.

However, I can’t see this function used anywhere else, so I have a question:

This appears to be in the code but I’m not sure how we go about invoking it.
Can this be done from the Alert Rule builder, a template, or some sort of macro? Or do we need to take this further with PHP?

Thanks
Duncan

If your end device responds to ICMP, then you could change the alert rule to act on that instead of SNMP for device down. See: False positive device down - LibreNMS 1.38-48-gfbbc257

Or, you could increase the alert “delay” to 6m, to cover 2 x SNMP polls, which should greatly decrease false positives.
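As a rough illustration of why a longer delay helps (this is a sketch, not LibreNMS code): as I understand it, an outage that recovers before the delay expires never sends an alert. The function name and the sample outage durations below are made up.

```python
from datetime import timedelta


def outages_that_alert(outage_durations, delay=timedelta(minutes=6)):
    """Keep only the outages that are still ongoing when the delay expires."""
    return [d for d in outage_durations if d >= delay]


# Hypothetical outage lengths for one flaky site.
sample = [timedelta(minutes=2), timedelta(minutes=4), timedelta(minutes=45)]
alerting = outages_that_alert(sample)
```

With these sample values only the 45 minute outage would produce an alert; the two short flaps would be suppressed.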

Not sure on the other question, but I know OpsGenie and other transports have smart features, such as “If not acknowledged in 5 minutes, and after 5pm, ring my mobile” etc., which tie into the LibreNMS acknowledgements, I think.

Hi Chas,

Thanks for your comments. We already use delay and have tested with ping only. This still leads to a lot of false positives.

I am looking at other transport methods like Alerta, but I need to supply metadata for it to escalate. I am trying to pull this from Libre and get information about the state of the alerts themselves, but this doesn’t appear to be tracked; they are very much either triggered or not.

Duncan

If you’re getting false alerts with ping then you may need to adjust your ping timeouts.

Hi Kevin,

We have also tried adjusting the ping timeout, but as this is a global value, increasing it means we don’t complete the whole polling run in one cycle, which leads to other issues.

I appreciate you trying to help, but I think alert escalation would be very useful and I would like to understand where this functionality currently stands.

Duncan

We don’t have that functionality; we need somebody to code it. :slight_smile:


I’m not sure about the coding, but I see that when a device goes down there is a downtime counter,
so it could possibly look something like:

if macros.device_down = 1 && downtime < 60 mins,
then alert 1,
else
if macros.device_down = 1 && downtime >= 60 mins,
then alert 2
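
The pseudo-rule above could be sketched in Python roughly like this. It is only an illustration of the tiering logic, not something LibreNMS runs: `alert_tier` and `escalate_after` are made-up names, and I am assuming the downtime counter is available as a duration. Note that the two conditions should not both match at exactly 60 minutes.

```python
from datetime import timedelta


def alert_tier(device_down: bool, downtime: timedelta,
               escalate_after: timedelta = timedelta(minutes=60)) -> int:
    """Return 0 (no alert), 1 (normal alert) or 2 (escalated alert)."""
    if not device_down:
        return 0
    # Below the threshold stay on alert 1; at or beyond it move to alert 2.
    return 1 if downtime < escalate_after else 2
```

So a device that has been down for 10 minutes would stay on alert 1, and one that has been down for 60 minutes or more would move to alert 2.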