We are running LibreNMS to monitor a network with a lot of sites, each with switches and routers. The WAN that connects to the routers can be flaky, so we get quite a few false positives: a device goes down, an alert is sent, the device recovers, and a recovery is issued quite soon after.
We mostly use email as the transport, and we receive a lot of device down and recovery emails. This can lead to mistakes, as the amount of noise means an operator doesn’t notice when a device didn’t issue a recovery email.
We would like to implement something where, if the device has neither recovered nor been acknowledged, a higher priority alert is issued so the problem can be escalated. My thinking was to transport it to something like Alerta so that it can be correlated and the priority raised if the state has continued.
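To make the idea concrete, here is a minimal sketch of the escalation rule I have in mind. It is not LibreNMS code: the `DeviceAlert` record, the `escalation_severity` function, the severity names and the 30-minute threshold are all hypothetical and only illustrate the logic.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class DeviceAlert:
    """Hypothetical record of one 'device down' alert (not a LibreNMS structure)."""
    device: str
    raised_at: datetime
    recovered: bool = False
    acknowledged: bool = False


def escalation_severity(
    alert: DeviceAlert,
    now: datetime,
    escalate_after: timedelta = timedelta(minutes=30),  # assumed threshold
) -> Optional[str]:
    """Return the severity to send, or None if nothing needs sending."""
    if alert.recovered or alert.acknowledged:
        # The device is back, or an operator has already picked it up.
        return None
    if now - alert.raised_at >= escalate_after:
        # Still down and still unacknowledged: raise the priority.
        return "critical"
    return "warning"


raised = datetime(2024, 1, 1, 9, 0)
alert = DeviceAlert("site-router-01", raised)
print(escalation_severity(alert, raised + timedelta(minutes=5)))
print(escalation_severity(alert, raised + timedelta(minutes=45)))
```

The point is that a short flap never reaches the higher severity, while an outage that nobody has acknowledged eventually does.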
I found the following when researching whether this has already been implemented.
Specifically the Routes, and the monitoring of the alerts themselves to see whether they have been acknowledged. The GitHub issue shows that some of this functionality was being tested.
I have also seen that laf merged the code into alerts.inc.php.
The function that looks interesting is RunFollowUp:
/**
 * Run Follow-Up alerts
 * @return void
 */
function RunFollowUp()
This appears to keep track of alerts and work out whether they have got better or worse, which we could then use for escalation purposes.
However, I can’t see this function used anywhere else, so I have a question:
This appears to be in the code, but I’m not sure how we go about invoking it.
Can this be done from the Alert Rule builder, a template, or some sort of macro? Or do we need to take this further with PHP?
Or, you could increase the alert “delay” to 6m, to cover 2 x SNMP polls, which should greatly decrease false positives.
Not sure about the other question, but I know OpsGenie and other transports have smart features, such as “If not acknowledged in 5 minutes, and after 5pm, ring my mobile”, which I think tie into the LibreNMS acknowledgments.
Thanks for your comments. We already use delay and have tested with ping only. This still leads to a lot of false positives.
I am looking at other transport methods like Alerta, but I need to supply metadata to it for it to escalate. I am trying to pull this from LibreNMS and get information about the state of the alerts themselves, but this doesn’t appear to be tracked; alerts are very much either triggered or not.
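As a rough illustration of the metadata I mean, the sketch below builds the JSON body I would expect to send to Alerta. The top-level field names (`resource`, `event`, `environment`, `severity`, `service`, `text`, `attributes`) follow Alerta's alert format as I understand it and should be checked against your Alerta version. The function name, the 30-minute threshold and the `acknowledged` / `minutesOpen` attributes are my own assumptions, and are exactly the state I can't currently get out of LibreNMS.

```python
import json
from typing import Any, Dict


def build_alerta_payload(
    hostname: str,
    rule_name: str,
    is_down: bool,
    minutes_open: int,
    acknowledged: bool,
) -> Dict[str, Any]:
    """Build an Alerta-style alert body (hypothetical helper, not a LibreNMS API)."""
    if not is_down:
        severity = "normal"    # recovery: clears the alert in Alerta
    elif not acknowledged and minutes_open >= 30:  # assumed escalation threshold
        severity = "critical"  # long-running and nobody has acknowledged it
    else:
        severity = "minor"     # fresh, or already being handled
    return {
        "resource": hostname,
        "event": rule_name,
        "environment": "Production",
        "severity": severity,
        "service": ["Network"],
        "text": f"{rule_name} on {hostname}, open for {minutes_open} min",
        # Custom attributes: names are illustrative only.
        "attributes": {"acknowledged": acknowledged, "minutesOpen": minutes_open},
    }


payload = build_alerta_payload("site-router-01", "DeviceDown", True, 45, False)
print(json.dumps(payload, indent=2))
```

Sending the same `resource` and `event` again with a different `severity` is what should let Alerta treat it as one alert changing state, rather than a new one each time.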
We have also tried adjusting the ping timeout, but as this is a global value, increasing it means we don’t complete the whole polling run in one cycle, which leads to other issues.
I appreciate you trying to help, but I think alert escalation would be very useful, and I would like to understand where this functionality currently stands.