Alerts/acknowledgement for device entities not the whole device

pobradovic08 · 12 March 2019 15:00

Hi,

Not sure if this was answered before but I couldn’t find it. Anyway, I’m having trouble configuring alerts as I would like. Alert rules and transports are working fine, but i can’t seem to figure out some things. Let’s take a router/switch for example with configured alert rule if a port is admin up and operationally down. Two major issues we have at the moment are:

We can’t ack alert for specific port, just the whole device.
Having a device in already alerted state (has one port down), if another port goes down we don’t get another alert for that second port. (We have just one time alerts, without repeat).

It looks like the alerting revolves around a device not smaller entities that belong to that device. Is my understanding correct or am I missing something that would allow me to address those two issues i mentioned?

There are situations where this is very important, for example in ISP environment for access switches. There are times when a customer goes down (severed fiber, dead CPE, etc), we’re aware of the problem and we want to ack just that one interface, but we need to get alerts if some other port goes down or something else goes wrong with the device (because those are kind of unrelated issues).

Thanks!
Pavle

Chas · 12 March 2019 15:24

Ack should be per alert id, not per device.

Can you post the screenshot for the alert rule you are using for port down? including the max interval delay settings.

PipoCanaja · 12 March 2019 20:33

Hi @pobradovic08,
As @Chas said, the alert is “per alert_rule” AND “per device”. So if you create an alert_rule for your core interfaces and another for your customer(s) interfaces, then you can easily achieve this.
PipoCanaja

pobradovic08 · 13 March 2019 11:18

Hi guys,

Thanks for the info! I didn’t know exactly how it works I understand it better now. @Chas, here’s the screenshot of the alert rule:

@PipoCanaja, if I now understand this correctly, even if we have two separate alert checks for infrastructure and customer facing interfaces, that still wouldn’t allow us to get separate alerts for each customer that goes down. Same goes for infrastructure interfaces, if we have just one alert rule for all of them and have a portchannel with 4 interfaces, we still don’t have alerts for those interfaces separately. Does that mean that we have to make a separate alert check for each interface? I think we currently have over 1000 physical interfaces that are up, so that would be a lot of overhead.

Is there a reason why alerts have to be grouped that way and not going more specific to port_id (or some other ID if the alert is not port related)? I will make a feature request explaining what I think should be improved and how, but for now I just want to understand how the alerting system is intended to be used, so I can put myself in proper mindset

Thanks,
Pavle.

PipoCanaja · 13 March 2019 16:15

You would basically need one alert rule “per entity” (customer, backbone, etc) and proper identification on the port descriptions so that you can recognize which port belongs to which entity. That does not scale extremely well I must agree, but this is the way I use LibreNMS right now and it allows proper deduplication without masking stuff between customers and core etc, so that the email notifications are a good approximation of the network state, and give a good hint on the need to connect to the GUI urgently or not.

A few more points:

do not forget that the alerts in the GUI are real time updated, so when you connect to GUI you have all data available.
If you ack the error, any change making it worse will re-trigger an alert, so you will not miss any alerts if your on duty team uses the ack feature.