We have changed our workflow for port notifications in a way I thought would minimize these false "got worse" alerts, but they are still occurring.
We use a rule with a 15-minute delay for notifications:
ports.ifOperStatus = "down" AND ports.ifOperStatus_prev = "up" AND macros.device_up = 1 AND ports.ifAdminStatus != "down"
It is used on switches where customers are connected. We inform them about an outage on their port only if it lasts a longer period of time, so we avoid dealing with notifications when a customer merely restarts their connected device.
We use the "Reset Port State" feature in the switch settings for ports that stay disconnected longer, but the NOC can't use it during the weekend.
We have this in the alert template:
@if ($alert->faults) Faults:
@foreach ($alert->faults as $key => $value)
#{{ $key }}:
Port: {{ $value['ifName'] }}
Port Name: {{ $value['ifAlias'] }}
Port Status: {{ $value['ifOperStatus'] }}
@endforeach
@endif
The delay works correctly when the first port on a switch goes DOWN and back UP within 15 minutes:
2024-04-23 21:55:43 Gi1/0/12 switch2 ifOperStatus: up → down System
2024-04-23 21:55:43 Gi1/0/12 switch2 ifDuplex: fullDuplex → unknown System
2024-04-23 22:00:52 Gi1/0/12 switch2 ifOperStatus: down → up System
2024-04-23 22:00:52 Gi1/0/12 switch2 ifDuplex: unknown → fullDuplex System
no notification was sent
Here is an example where one port stays DOWN for longer and a second port goes DOWN/UP:
2024-04-27 12:25:55 Gi2/0/48 switch1 ifDuplex: fullDuplex → unknown System
2024-04-27 12:30:55 Gi2/0/48 switch1 ifOperStatus: up → down System
2024-04-27 12:30:55 Gi2/0/48 switch1 ifSpeed: 1 Gbps → 10 Mbps System
2024-04-27 12:46:02 alert switch1 Issued warning alert for rule '060 Port DOWN' to transport 'mail' System
2024-04-27 12:46:02 alert switch1 Issued warning alert for rule '060 Port DOWN' to transport 'playsms' System
- correct notification, with port Gi2/0/48 in the mail body
2024-04-27 13:41:00 Gi2/0/15 switch1 ifOperStatus: up → down System
2024-04-27 13:41:00 Gi2/0/15 switch1 ifDuplex: fullDuplex → unknown System
2024-04-27 13:45:30 Gi2/0/15 switch1 ifOperStatus: down → up System
2024-04-27 13:45:30 Gi2/0/15 switch1 ifDuplex: unknown → fullDuplex System
2024-04-27 13:56:02 alert switch1 Issued got worse for rule '060 Port DOWN' to transport 'mail' System
2024-04-27 13:56:02 alert switch1 Issued got worse for rule '060 Port DOWN' to transport 'playsms' System
- false "got worse" notification, and only port Gi2/0/48 was in the mail body
2024-04-28 00:15:44 Gi2/0/32 switch1 ifOperStatus: up → down System
2024-04-28 00:15:44 Gi2/0/32 switch1 ifDuplex: fullDuplex → unknown System
2024-04-28 00:20:47 Gi2/0/32 switch1 ifOperStatus: down → up System
2024-04-28 00:20:47 Gi2/0/32 switch1 ifDuplex: unknown → fullDuplex System
2024-04-28 00:31:02 alert switch1 Issued got worse for rule '060 Port DOWN' to transport 'mail' System
2024-04-28 00:31:02 alert switch1 Issued got worse for rule '060 Port DOWN' to transport 'playsms' System
- false "got worse" notification, and again only port Gi2/0/48 was in the mail body
2024-04-29 09:48:49 switch1 Port state history reset by admin admin
2024-04-29 09:51:02 alert switch1 Issued recovery for rule ‘060 Port DOWN’ to transport ‘mail’ System
2024-04-29 09:51:02 alert switch1 Issued recovery for rule ‘060 Port DOWN’ to transport ‘playsms’ System
- cleared by me
It looks like a check is missing for whether the worse condition is still present before the "got worse" notification is sent. These false notifications are really confusing for our NOC colleagues.
From what I can see, the "got worse" condition is not granular; it only counts the results.
There are two ways around this: update the code to make sure it compares the same results (not sure how feasible that is), or update your alert rules to work around the issue.
@roman.vysin What bothers me about that example is that the current implementation should prevent this, unless a different port went down at the same time. It only counts the results of the alert query, so if a port flaps, the count should at most stay the same or drop compared to the previous iteration. A worsen condition always means that more results were found.
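The count-based comparison described above can be sketched roughly like this. This is a simplified model for illustration only, not the actual LibreNMS code; the state names are hypothetical:

```python
def classify_transition(prev_count: int, curr_count: int) -> str:
    """Classify an alert-rule re-evaluation purely by result count.

    Simplified model of count-based worsen/better detection: it only
    looks at how many rows the alert query returned, not which rows.
    """
    if curr_count > prev_count:
        return "worsen"     # more matching results than last time
    if curr_count < prev_count:
        return "better"     # fewer matching results than last time
    return "unchanged"

# One port stays down while a second port flaps: counts go 2 -> 1 -> 2.
history = [2, 1, 2]
transitions = [classify_transition(a, b) for a, b in zip(history, history[1:])]
print(transitions)  # ['better', 'worsen'] — the 'worsen' fires even though
                    # the set of down ports is the same as at the start
```

Under this model, a flap back to a previous count is indistinguishable from a genuinely new port going down, which would match the false "got worse" behaviour in the event log above.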
It would be helpful to get a copy of the alert_log entries: the details column contains gzip'd JSON with the raw results of all matching conditions. If you can match your event log to the entries in the alert_log table, you can unzip that column and compare the JSON entries; if the count is indeed higher than in the previous entry, then there is no false positive. If there is flapping where the count goes 2 > 1 > 2, that might actually cause a worsen notification regardless of the delay set.
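As a sketch of inspecting such a blob, assuming (per the description above) the details column holds compressed JSON with the raw query results. The exact compression format and JSON key names should be verified against a real row; the "rule" key and payload shape below are made up for the demonstration:

```python
import gzip
import json

def decode_details(blob: bytes) -> dict:
    """Decompress a details blob and parse the JSON inside."""
    return json.loads(gzip.decompress(blob).decode())

# Sample payload standing in for one alert_log row (hypothetical shape;
# compare against a real details column to see the actual keys stored).
details = {"rule": [{"ifName": "Gi2/0/48"}, {"ifName": "Gi2/0/15"}]}
blob = gzip.compress(json.dumps(details).encode())

decoded = decode_details(blob)
print(len(decoded["rule"]))  # number of matching results in this entry
```

Comparing that count between consecutive alert_log entries for the same rule would show whether a given "got worse" notification was justified.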
@murrant We could get rid of the whole grouping (the worsen/better feature) and treat each result as its own incident. That would create a lot of clutter, though, so it is probably best to make it a toggle per rule.
@murrant
I almost mitigated this by adding an exception to the alert rule like "AND ports.ifAlias NOT LIKE 'cust_service%'" so we can ignore ports of customers who regularly shut down their devices for longer periods of time, but it is not ideal.
@f0o
We can only confirm that there are entries in the database for the example above, but we are not able to get the data you mentioned.
Can you help us extract it from the database so we can check what's inside? We tried to search for it but found nothing usable, as no one here is a DB specialist.