False "got worse" notification

Hello,

I have a follow-up for this issue:

We have changed our workflow for port notifications in a way that I thought would minimize these false "got worse" notifications, but they are still occurring.

We use a rule with a 15-minute delay for notifications:
ports.ifOperStatus = "down" AND ports.ifOperStatus_prev = "up" AND macros.device_up = 1 AND ports.ifAdminStatus != "down"

  • it is used on switches where customers are connected; we inform them about an outage of their port only if it lasts for a longer period of time, so we avoid dealing with notifications when a customer just restarts their connected device
  • we use the "Reset Port State" feature in the switch settings for ports that are disconnected for longer, but NOC can't use it during the weekend

We have this in the alert template:
@if ($alert->faults) Faults:

@foreach ($alert->faults as $key => $value)
#{{ $key }}:
Port: {{ $value['ifName'] }}
Port Name: {{ $value['ifAlias'] }}
Port Status: {{ $value['ifOperStatus'] }}
@endforeach
@endif

The delay itself works correctly if the first port on a switch goes DOWN and back UP within 15 minutes:
2024-04-23 21:55:43 Gi1/0/12 switch2 ifOperStatus: up → down System
2024-04-23 21:55:43 Gi1/0/12 switch2 ifDuplex: fullDuplex → unknown System
2024-04-23 22:00:52 Gi1/0/12 switch2 ifOperStatus: down → up System
2024-04-23 22:00:52 Gi1/0/12 switch2 ifDuplex: unknown → fullDuplex System

  • no notification was sent

Here is an example where one port has been DOWN for a longer time and a second port goes DOWN/UP:

2024-04-27 12:25:55 Gi2/0/48 switch1 ifDuplex: fullDuplex → unknown System
2024-04-27 12:30:55 Gi2/0/48 switch1 ifOperStatus: up → down System
2024-04-27 12:30:55 Gi2/0/48 switch1 ifSpeed: 1 Gbps → 10 Mbps System
2024-04-27 12:46:02 alert switch1 Issued warning alert for rule ‘060 Port DOWN’ to transport ‘mail’ System
2024-04-27 12:46:02 alert switch1 Issued warning alert for rule ‘060 Port DOWN’ to transport ‘playsms’ System
- correct notification, with port Gi2/0/48 in the mail text

2024-04-27 13:41:00 Gi2/0/15 switch1 ifOperStatus: up → down System
2024-04-27 13:41:00 Gi2/0/15 switch1 ifDuplex: fullDuplex → unknown System
2024-04-27 13:45:30 Gi2/0/15 switch1 ifOperStatus: down → up System
2024-04-27 13:45:30 Gi2/0/15 switch1 ifDuplex: unknown → fullDuplex System
2024-04-27 13:56:02 alert switch1 Issued got worse for rule ‘060 Port DOWN’ to transport ‘mail’ System
2024-04-27 13:56:02 alert switch1 Issued got worse for rule ‘060 Port DOWN’ to transport ‘playsms’ System
- false "got worse" notification; only port Gi2/0/48 was in the mail text

2024-04-28 00:15:44 Gi2/0/32 switch1 ifOperStatus: up → down System
2024-04-28 00:15:44 Gi2/0/32 switch1 ifDuplex: fullDuplex → unknown System
2024-04-28 00:20:47 Gi2/0/32 switch1 ifOperStatus: down → up System
2024-04-28 00:20:47 Gi2/0/32 switch1 ifDuplex: unknown → fullDuplex System
2024-04-28 00:31:02 alert switch1 Issued got worse for rule ‘060 Port DOWN’ to transport ‘mail’ System
2024-04-28 00:31:02 alert switch1 Issued got worse for rule ‘060 Port DOWN’ to transport ‘playsms’ System
- false "got worse" notification; only port Gi2/0/48 was in the mail text

2024-04-29 09:48:49 switch1 Port state history reset by admin admin
2024-04-29 09:51:02 alert switch1 Issued recovery for rule ‘060 Port DOWN’ to transport ‘mail’ System
2024-04-29 09:51:02 alert switch1 Issued recovery for rule ‘060 Port DOWN’ to transport ‘playsms’ System
- cleared by me

It looks like a check is missing for whether the worse condition is still present before the "got worse" notification is sent. These false notifications are really confusing for our NOC colleagues.

Can you look at this issue?

Thanks

Roman

I can see that the got worse condition is not granular; it only counts the results.
There are two ways around this: update the code to make sure it is the same set of results (not sure how feasible that is), or update your alert rules to work around the issue.

@roman.vysin What bothers me in that example is that the current implementation should prevent that, unless a different port went down at the same time. It only counts the results of the alert query, so if a port flaps, the count should at most stay the same or drop compared to the previous iteration. Worsen conditions always mean that more results were found.

It would be beneficial to get a copy of the alert_log entries, which contain gzip'd JSON in the details column with the raw results of all matching conditions. If you can match your eventlog to the entries of the alert_log table, you can unzip that column and compare the JSON entries; if the count is indeed higher than the previous entry, there is no false positive. If there is flapping where it goes 2 > 1 > 2, that might actually cause a worsen notification regardless of the delay set.
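For illustration, here is a minimal sketch of that counting behaviour in Python (not the actual LibreNMS code, just the logic described above):

# Sketch of the counting behaviour described above, not the actual
# LibreNMS implementation: each poll only compares how many results
# the rule query returned against the previous poll.
def classify(prev_count: int, current_count: int) -> str:
    if current_count == 0:
        return "recovery"
    if prev_count == 0:
        return "alert"
    if current_count > prev_count:
        return "got worse"
    if current_count < prev_count:
        return "got better"
    return "unchanged"

# A flap captured by the polls as 2 -> 1 -> 2 matching ports yields a
# "got worse", even though the same ports are involved:
counts = [2, 1, 2]
for prev, cur in zip(counts, counts[1:]):
    print(f"{prev} -> {cur}: {classify(prev, cur)}")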

@murrant We could get rid of the whole grouping (the worsen/betters feature) and treat each result as its own incident; it would create loads of clutter as a result though, so it is probably best to make it a toggle per rule.

Just my 2 cents

Thanks for both comments.

@murrant
I almost mitigated this by adding an exception to the alert rule like "AND ports.ifAlias NOT LIKE 'cust_service%'" so we can ignore ports of customers who regularly shut down their devices for longer periods of time, but it is not ideal.

@f0o
We can only confirm that there are entries in the database for the example above, but we are not able to get the data you mentioned:

| id | rule_id | device_id | state | detail | time_logged |
|-------|---------|-----------|-------|----------------|---------------------|
| 64220 | 17 | 223 | 1 | random letters | 2024-04-27 12:30:55 |
| 64221 | 17 | 223 | 3 | random letters | 2024-04-27 13:41:01 |
| 64224 | 17 | 223 | 3 | random letters | 2024-04-28 00:16:01 |
| 64235 | 17 | 223 | 0 | x▒ | 2024-04-29 09:50:38 |

There are no entries for the times when the ports went UP.

Can you help us extract that data from the database so we can check what's inside? We tried to search for a way to do it but found nothing usable, as no one here is a DB specialist.

The detail column contains GZip compressed bytes that decompress into JSON. That result is what’s being counted for the better/worse conditions.

Extracting it might be a bit finicky, so ideally whip up some python/bash/xyz script to pull it, uncompress it, and dump the raw JSON for you.

Unfortunately we do not provide any helper-scripts to do this since it’s rarely needed to peer into these ungodly depths :sweat_smile:
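As a rough starting point, a throwaway script along these lines could work (a sketch only, assuming Python 3 with the pymysql package and the rule_id/device_id values from your table above; adjust the credentials, and check whether your schema names the column detail or details):

#!/usr/bin/env python3
# Sketch: decompress the compressed JSON stored in alert_log and dump it.
# Assumptions: pymysql is installed, the credentials below are placeholders,
# and the column is named "detail" as in the table output above.
import json
import zlib

import pymysql

conn = pymysql.connect(host="localhost", user="librenms",
                       password="secret", database="librenms")
try:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, time_logged, detail FROM alert_log "
            "WHERE rule_id = %s AND device_id = %s ORDER BY id",
            (17, 223),
        )
        for row_id, time_logged, blob in cur.fetchall():
            # wbits=47 auto-detects both zlib and gzip wrappers
            raw = zlib.decompress(blob, zlib.MAX_WBITS | 32)
            data = json.loads(raw)
            print(f"--- alert_log id {row_id} at {time_logged} ---")
            print(json.dumps(data, indent=2))
finally:
    conn.close()

You can then line up each dump with your eventlog timestamps and count how many ports appear in each entry to see whether the worsen notification was based on stale results.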