We have changed our workflow for port notifications in a way I thought would minimize these false "got worse" alerts, but they are still occurring.
We use a rule with a 15-minute delay for notifications:
ports.ifOperStatus = "down" AND ports.ifOperStatus_prev = "up" AND macros.device_up = 1 AND ports.ifAdminStatus != "down"
It is used on switches where customers are connected. We inform them about an outage on their port only if it lasts a longer period of time, so we avoid dealing with notifications when a customer merely restarts their connected device.
We use the "Reset Port State" feature in the switch settings for ports that stay disconnected longer, but the NOC can't use it during the weekend.
We have this in the alert template:
@if ($alert->faults) Faults:
@foreach ($alert->faults as $key => $value)
#{{ $key }}:
Port: {{ $value['ifName'] }}
Port Name: {{ $value['ifAlias'] }}
Port Status: {{ $value['ifOperStatus'] }}
@endforeach
@endif
The delay works correctly when the first port on a switch goes DOWN and back UP within 15 minutes:
2024-04-23 21:55:43 Gi1/0/12 switch2 ifOperStatus: up → down System
2024-04-23 21:55:43 Gi1/0/12 switch2 ifDuplex: fullDuplex → unknown System
2024-04-23 22:00:52 Gi1/0/12 switch2 ifOperStatus: down → up System
2024-04-23 22:00:52 Gi1/0/12 switch2 ifDuplex: unknown → fullDuplex System
no notification was sent
Here is an example where one port stays DOWN for longer and a second port goes DOWN/UP:
2024-04-27 12:25:55 Gi2/0/48 switch1 ifDuplex: fullDuplex → unknown System
2024-04-27 12:30:55 Gi2/0/48 switch1 ifOperStatus: up → down System
2024-04-27 12:30:55 Gi2/0/48 switch1 ifSpeed: 1 Gbps → 10 Mbps System
2024-04-27 12:46:02 alert switch1 Issued warning alert for rule '060 Port DOWN' to transport 'mail' System
2024-04-27 12:46:02 alert switch1 Issued warning alert for rule '060 Port DOWN' to transport 'playsms' System
- correct notification, with port Gi2/0/48 in the mail body
2024-04-27 13:41:00 Gi2/0/15 switch1 ifOperStatus: up → down System
2024-04-27 13:41:00 Gi2/0/15 switch1 ifDuplex: fullDuplex → unknown System
2024-04-27 13:45:30 Gi2/0/15 switch1 ifOperStatus: down → up System
2024-04-27 13:45:30 Gi2/0/15 switch1 ifDuplex: unknown → fullDuplex System
2024-04-27 13:56:02 alert switch1 Issued got worse for rule '060 Port DOWN' to transport 'mail' System
2024-04-27 13:56:02 alert switch1 Issued got worse for rule '060 Port DOWN' to transport 'playsms' System
- false "got worse" notification, and only port Gi2/0/48 was in the mail body
2024-04-28 00:15:44 Gi2/0/32 switch1 ifOperStatus: up → down System
2024-04-28 00:15:44 Gi2/0/32 switch1 ifDuplex: fullDuplex → unknown System
2024-04-28 00:20:47 Gi2/0/32 switch1 ifOperStatus: down → up System
2024-04-28 00:20:47 Gi2/0/32 switch1 ifDuplex: unknown → fullDuplex System
2024-04-28 00:31:02 alert switch1 Issued got worse for rule '060 Port DOWN' to transport 'mail' System
2024-04-28 00:31:02 alert switch1 Issued got worse for rule '060 Port DOWN' to transport 'playsms' System
- false "got worse" notification, and again only port Gi2/0/48 was in the mail body
2024-04-29 09:48:49 switch1 Port state history reset by admin admin
2024-04-29 09:51:02 alert switch1 Issued recovery for rule ‘060 Port DOWN’ to transport ‘mail’ System
2024-04-29 09:51:02 alert switch1 Issued recovery for rule ‘060 Port DOWN’ to transport ‘playsms’ System
- cleared by me
It looks like a check is missing for whether the worse condition is still present before the "got worse" notification is sent. These false notifications are really confusing for our NOC colleagues.
From what I can see, the "got worse" condition is not granular; it only counts the results.
There are two ways around this: update the code to make sure it compares the same results (not sure how feasible that is), or update your alert rules to work around the issue.
@roman.vysin What bothers me about that example is that the current implementation should prevent this, unless a different port went down at the same time. It only counts the results of the alert query, so if a port flaps, the count should at most stay the same or drop compared to the previous iteration. A worsen condition always means that more results were found.
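The count-based comparison described above can be sketched roughly like this. This is a simplified model for illustration only, not the actual LibreNMS code; the state names are hypothetical:

```python
def classify_transition(prev_count: int, curr_count: int) -> str:
    """Classify an alert-rule re-evaluation purely by result count.

    Simplified model of count-based worsen/better detection: it only
    looks at how many rows the alert query returned, not which rows.
    """
    if curr_count > prev_count:
        return "worsen"     # more matching results than last time
    if curr_count < prev_count:
        return "better"     # fewer matching results than last time
    return "unchanged"

# One port stays down while a second port flaps: counts go 2 -> 1 -> 2.
history = [2, 1, 2]
transitions = [classify_transition(a, b) for a, b in zip(history, history[1:])]
print(transitions)  # ['better', 'worsen'] — the 'worsen' fires even though
                    # the set of down ports is the same as at the start
```

Under this model, a flap back to a previous count is indistinguishable from a genuinely new port going down, which would match the false "got worse" behaviour in the event log above.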
It would be helpful to get a copy of the alert_log entries: the details column contains gzip'd JSON with the raw results of all matching conditions. If you can match your event log to the entries in the alert_log table, you can unzip that column and compare the JSON entries; if the count is indeed higher than in the previous entry, then there is no false positive. If there is flapping where the count goes 2 > 1 > 2, that might actually cause a worsen notification regardless of the delay set.
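As a sketch of inspecting such a blob, assuming (per the description above) the details column holds compressed JSON with the raw query results. The exact compression format and JSON key names should be verified against a real row; the "rule" key and payload shape below are made up for the demonstration:

```python
import gzip
import json

def decode_details(blob: bytes) -> dict:
    """Decompress a details blob and parse the JSON inside."""
    return json.loads(gzip.decompress(blob).decode())

# Sample payload standing in for one alert_log row (hypothetical shape;
# compare against a real details column to see the actual keys stored).
details = {"rule": [{"ifName": "Gi2/0/48"}, {"ifName": "Gi2/0/15"}]}
blob = gzip.compress(json.dumps(details).encode())

decoded = decode_details(blob)
print(len(decoded["rule"]))  # number of matching results in this entry
```

Comparing that count between consecutive alert_log entries for the same rule would show whether a given "got worse" notification was justified.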
@murrant We could get rid of the whole grouping (the worsen/better feature) and treat each result as its own incident. That would create a lot of clutter, though, so it is probably best to make it a toggle per rule.
@murrant
I almost mitigated this by adding an exception to the alert rule like "AND ports.ifAlias NOT LIKE 'cust_service%'" so we can ignore ports of customers who regularly shut down their devices for longer periods of time, but it is not ideal.
@f0o
We can only confirm that there are entries in the database for the example above, but we are not able to get the data you mentioned.
Can you help us extract it from the database so we can check what's inside? We tried to search for it but found nothing usable, as no one here is a DB specialist.