Missing recoveries


I’ve been working on a new Transport:

And I’ve noticed that not all my incidents are being closed when the alert recovers.
It seems quite random and looks identical to:

I have ‘max alert’ set to 1 as well.

Because my alert transport contains quite a few debugging lines, I’m fairly confident that LibreNMS isn’t calling deliverAlert() for the missing recoveries.

For example, the log output is (removed UUIDs for brevity):


TOPdesk: TopDesk UUID (M2403 539) created for LibreNMS incident 2315 and UID 132985  
TOPdesk: TopDesk UUID (M2403 540) created for LibreNMS incident 2315 and UID 132985  
TOPdesk: TopDesk UUID (M2403 541) created for LibreNMS incident 25568 and UID 145130  
TOPdesk: TopDesk UUID (M2403 542) created for LibreNMS incident 25575 and UID 145131  
TOPdesk: TopDesk UUID (M2403 543) created for LibreNMS incident 2178 and UID 145134  
TOPdesk: TopDesk UUID (M2403 544) created for LibreNMS incident 11947 and UID 145132  
TOPdesk: TopDesk UUID (M2403 545) created for LibreNMS incident 25553 and UID 145133  
TOPdesk: LibreNMS Alert 25568 recovered. Closing..  
TOPdesk: TopDesk Incident (M2403 541) will be closed (Libre ID: 25568)  
TOPdesk: LibreNMS Alert 25575 recovered. Closing..  
TOPdesk: TopDesk Incident (M2403 542) will be closed (Libre ID: 25575)  
TOPdesk: LibreNMS Alert 2178 recovered. Closing..  
TOPdesk: TopDesk Incident (M2403 543) will be closed (Libre ID: 2178)

To clarify:

  • All 7 incidents are resolved according to LibreNMS
  • Only 3 seem to have called deliverAlert() upon recovery
  • All 7 incidents occurred at about the same time, and so did the recoveries (the devices are all behind a single device that went down), so possibly a race condition?
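
To make the mismatch explicit, here is a minimal sketch that cross-checks the log excerpt above and lists the tickets that were created but never got a "will be closed" line. The regular expressions and the function name unclosed_tickets are illustrative assumptions based only on the log format shown here; they are not part of LibreNMS or the transport.

```python
import re

# Log excerpt copied from the transport output above.
LOG = """\
TOPdesk: TopDesk UUID (M2403 539) created for LibreNMS incident 2315 and UID 132985
TOPdesk: TopDesk UUID (M2403 540) created for LibreNMS incident 2315 and UID 132985
TOPdesk: TopDesk UUID (M2403 541) created for LibreNMS incident 25568 and UID 145130
TOPdesk: TopDesk UUID (M2403 542) created for LibreNMS incident 25575 and UID 145131
TOPdesk: TopDesk UUID (M2403 543) created for LibreNMS incident 2178 and UID 145134
TOPdesk: TopDesk UUID (M2403 544) created for LibreNMS incident 11947 and UID 145132
TOPdesk: TopDesk UUID (M2403 545) created for LibreNMS incident 25553 and UID 145133
TOPdesk: LibreNMS Alert 25568 recovered. Closing..
TOPdesk: TopDesk Incident (M2403 541) will be closed (Libre ID: 25568)
TOPdesk: LibreNMS Alert 25575 recovered. Closing..
TOPdesk: TopDesk Incident (M2403 542) will be closed (Libre ID: 25575)
TOPdesk: LibreNMS Alert 2178 recovered. Closing..
TOPdesk: TopDesk Incident (M2403 543) will be closed (Libre ID: 2178)
"""

# Hypothetical patterns matching the two line shapes in the excerpt.
CREATED = re.compile(
    r"TopDesk UUID \((?P<ticket>[^)]+)\) created for LibreNMS incident (?P<incident>\d+)"
)
CLOSED = re.compile(r"TopDesk Incident \((?P<ticket>[^)]+)\) will be closed")


def unclosed_tickets(log: str) -> dict[str, int]:
    """Return {ticket: LibreNMS incident id} for tickets with no close line."""
    opened: dict[str, int] = {}
    closed: set[str] = set()
    for line in log.splitlines():
        if m := CREATED.search(line):
            opened[m["ticket"]] = int(m["incident"])
        elif m := CLOSED.search(line):
            closed.add(m["ticket"])
    return {t: i for t, i in opened.items() if t not in closed}


if __name__ == "__main__":
    for ticket, incident in unclosed_tickets(LOG).items():
        print(f"{ticket}: no recovery logged for LibreNMS incident {incident}")
```

Run against this excerpt, it reports the tickets M2403 539, 540, 544 and 545 as never closed, which matches the 3 recoveries out of 7 tickets noted above.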

The line “TOPdesk: LibreNMS Alert x recovered. Closing..” is almost the first thing my transport does in deliverAlert(), so I don’t think this can be a bug in my transport at this stage. It’s here:

I’d like to help find the cause of this issue, but I’d appreciate some suggestions from someone who has a better understanding of LibreNMS’s inner workings.

For example, the ticket M2403 545 / incident 25553 / alert_log 145133 does have a recovery entry in the alert_log table:

MariaDB [librenms]> select id, rule_id, device_id, state, time_logged from alert_log where device_id = 817 and time_logged LIKE '2024-03-16%';
+--------+---------+-----------+-------+---------------------+
| id     | rule_id | device_id | state | time_logged         |
+--------+---------+-----------+-------+---------------------+
| 145133 |      69 |       817 |     1 | 2024-03-16 12:01:08 |
| 145140 |      69 |       817 |     0 | 2024-03-16 12:12:02 |
+--------+---------+-----------+-------+---------------------+
2 rows in set (0.000 sec)

I have a 10-minute delay set on my alerts, and I noticed that the alert and the recovery are only slightly more than 10 minutes apart. Could that be related? E.g. perhaps the recovery is processed before the original (delayed) alert has actually been sent, which causes some sort of race condition?
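
For reference, the exact gap between the two alert_log rows can be worked out with a few lines of Python. This is only arithmetic on the timestamps shown above; that the 10-minute delay is counted from the time_logged of the state = 1 row is an assumption, not something confirmed from the LibreNMS code.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps taken from the alert_log rows 145133 (state 1) and 145140 (state 0).
alerted = datetime.strptime("2024-03-16 12:01:08", FMT)
recovered = datetime.strptime("2024-03-16 12:12:02", FMT)

# Assumption: the configured delay starts counting at the state = 1 timestamp.
delay = timedelta(minutes=10)

gap = recovered - alerted   # time between alert and recovery
margin = gap - delay        # how long after the delay expired the recovery was logged

print(f"gap between alert and recovery: {gap}")
print(f"recovery logged {margin} after the delay expired")
```

Under that assumption the recovery was logged 10 minutes 54 seconds after the alert, i.e. less than a minute after the delay would have expired.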

Any ideas?