Missing recoveries

Tozz · 17 March 2024 09:25


I’ve been working on a new Transport:

github.com/librenms/librenms

Alert Transport for TOPdesk

librenms:master ← rudybroersma:topdesktransport

opened 07:33PM - 15 Mar 24 UTC

rudybroersma

+776 -1

This is a WIP initial version of my TOPdesk Alert Transport. It works, but there are a few minor things I want to clean up/improve. I'd love to get some initial feedback on what should be changed/improved before this would be accepted.

And I’ve noticed that not all my incidents are being closed when the alert recovers.
It looks quite random and looks identical to:

I have ‘max alert’ set to 1 as well.

Because my alert transport contains quite a few debugging lines, I’m quite confident that LibreNMS isn’t calling deliverAlert() for the missing recoveries.

For example, the log output is (removed UUIDs for brevity):


TOPdesk: TopDesk UUID (M2403 539) created for LibreNMS incident 2315 and UID 132985  
TOPdesk: TopDesk UUID (M2403 540) created for LibreNMS incident 2315 and UID 132985  
TOPdesk: TopDesk UUID (M2403 541) created for LibreNMS incident 25568 and UID 145130  
TOPdesk: TopDesk UUID (M2403 542) created for LibreNMS incident 25575 and UID 145131  
TOPdesk: TopDesk UUID (M2403 543) created for LibreNMS incident 2178 and UID 145134  
TOPdesk: TopDesk UUID (M2403 544) created for LibreNMS incident 11947 and UID 145132  
TOPdesk: TopDesk UUID (M2403 545) created for LibreNMS incident 25553 and UID 145133  
TOPdesk: LibreNMS Alert 25568 recovered. Closing..  
TOPdesk: TopDesk Incident (M2403 541) will be closed (Libre ID: 25568)  
TOPdesk: LibreNMS Alert 25575 recovered. Closing..  
TOPdesk: TopDesk Incident (M2403 542) will be closed (Libre ID: 25575)  
TOPdesk: LibreNMS Alert 2178 recovered. Closing..  
TOPdesk: TopDesk Incident (M2403 543) will be closed (Libre ID: 2178)

To clarify:

All 7 incidents are resolved according to LibreNMS
Only 3 seem to have called deliverAlert() upon recovery
All 7 incidents occurred at about the same time, and so did the recoveries (the devices are all behind a single device that went down), so this is possibly a race condition
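
To double-check which alerts never reached the recovery branch, the log excerpt above can be cross-referenced with a small standalone script. This is only a sketch: it assumes the two log line formats shown above, and the variable names are mine, not part of the transport.

```php
<?php
// Log excerpt copied from above (UUIDs already removed).
$log = <<<'LOG'
TOPdesk: TopDesk UUID (M2403 539) created for LibreNMS incident 2315 and UID 132985
TOPdesk: TopDesk UUID (M2403 540) created for LibreNMS incident 2315 and UID 132985
TOPdesk: TopDesk UUID (M2403 541) created for LibreNMS incident 25568 and UID 145130
TOPdesk: TopDesk UUID (M2403 542) created for LibreNMS incident 25575 and UID 145131
TOPdesk: TopDesk UUID (M2403 543) created for LibreNMS incident 2178 and UID 145134
TOPdesk: TopDesk UUID (M2403 544) created for LibreNMS incident 11947 and UID 145132
TOPdesk: TopDesk UUID (M2403 545) created for LibreNMS incident 25553 and UID 145133
TOPdesk: LibreNMS Alert 25568 recovered. Closing..
TOPdesk: TopDesk Incident (M2403 541) will be closed (Libre ID: 25568)
TOPdesk: LibreNMS Alert 25575 recovered. Closing..
TOPdesk: TopDesk Incident (M2403 542) will be closed (Libre ID: 25575)
TOPdesk: LibreNMS Alert 2178 recovered. Closing..
TOPdesk: TopDesk Incident (M2403 543) will be closed (Libre ID: 2178)
LOG;

$created = [];   // LibreNMS alert ID => TOPdesk incident numbers created for it
$recovered = []; // LibreNMS alert IDs for which a "recovered" line was logged

foreach (explode("\n", $log) as $line) {
    if (preg_match('/UUID \((M\d+ \d+)\) created for LibreNMS incident (\d+)/', $line, $m)) {
        $created[$m[2]][] = $m[1];
    } elseif (preg_match('/LibreNMS Alert (\d+) recovered/', $line, $m)) {
        $recovered[$m[1]] = true;
    }
}

// Alerts that opened a ticket but never logged a recovery.
$missing = array_diff_key($created, $recovered);

foreach ($missing as $alertId => $tickets) {
    echo "No recovery logged for LibreNMS alert $alertId (" . implode(', ', $tickets) . ")\n";
}
```

Against this excerpt it reports alerts 2315, 11947 and 25553 as never recovered. It also shows that alert 2315 opened two tickets (M2403 539 and M2403 540) with the same UID, which may or may not be related.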

The line “TOPdesk: LibreNMS Alert x recovered. Closing..” is almost the first thing my transport logs in deliverAlert() for a recovery. So I feel this can’t be a bug in my transport at this stage. It’s here:

github.com

librenms/librenms/blob/8e03913954b13bb59f70416c841567808527d200/LibreNMS/Alert/Transport/Topdesk.php#L82


      
                      $this->addUuidToAlertLog($alert_data['uid'], $incident->getID());
                  }
              } else {
                  $incident = $this->getTopdeskIncident($recent_uuid);
                  $this->addAction('LibreNMS reported this issue again within ' . $this->config['ticket-reopen'] . ' hours. Reopening...', $incident, true);
                  $this->updateIncident($incident, TicketAction::TICKET_OPEN);
                  \Log::channel('single')->alert('TOPdesk: Reopening incident ' . $incident->getNumber());
              }
              break;
          case AlertState::CLEAR:
              \Log::channel('single')->alert('TOPdesk: LibreNMS Alert ' . $alert_data['alert_id'] . ' recovered. Closing..');
              if ($recent_uuid !== false) {
                  $incident = $this->getTopdeskIncident($recent_uuid);
                  \Log::channel('single')->alert('TOPdesk: TopDesk Incident ' . $recent_uuid . ' (' . $incident->getNumber() . ') will be closed (Libre ID: ' . $alert_data['alert_id'] . ')');
                  if ($incident === false) {
                      \Log::channel('single')->alert('TOPdesk: Unable to retrieve TopDesk UUID ' . $recent_uuid . '. Unable to close incident...');
                  } else {
                      $this->addAction('LibreNMS reported the incident as resolved. Closing incident..', $incident, true);
                      $closed = $this->updateIncident($incident, TicketAction::TICKET_CLOSE);
                  }
              } else {

I’d like to help out in finding the cause of this issue, but I’d like some suggestions from someone that has a better understanding of LibreNMS’s inner workings.

For example, the ticket M2403 545 / incident 25553 / alert_log 145133 does have a recovery entry in the alert_log table:

MariaDB [librenms]> select id, rule_id, device_id, state, time_logged from alert_log where device_id = 817 and time_logged LIKE '2024-03-16%';
+--------+---------+-----------+-------+---------------------+
| id     | rule_id | device_id | state | time_logged         |
+--------+---------+-----------+-------+---------------------+
| 145133 |      69 |       817 |     1 | 2024-03-16 12:01:08 |
| 145140 |      69 |       817 |     0 | 2024-03-16 12:12:02 |
+--------+---------+-----------+-------+---------------------+
2 rows in set (0.000 sec)

I have a 10-minute delay set on my alerts, and I noticed that the alert and the recovery were logged roughly 10 minutes apart. Could that be related? For example, perhaps the recovery alert is processed before the original alert was created, which caused some sort of race condition?
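
To put a number on that gap, here is a quick calculation using the two time_logged values from the query above. It is only a sketch and assumes the 10-minute delay is counted from the time_logged of the alert row, which I have not verified in the LibreNMS code.

```php
<?php
$utc = new DateTimeZone('UTC');

// time_logged values of alert_log rows 145133 (state 1) and 145140 (state 0).
$alertLogged    = new DateTimeImmutable('2024-03-16 12:01:08', $utc);
$recoveryLogged = new DateTimeImmutable('2024-03-16 12:12:02', $utc);

$gapSeconds   = $recoveryLogged->getTimestamp() - $alertLogged->getTimestamp();
$delaySeconds = 10 * 60; // the delay configured on my alert rules

printf(
    "Alert to recovery: %d s (delay is %d s, difference %d s)\n",
    $gapSeconds,
    $delaySeconds,
    $gapSeconds - $delaySeconds
);
```

That gives 654 seconds between the two rows, so under that assumption the recovery was logged only 54 seconds after the delay would have expired.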

Any ideas?