Alerts recovery notification taking three hours

CDawson1 · 9 February 2018 13:03

We’ve somewhat recently deployed an instance of LibreNMS and for the most part, have leveraged some default settings for various things.

We’ve created some alerts and have noticed some odd behavior for subsequent notifications of said alerts.

In particular, we have one defined for BGP peer down/up. In this case…

For reference, I’d made some changes yesterday that caused some peering to disappear, which I’ve subsequently removed the corresponding configuration this morning. I’d have expected these alerts to clear after the next polling cycle as the condition was no longer met. I’d made the change before 9am this morning and received a event recovered email about three hours later.

I don’t see anything that might reference a 3 hour hold time. Is there anywhere I should look to ensure that these events are alerted on after the next polling cycle?

laf · 10 February 2018 13:37

Most likely bad timezones, run ./validate.php and post the output.

CDawson1 · 10 February 2018 14:07

I can’t do that at the moment, but I can tell you that it’s output is ok. It’s set to GMT and other alerts clear after the next poll.

What would the reasoning be for a bad time zone? It’s relative time, not absolute time is it not?

Kevin_Krumm · 10 February 2018 16:11

If the PHP timezone is not set correctly then it could cause the weird alert transport delay. It would need to be set in php.ini
Also, check the event log in the web UI there you can see when the alert happened then resolved, just to be sure.

CDawson1 · 12 February 2018 11:46

Further to the above, here’s some additional info…

First, the device’s log (one of 'em anyway).

For context, 1046 on Feb 8th is when I removed the peer it’s complaining about. At just before 9 on the 9th is when I removed the relevant configuration from the device. At that point, I’d have expected the next poll to clear the alert.

Feb  8 10:46:02.549 gmt: %BGP-3-NOTIFICATION: sent to neighbor xxx.xxx.51.130 4/0 (hold time expired) 0 bytes 
Feb  8 10:46:02.550 gmt: %BGP-5-NBR_RESET: Neighbor xxx.xxx.51.130 reset (BGP Notification sent)
Feb  8 10:46:02.550 gmt: %BGP-5-ADJCHANGE: neighbor xxx.xxx.51.130 Down BGP Notification sent
Feb  8 10:46:02.550 gmt: %BGP_SESSION-5-ADJCHANGE: neighbor xxx.xxx.51.130 IPv4 Unicast topology base removed from session  BGP Notification sent
Feb  8 10:53:00.392 gmt: %BGP-3-NOTIFICATION: sent to neighbor xxx.xxx.51.131 4/0 (hold time expired) 0 bytes 
Feb  8 10:53:00.392 gmt: %BGP-5-NBR_RESET: Neighbor xxx.xxx.51.131 reset (BGP Notification sent)
Feb  8 10:53:00.393 gmt: %BGP-5-ADJCHANGE: neighbor xxx.xxx.51.131 Down BGP Notification sent
Feb  8 10:53:00.393 gmt: %BGP_SESSION-5-ADJCHANGE: neighbor xxx.xxx.51.131 IPv4 Unicast topology base removed from session  BGP Notification sent

Feb  9 08:52:14.998 gmt: %BGP-3-NOTIFICATION: sent to neighbor xxx.xxx.51.130 active 6/3 (Peer De-configured) 0 bytes 
Feb  9 08:52:14.998 gmt: %BGP-5-NBR_RESET: Neighbor xxx.xxx.51.130 active reset (Neighbor deleted)
Feb  9 08:52:15.000 gmt: %BGP-5-ADJCHANGE: neighbor xxx.xxx.51.130 active Down Neighbor deleted

And the corresponding Libre event log…

…which you can see didn’t clear the alert until 12:46, nearly four hours after the condition cleared.

Just to show this wasn’t an isolated event for this device…

As you can see, the time of each device is spot on…

XXXXXXXXXXXX#sho time       
Load for five secs: 0%/0%; one minute: 1%; five minutes: 1%
Time source is NTP, 11:45:02.469 gmt Mon Feb 12 2018

root@xxxxxxx:/opt/librenms/# uptime
 11:45:02 up 3 days, 21:18,  1 user,  load average: 0.03, 0.42, 0.60

And the config validation only complains about some weathermap permissions, as expected.

Kevin_Krumm · 12 February 2018 11:57

I would still double check your php.ini for the timezone

CDawson1 · 12 February 2018 12:07

This was set on the initial configuration…

root@txxxx:/etc/php/7.0/fpm# cat php.ini | grep timezone
date.timezone = Europe/London

Kevin_Krumm · 12 February 2018 13:17

Looking just the event log looks like alert rule triggered at 2-8-18 10:56 then it was ack at 16:23 which doesn’t clear the alert btw. Then the alert cleared “recovered” at 12:46 which lines up with the device event log.

Only else i would check is the system time on your server that is running librenms.

laf · 12 February 2018 14:13

Actually I don’t think we delete bgp sessions until discovery runs which is every 6 hours at 33 mins past. That ties in with your events at 12:46 I expect.

CDawson1 · 12 February 2018 14:17

@Kevin_Krumm Yeah, I know acking doesn’t clear it. What did actually clear it is me removing the relevant configuration at 0852 on the 9th.

@laf That makes more sense, though I woudn’t think that should be the case, but I understand why that IS the case.

I don’t want to increase the discovery timing, but what would be a better approach to this?