For reference, I’d made some changes yesterday that caused some peering to disappear, which I’ve subsequently removed the corresponding configuration this morning. I’d have expected these alerts to clear after the next polling cycle as the condition was no longer met. I’d made the change before 9am this morning and received a event recovered email about three hours later.
I don’t see anything that might reference a 3 hour hold time. Is there anywhere I should look to ensure that these events are alerted on after the next polling cycle?
If the PHP timezone is not set correctly then it could cause the weird alert transport delay. It would need to be set in php.ini
Also, check the event log in the web UI there you can see when the alert happened then resolved, just to be sure.
Further to the above, here’s some additional info…
First, the device’s log (one of 'em anyway).
For context, 1046 on Feb 8th is when I removed the peer it’s complaining about. At just before 9 on the 9th is when I removed the relevant configuration from the device. At that point, I’d have expected the next poll to clear the alert.
Feb 8 10:46:02.549 gmt: %BGP-3-NOTIFICATION: sent to neighbor xxx.xxx.51.130 4/0 (hold time expired) 0 bytes
Feb 8 10:46:02.550 gmt: %BGP-5-NBR_RESET: Neighbor xxx.xxx.51.130 reset (BGP Notification sent)
Feb 8 10:46:02.550 gmt: %BGP-5-ADJCHANGE: neighbor xxx.xxx.51.130 Down BGP Notification sent
Feb 8 10:46:02.550 gmt: %BGP_SESSION-5-ADJCHANGE: neighbor xxx.xxx.51.130 IPv4 Unicast topology base removed from session BGP Notification sent
Feb 8 10:53:00.392 gmt: %BGP-3-NOTIFICATION: sent to neighbor xxx.xxx.51.131 4/0 (hold time expired) 0 bytes
Feb 8 10:53:00.392 gmt: %BGP-5-NBR_RESET: Neighbor xxx.xxx.51.131 reset (BGP Notification sent)
Feb 8 10:53:00.393 gmt: %BGP-5-ADJCHANGE: neighbor xxx.xxx.51.131 Down BGP Notification sent
Feb 8 10:53:00.393 gmt: %BGP_SESSION-5-ADJCHANGE: neighbor xxx.xxx.51.131 IPv4 Unicast topology base removed from session BGP Notification sent
Feb 9 08:52:14.998 gmt: %BGP-3-NOTIFICATION: sent to neighbor xxx.xxx.51.130 active 6/3 (Peer De-configured) 0 bytes
Feb 9 08:52:14.998 gmt: %BGP-5-NBR_RESET: Neighbor xxx.xxx.51.130 active reset (Neighbor deleted)
Feb 9 08:52:15.000 gmt: %BGP-5-ADJCHANGE: neighbor xxx.xxx.51.130 active Down Neighbor deleted
As you can see, the time of each device is spot on…
XXXXXXXXXXXX#sho time
Load for five secs: 0%/0%; one minute: 1%; five minutes: 1%
Time source is NTP, 11:45:02.469 gmt Mon Feb 12 2018
root@xxxxxxx:/opt/librenms/# uptime
11:45:02 up 3 days, 21:18, 1 user, load average: 0.03, 0.42, 0.60
And the config validation only complains about some weathermap permissions, as expected.
Looking just the event log looks like alert rule triggered at 2-8-18 10:56 then it was ack at 16:23 which doesn’t clear the alert btw. Then the alert cleared “recovered” at 12:46 which lines up with the device event log.
Only else i would check is the system time on your server that is running librenms.
Actually I don’t think we delete bgp sessions until discovery runs which is every 6 hours at 33 mins past. That ties in with your events at 12:46 I expect.