We’re seeing an increasing number of false-positive alerts for BGP sessions across our Cisco router and L3 switch estate. At first I thought they were genuine, but I’ve searched the device logs and the sessions are solid: all have been up for at least a few weeks, and most for more than a year. Other than these alerts, the devices appear stable in LibreNMS.
I’ve optimised the poller and I can see that all devices are successfully polled within a window of about 200s. I’m not sure where to look first, so some guidance would be really appreciated.
The issue has occurred across multiple device models, including ASR1002 and ME3600X.
These alerts all come from SNMP polling, not traps. The monitoring alerts look exactly as if the BGP session has died. This happens across multiple monitored devices and seems to relate to the same peers on the respective devices. On checking the logs of each monitored device, there is no corresponding logged drop of the BGP session. The phantom outage usually lasts for a single polling cycle, so after 5 mins the session appears to come back up.
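To make the pattern concrete, here is a rough Python sketch of what I mean by a single-cycle phantom drop. It assumes the polled peer states have been exported as a list of `(timestamp, state)` pairs; the `single_cycle_drops` function, the sample data and the state strings are all made up for illustration and are not taken from LibreNMS itself.

```python
from typing import List, Tuple

ESTABLISHED = "established"  # assumed label for a healthy session


def single_cycle_drops(samples: List[Tuple[int, str]]) -> List[int]:
    """Return timestamps of polls where the peer looked down for exactly
    one poll, with an established state on both sides of it."""
    drops = []
    for i in range(1, len(samples) - 1):
        prev_state = samples[i - 1][1]
        ts, state = samples[i]
        next_state = samples[i + 1][1]
        if (
            state != ESTABLISHED
            and prev_state == ESTABLISHED
            and next_state == ESTABLISHED
        ):
            drops.append(ts)
    return drops


# Hypothetical samples at a 300s (5 minute) polling interval.
samples = [
    (0, "established"),
    (300, "established"),
    (600, "idle"),  # the phantom drop: one poll only
    (900, "established"),
    (1200, "established"),
]

print(single_cycle_drops(samples))
```

Timestamps flagged this way, with no matching drop in the router's own log, are the false positives I'm describing.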
We never used to get false positives of this type, and we have been running the server for a long while, so I feel like something has changed. I’m just not sure where to start looking to understand what or why.