Erroneous device down alerting - help with polling, optimization and thresholds

tnielsen2 · 2 January 2019 19:26

Hello,

I am currently migrating my Solarwinds environment to LibreNMS. I have imported all of my nodes and now I am having problems with erroneous device downs. I believe this has to do with the ports polling timing out before a specific threshold.

I have a pair of VSS Cisco 4500x chassis, a stack of 5x 3750s, and a Brocade VDX 8770 chassis, which all have a lot of ports to poll via SNMP.

Solarwinds can poll these devices just fine, but in LibreNMS these three devices keep triggering a Device Down alert.

The alert is currently configured as follows:

macros.device_down = Yes
devices.status_reason = 'icmp'

I originally investigated the issue without the icmp condition, but the problem appeared to be related to SNMP timeouts, so I read the forum and attempted to tailor the alert to ICMP outages only, just to see.

ICMP connectivity is certainly not down. These devices are reliable for their sites and for my other SNMP poller (Solarwinds) I am migrating from.

I read the optimization guide.

I am using the official Docker image. (https://github.com/librenms/docker). This image uses RRD, MariaDB, and Docker Compose.
I updated today with the same issue.
I originally thought the limited CPU of these devices couldn’t handle being polled by both Solarwinds and LibreNMS simultaneously, so I removed them from Solarwinds, with the same result.
I tried to run poller.php against one of the devices in question. Below are the results:

/opt/librenms/poller.php core.3750 2019-01-02 12:14:13 - 1 devices polled in 596.2 secs  
SNMP [52/588.49s]: Get[22/240.93s] Getnext[0/0.00s] Walk[30/347.56s]
MySQL [1164/4.71s]: Cell[27/0.05s] Row[-26/-0.05s] Rows[58/0.24s] Column[2/0.00s] Update[813/2.96s] Insert[284/1.48s] Delete[6/0.02s]
RRD [334/0.27s]: Update[0/0.00s] Create [0/0.00s] Other[334/0.27s]

My performance polling history shows that the longest module to poll is “ports” at approximately 1000 ms.
I need some clarification as to what SNMP Max Repeaters actually does. Following the official guide (https://docs.librenms.org/Support/Performance/), I used the following syntax:

time snmpbulkwalk -v2c -cpublic HOSTNAME -Cr<REPEATERS> -M /opt/librenms/mibs -m IF-MIB IfEntry

However, when I replaced ‘public’ and HOSTNAME with the appropriate information, I got the following error:

-Cr: Unknown Object Identifier (Sub-id not found: (top) -> -Cr)

Below is the output of my validate.php command:

bash-4.4# /opt/librenms/validate.php
====================================
Component | Version
--------- | -------
LibreNMS  | 1.47
DB Schema | 275
PHP       | 7.2.13
MySQL     | 10.2.20-MariaDB-1:10.2.20+maria~bionic
RRDTool   | 1.7.0
SNMP      | NET-SNMP 5.7.3
====================================

[OK]    Composer Version: 1.8.0
[OK]    Dependencies up-to-date.
[OK]    Database connection successful
[OK]    Database schema correct
[WARN]  IPv6 is disabled on your server, you will not be able to add IPv6 devices.
[WARN]  Your install is over 24 hours out of date, last update: Sun, 30 Dec 2018 14:29:16 +0000
        [FIX]: 
        Make sure your daily.sh cron is running and run ./daily.sh by hand to see if there are any errors.
[WARN]  Your local git branch is not master, this will prevent automatic updates.
        [FIX]: 
        You can switch back to master with git checkout master
[FAIL]  We have found some files that are owned by a different user than librenms, this will stop you updating automatically and / or rrd files being updated causing graphs to fail.
        [FIX]: 
        sudo chown -R librenms:librenms /opt/librenms
        sudo setfacl -d -m g::rwx /data/rrd /data/logs /opt/librenms/bootstrap/cache/ /opt/librenms/storage/
        sudo chmod -R ug=rwX /data/rrd /data/logs /opt/librenms/bootstrap/cache/ /opt/librenms/storage/
        Files:
         /opt/librenms/cache/os_defs.cache

Thanks in advance. Any help is appreciated.

murrant · 2 January 2019 20:52

It sounds like you didn’t set repeaters properly, it should be a number.

I suggest you look at selective port polling as mentioned in the docs.

You should also fix the validate issues.

tnielsen2 · 2 January 2019 21:46

I attempted to use the following command strings from within the container, with the following output/errors. I am obviously doing something wrong, and have tried integers other than 50 (I started with 10 and went up to 32).

Can you help me figure out what I am doing wrong here with the syntax?

bash-4.4# time snmpbulkwalk -v2c -cpublic core.internal -Cr 50 -M /opt/librenms/mibs -m IF-MIB IfEntry
-Cr: Unknown Object Identifier (Sub-id not found: (top) -> -Cr)

bash-4.4# time snmpbulkwalk -v2c -cpublic core.internal -Cr50 -M /opt/librenms/mibs -m IF-MIB IfEntry                       
-Cr50: Unknown Object Identifier (Sub-id not found: (top) -> -Cr50)

bash-4.4# time snmpbulkwalk -v2c -cpublic 'core.internal' -Cr 50 -M /opt/librenms/mibs -m IF-MIB IfEntry
-Cr: Unknown Object Identifier (Sub-id not found: (top) -> -Cr)

bash-4.4# time snmpbulkwalk -v2c -cpublic 'core.internal' -Cr '50' -M /opt/librenms/mibs -m IF-MIB IfEntry 
-Cr: Unknown Object Identifier (Sub-id not found: (top) -> -Cr)
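One plausible reading of the errors above: this build of snmpbulkwalk appears to stop parsing options at the first non-option argument, so everything after the hostname, starting with -Cr, is taken as the OID to walk. A minimal sketch of a timing loop that places every option before the hostname follows. The walk_cmd helper and the SNMP_HOST, SNMP_COMMUNITY and RUN_WALKS variables are hypothetical names, the repeater values are arbitrary, and ifEntry (lower-case i) is assumed to be the intended IF-MIB object.

```shell
#!/bin/sh
# Sketch only: SNMP_HOST, SNMP_COMMUNITY, RUN_WALKS and walk_cmd are
# hypothetical names. Assumes the values contain no spaces or shell
# metacharacters, because the command is later passed to sh -c.
SNMP_HOST="${SNMP_HOST:-core.internal}"
SNMP_COMMUNITY="${SNMP_COMMUNITY:-public}"

# Print an snmpbulkwalk command line with all options BEFORE the hostname.
# $1 = max-repetitions value passed to -Cr
walk_cmd() {
    printf 'snmpbulkwalk -v2c -c %s -Cr%s -M /opt/librenms/mibs -m IF-MIB %s ifEntry\n' \
        "$SNMP_COMMUNITY" "$1" "$SNMP_HOST"
}

for repeaters in 10 20 50; do
    cmd=$(walk_cmd "$repeaters")
    echo "$cmd"
    # Only contact the device when explicitly asked to.
    if [ "${RUN_WALKS:-0}" = 1 ]; then
        start=$(date +%s)
        sh -c "$cmd" > /dev/null
        end=$(date +%s)
        echo "  -Cr$repeaters took $((end - start))s"
    fi
done
```

By default the loop only prints the commands; running it with RUN_WALKS=1 inside the container would execute each walk and report the elapsed whole seconds.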

tnielsen2 · 2 January 2019 21:50

I have also fixed my errors as per your suggestion, with the exception of the git warnings. I get the following error when trying to clean them up:

bash-4.4# find / -name github-remove
 /opt/librenms/scripts/github-remove
bash-4.4#  /opt/librenms/scripts/github-remove
usage: github-remove [-h] (-d | -s | -r) [-v]
github-remove: error: one of the arguments -d/--discard -s/--save -r/--restore is required

====================================
Component | Version
--------- | -------
LibreNMS  | 1.47
DB Schema | 275
PHP       | 7.2.13
MySQL     | 10.2.20-MariaDB-1:10.2.20+maria~bionic
RRDTool   | 1.7.0
SNMP      | NET-SNMP 5.7.3
====================================

[OK]    Composer Version: 1.8.0
[OK]    Dependencies up-to-date.
[OK]    Database connection successful
[OK]    Database schema correct
[WARN]  IPv6 is disabled on your server, you will not be able to add IPv6 devices.
[WARN]  Your install is over 24 hours out of date, last update: Sun, 30 Dec 2018 14:29:16 +0000
        [FIX]: 
        Make sure your daily.sh cron is running and run ./daily.sh by hand to see if there are any errors.
[WARN]  Your local git branch is not master, this will prevent automatic updates.
        [FIX]: 
        You can switch back to master with git checkout master
[WARN]  Your local git contains modified files, this could prevent automatic updates.
        [FIX]: 
        You can fix this with ./scripts/github-remove
        Modified Files:
         bootstrap/cache/.gitignore
         storage/app/.gitignore
         storage/app/public/.gitignore
         storage/debugbar/.gitignore
         storage/framework/cache/.gitignore
         storage/framework/sessions/.gitignore
         storage/framework/testing/.gitignore
         storage/framework/views/.gitignore
         storage/logs/.gitignore

Keep in mind this is the state the official Docker image is in after a vanilla docker-compose up. Also, .gitignore files shouldn’t matter at all.

murrant · 3 January 2019 05:12

Oddly enough, those are permissions issues.

The easiest way to fix it is to git checkout each .gitignore file listed.
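A minimal sketch of that suggestion, assuming the install lives in /opt/librenms and using the file list from the validate.php output above. LIBRENMS_DIR, APPLY and restore_cmds are hypothetical names; by default the script only prints the git commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch only: LIBRENMS_DIR, APPLY and restore_cmds are hypothetical names.
LIBRENMS_DIR="${LIBRENMS_DIR:-/opt/librenms}"

# Paths copied from the "Modified Files" list in the validate.php output.
MODIFIED_FILES="bootstrap/cache/.gitignore
storage/app/.gitignore
storage/app/public/.gitignore
storage/debugbar/.gitignore
storage/framework/cache/.gitignore
storage/framework/sessions/.gitignore
storage/framework/testing/.gitignore
storage/framework/views/.gitignore
storage/logs/.gitignore"

# Print one "git checkout" command per listed file.
restore_cmds() {
    for f in $MODIFIED_FILES; do
        printf 'git -C %s checkout -- %s\n' "$LIBRENMS_DIR" "$f"
    done
}

# Print the commands by default; run them only when APPLY=1 is set.
if [ "${APPLY:-0}" = 1 ]; then
    restore_cmds | sh
else
    restore_cmds
fi
```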

tnielsen2 · 3 January 2019 19:41

Hi Murrant,

A few things to note with the troubleshooting of these issues that I have discovered.

Solarwinds polling, along with Datadog polling and LibreNMS polling, was straining the CPU of the 3750 stack. I have suspended polling from all of the non-LibreNMS platforms, which resolved the issue for the 3750 stack.
I had a stale DNS entry for the 4500x network switches. It appears that the Docker container was not caching DNS, but rather doing round robin. After removing the stale DNS entry, I no longer get false positives on this node.
I have one more node that is problematic, which is the Brocade (now Extreme) 8770. It keeps giving me false positives, and I am having trouble identifying why.

As per your suggestion, those permission issues have been fixed, and below is now the output of validate.php. After running validate.php a few times to see the results, intermittently, I am getting an error for devices not being polled within the past 5 minutes, possibly due to performance issues.

The node that is erroneously reporting down is taking anywhere between 120 and 200 seconds to finish polling, with the ports polling module taking the most time.

This leads me back to the performance issue support page, which brings me back to a roadblock that I was experiencing earlier. I am attempting to run an snmpbulkwalk through a shell on the librenms container, and am having syntax issues with identifying how long specific snmp bulk walks are taking.

When I run the following snmpbulkwalk through the container, the syntax issue I have is not clear to me and was hoping you could point me in the correct direction.

As per the documentation, I ran the following command:

time snmpbulkwalk -v2c -c 'omitted' core.brocade.internal -M '/opt/librenms/mibs' -m IF-MIB IfEntry

Now, after doing some reading, I understand that I have to pick a specific MIB to search for, but I am not sure what that MIB is for that Brocade core. I left off from here and kept trying something else. If I could make a suggestion on the documentation, it would be to give a little more detail on what Max Repeaters does (from a technical perspective), along with an example of it working with a device in a lab/production.
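On the question of what Max Repeaters does: in SNMPv2c, a GETBULK request carries a max-repetitions field, which Net-SNMP sets with -Cr, and which caps how many successive values the agent may return for a requested OID in a single response. A rough sketch of the effect on round trips follows, assuming one OID per request, no non-repeaters, and responses that are never cut short by message size limits. The port count is a made-up illustration value, and bulk_round_trips is a hypothetical helper name.

```shell
#!/bin/sh
# Rough lower bound on GETBULK round trips needed to walk a table.
# $1 = number of values in the table, $2 = max-repetitions
bulk_round_trips() {
    echo $(( ($1 + $2 - 1) / $2 ))   # integer ceiling of $1 / $2
}

ports=300      # hypothetical port count, for illustration only
columns=22     # IF-MIB ifEntry defines up to 22 columns per interface
values=$((ports * columns))

for repeaters in 1 10 20 50; do
    echo "max-repetitions=$repeaters -> at least $(bulk_round_trips "$values" "$repeaters") requests"
done
```

Fewer round trips does not guarantee a faster walk. Some agents slow down or cap the response size at higher values, which is why timing the real walk against each device is still needed.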

After the Brocade core came back “online”, I configured the timeout to 300 seconds with 2 retries, and that seems to have stopped it alerting as down. I suspect it is not alerting only because it is likely in the middle of a “retry” or waiting for an SNMP poll timeout.

This begs the question, what IS the SNMP polling timeout by default?

To add to this, it seems as if that one particular device might have some slow response times… 3 of my other 8770s are polling and replying just fine, which is certainly odd.

Where it became a little bit of an issue to troubleshoot this last core/device though, is the fact that in order to set the SNMP timeout through the UI, the device had to be online. It would be preferable if this could be configured regardless of the node state.

I am happy to answer any questions on this, but it seems that the Docker container is pretty solid, out of the box with RRD, MariaDB and the app.

Thanks for your assistance on nailing this down.

bash-4.4# /opt/librenms/validate.php
====================================
Component | Version
--------- | -------
LibreNMS  | 1.47
DB Schema | 275
PHP       | 7.2.13
MySQL     | 10.2.20-MariaDB-1:10.2.20+maria~bionic
RRDTool   | 1.7.0
SNMP      | NET-SNMP 5.7.3
====================================

[OK]    Composer Version: 1.8.0
[OK]    Dependencies up-to-date.
[OK]    Database connection successful
[OK]    Database schema correct
[WARN]  Some devices have not been polled in the last 5 minutes. You may have performance issues.
        [FIX]: 
        Check your poll log and see: http://docs.librenms.org/Support/Performance/
        Devices:
         core.brocade.internal
[FAIL]  Discovery has not completed in the last 24 hours.
        [FIX]: 
        Check the cron job to make sure it is running and using discovery-wrapper.py
[WARN]  IPv6 is disabled on your server, you will not be able to add IPv6 devices.
[WARN]  Your install is over 24 hours out of date, last update: Sun, 30 Dec 2018 14:29:16 +0000
        [FIX]: 
        Make sure your daily.sh cron is running and run ./daily.sh by hand to see if there are any errors.

murrant · 3 January 2019 19:59

http://www.net-snmp.org/docs/man/snmpcmd.html Timeout information.

I don’t think you want to set it to 300s; this applies to EVERY snmp request. For most of my devices LibreNMS makes 20-60 snmp requests per poll. So, if the device isn’t responding very well to snmp, polling could take several HOURS to complete instead of the less than 5 minutes it needs to be.
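To put rough numbers on this: assuming every request times out, that each request is attempted once plus the configured number of retries, and that the timeout is the same for every attempt, the worst case for a single poll can be sketched as below. The 2 retries come from the setting mentioned earlier in the thread, the request counts from the 20-60 range above, and worst_case_secs is a hypothetical helper name.

```shell
#!/bin/sh
# Worst-case seconds spent waiting if every SNMP request times out.
# $1 = requests per poll, $2 = timeout in seconds, $3 = retries
worst_case_secs() {
    echo $(( $1 * $2 * ($3 + 1) ))
}

for requests in 20 60; do
    secs=$(worst_case_secs "$requests" 300 2)
    echo "$requests requests: ${secs}s (~$((secs / 3600))h)"
done
```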

You should check device firmware versions as snmp performance can vary widely based on the firmware version.

Reducing the amount of data polled will help in most of these situations too (whether that is acceptable is up to you, though).

tnielsen2 · 4 January 2019 02:15

It appears I spoke too soon regarding getting this fixed. I am still receiving down alerts on these three nodes.

If I set the SNMP timeout in the GUI per device, does this impact the global setting? Or does this only impact that one node?

The poller takes 900 seconds to complete the ports module, but I need this module more than I need the others. The next slowest module takes 200 seconds.

Blindly, I have tried upping the max OIDs per device, along with Max Repeaters, with no results. MySQL has the recommended optimization, and I am running RRD.

I have not tried fping tuning, as the ICMP replies are incredibly fast.

I have not attempted to optimize the poller-wrapper. Currently I am running 4x single core vcpus on the Docker host with 12 gigs of ram. Do you think this might be a contributing factor? CPU usage on that host is low, so I did not do this (I am only polling about 200 nodes).

murrant · 4 January 2019 06:19

You are waiting on the devices, not running into a performance issue on the server.

Sometimes the devices aren’t well optimized to handle the large amount of snmp data that LibreNMS can request.

Your only option is selective ports polling.

tnielsen2 · 4 January 2019 19:51

Thanks for the help. I am going to mark this as solved. I appreciate it!

jongalli · 6 January 2019 09:16

If there is a requirement, or necessity to poll all ports and these devices have long poll times, would the next recommended approach be distributed polling?

murrant · 6 January 2019 14:08

No, that would not help at all.

Your only choice is to get the device manufacturer to fix it at that point.

Or choose some ports that are less important and disable them with selected port polling.