Autodiscover search crashing SNMP/Web interface on some switches

Hi All,

I’m a relatively new (and very impressed) user of LibreNMS and have spent about a week setting it up and fine tuning it on our network, and I’ve run into a strange issue.

Three of our 36 switches (mostly D-Link) have been “crashing” when polled by LibreNMS. When I say crashing, they continue to switch layer 2 network traffic normally, however they totally stop responding to pings, SNMP queries and the web interface. Basically anything other than layer 2 switching stops working. Although I haven’t verified it, I think features like LLDP also stop working.

Once this happens a hard power cycle of the device is required to get it responding again, and so far I haven’t found a configuration for the switch that avoids this problem.

Two of the switches are the same model - D-Link DGS-1210-24, and one is a much older one - DGS-1224T. All are running the latest firmware, which unfortunately is very old now. (Strangely, we have other DGS-1224Ts which are not crashing - presumably they are on different firmware versions, but I haven’t checked.)

After a while I realised it is not the regular 5 minute SNMP polls which are crashing them, but the 6 hourly Autodiscover. I realised this when the crashes were happening 6 hours apart and the “last discovered” time was 5 minutes before the last heard from time…

So as a workaround I’ve individually excluded them from being re-discovered with lines such as:

$config['autodiscovery']['nets-exclude'][] = '10.0.2.19/32';

Has anyone else seen Autodiscover crashing the management interface on switches?

Is there any way to tell LibreNMS to automatically exclude already discovered devices from “re-discovery”? Is there any advantage to already known devices being rediscovered periodically?

For example, if I replace a switch with a newer model we would normally configure its management IP to be the same as the one it replaced. Would LibreNMS pick up all the changes (number of ports, model number etc.) during the 5 minute polling, or would a full re-discovery be necessary to update that info?

(Of course I could manually tell it to rediscover a device after I have replaced it, I don’t need to rely on periodic auto-rediscover)

A quick update on this.

We have 6 other DGS-1224T switches which were running a much later firmware version and were not crashing. After updating the firmware on this one from 4.00.12 to 4.21.02 it is no longer crashing when being polled by SNMP.

In the case of the DGS-1210-24P it seems that while disabling Autodiscover has increased how long it lasts under SNMP polling from 6 hours to about 24 hours, it does eventually still hit the same problem of the management interface no longer responding.

My guess is there is a memory leak in the SNMP server in firmware version 4.10.023 of this switch and after it has been polled enough times it runs low on memory and stops responding to L3 requests until rebooted. :frowning:

Great job on finding a remedy. I would have suggested a firmware update, especially on switches that might not be up to the task. I find that the Linksys, D-Link, Netgear, Belkin stuff isn’t really built for reliability. How often does someone need to reboot a SOHO setup? Pretty frequently, likely due to memory leaks.

We haven’t had too much trouble with enterprise grade managed D-Link switches - most of them have uptimes of hundreds of days or even multiple years… However, when I started running LibreNMS and exercising the SNMP service on them all (previously they were not being actively polled), a small number were found wanting, it seems due to running outdated firmware.

I’ve now found an updated firmware version for the DGS-1210-24P - 4.21.B008, so fingers crossed that helps those two switches as well…

I’d consider yourself fortunate. There’s a reason why Cisco and comparable manufacturers’ equipment costs substantially more than brands like D-Link etc. That’s why you’ll see Cisco equipment that has been running without a reboot for 10 years.

Quick stupid story: when I first purchased a home, I signed up for internet and was issued a motorola surfboard docsis modem. It continually dropped, reset, required reboots. ISP came out, tested the lines, replaced the modem and this continued. I reviewed the logs and there wasn’t much I could do.

I ended up slamming in a Cisco router with Docsis 2.0 card installed in the chassis and was able to run without ever needing a reboot, power cycle or any of that garbage.

@DBMandrake have a look at http://ftp.dlink.ru/pub/ sometimes they have some really cool FW versions on there. We also stock a lot of dlinks and this site has come in handy for example when you want to convert your 3420 into a fully fledged router :wink:

Anyways, I have seen funky stuff like uptime reporting on D-Link devices being fixed with a FW update. All in all, D-Link is good value for money (especially if the Cisco price-tag is out of range).

If you’re talking about consumer/SOHO level D-Link switches then yeah, those are a bit average. But their Layer 3 enterprise managed switches are quite expensive and pretty decent. Stuff like the DGS-3420-52P, of which we have several that have never given the slightest trouble.

Thanks for the link - I had come across that site before but unfortunately it doesn’t have any firmware for the DGS-1210-24P. (Power over Ethernet version) Also the DGS-1210-24 firmware there is only for revision A hardware and this unit is revision D1, so totally different underlying hardware. (CPU etc)

The 4.21.B008 firmware I updated to doesn’t seem to have fixed the issue with that model so I’ve opened a support ticket with D-Link to see what they say. Usually a firmware with a B in the last field is a beta firmware, so it’s very odd that the “latest” firmware on the EU D-Link website has a B in the firmware version…

I guess this is getting pretty off topic now so I should probably leave it here!

Hey there, did you ever get to the bottom of this?

I have a number of DGS-1210-24s as well as a DGS-1210-08P that “crash” when LibreNMS tries to talk to them. I also have a DGS-1510-28P which works absolutely fine.

Observium has no issue talking to it, so I’m not sure that it’s the SNMP server on the switch that’s at fault, though?

Hi,

No, unfortunately I was never able to resolve the problem with this model of switch. Our network (approx 40 switches) is almost entirely D-Links of many different models and ages; the vast majority of models work perfectly with LibreNMS with no issues at all, but the following models have problems with LibreNMS:

DGS-1210-24P rev D1

As described in this thread “too much” SNMP polling eventually causes the entire management side of the switch to become unresponsive - it stops responding to pings, web interface and SNMP entirely, however it continues to switch network traffic as normal. This happens after typically about 6-12 hours.

I spent a lot of time with D-Link support and was given an unreleased firmware version to try - 4.22.B007 - and while this version improves things a bit (the management interface eventually starts responding again instead of requiring a hard power cycle) it does not solve the problem.

In the end I simply disabled SNMP polling and treat it as a ping only device now. Luckily we only have two of these on the network.

DGS-1210-24P rev G1

This model has a similar but slightly different problem - it stops responding to pings (and SNMP/web interface) periodically, for random amounts of time, anywhere from a few seconds to a few minutes, then it recovers as if nothing was wrong. Like the earlier revision of the switch, this has no impact on traffic traversing the switch, only on the management features.

I have found on this switch that if I use SNMP v2c the issue is much worse than if I use SNMP v3, so I use SNMP v3 (unauthenticated) to minimise the problem with this switch. Luckily we also only have two of this model.

My ping down and SNMP down alerts both also have a 6 minute alert delay configured - so that a device has to be down for two consecutive 5 minute polling periods for an alert to be sent. (However, being down for a single polling period is still logged.)
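For reference, the delay is just the “Delay” field on the alert rule in the rule editor - e.g. on a rule roughly like the stock device down one, from memory something like macros.device_down = 1 with Delay set to 6m. The exact rule text will vary with your install.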

If I didn’t do this I would have multiple false alerts per day due to these two switches, and occasionally from other devices. Currently these switches are running the “latest” firmware 7.20.003. (Although if you open a support ticket D-Link always seem to have newer versions available which are not on their website - I have not bothered with this model.)

My gut feeling is that polling a specific SNMP OID is triggering a memory leak which after enough polling periods runs the application processor out of memory and causes the SNMP service and sometimes the web interface to crash due to lack of memory.

It may be that Observium doesn’t poll the specific OID which upsets the switch thus you don’t see a problem with it.

It would probably be possible to work around the issue by setting an OID filter in the SNMP view page of the switch to block access to the troublesome OID - but to do that you’d first have to figure out what it was, and I don’t have the time or expertise to do that, especially on a live switch which I can’t take out of service.
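If anyone with a spare unit does want to hunt for it, one crude approach would be to walk the standard MIB-2 groups one at a time with plain snmpwalk and see which one hangs the agent - something like this (community string and address are just placeholders, and I haven’t actually tried it myself):

# walk each MIB-2 group in turn; whichever walk hangs points at the suspect subtree
for group in system interfaces at ip icmp tcp udp snmp; do
  echo "== $group =="
  snmpwalk -v2c -c public -t 5 -r 1 10.0.2.19 $group
done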

Here are some of the D-Link switches we use which have no problems with LibreNMS polling:

DXS-1100-10TS Rev. A1
DGS-3420-52P Rev. B1
DGS-3130-54PS Rev. A1
DGS-3120-48PC Rev. B1
DGS-3100-24TG
DGS-1250-28XMP
DGS-1224TP (latest firmware only - old versions do crash with prolonged SNMP polling)
DGS-1224T (latest firmware only - old versions do crash with prolonged SNMP polling)
DGS-1210-28P/C1
DGS-1210-24 Rev A1

My advice would be to stay away from the DGS-1210 series as their management features don’t seem to be very stable, at least in the D1 and G1 revisions - the original A1 revision seems OK.

In general, try:

  • Update to latest firmware
  • Report the problem to Dell

If this doesn’t help, try modifying the OS definition yaml:

https://docs.librenms.org/Developing/os/Settings/#disable-snmpbulkwalk

Done that already…

D-Link you mean. :slight_smile: Done that already…

Can you be a little more specific here?

I assume you are referring to includes/definitions/dlink.yaml, but that rather than editing the file directly (which could be overwritten on an update) I should do something like this in config.php?

$config['os']['dlink']['nobulk'] = 'true';

This would affect polling for all our dlink switches though, even the majority of switches which don’t have issues, and make the polling process less efficient. I take it there is no way to enable nobulk only for specific models of dlink switch, or for specific devices in the device list? Ideally I would only want to disable snmpbulkwalk on the affected devices.

Would making the change above affect both 6 hourly discovery and the 5 minute polling? As both seem to contribute to the eventual crashing of SNMP on the switch. (It will last for longer if I blacklist the device for discovery but still eventually crash)

Well that didn’t take long to confirm it didn’t help. :frowning:

I used $config['os']['dlink']['nobulk'] = 'true'; in config.php and confirmed by monitoring running processes that dlink devices were indeed being queried with snmpget instead of snmpbulkget. However, within about 15 minutes of re-enabling SNMP polling of one of the DGS-1210-24P Rev D1 switches it has stopped responding to SNMP:

Timeout: No Response from udp:10.0.2.19:161

It is still responding to pings and web interface for now. I know from past experience that the SNMP service on it will probably not recover until the switch is rebooted.
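For anyone wanting to reproduce the two query styles outside LibreNMS, the rough manual equivalents with the standard net-snmp tools would be something like this (community string and address are placeholders):

# single GET requests - roughly what the poller falls back to with bulk disabled
snmpget -v2c -c public 10.0.2.19 sysDescr.0 sysUpTime.0

# bulk requests - the default behaviour, walking the interface table in larger chunks
snmpbulkwalk -v2c -c public -Cr20 10.0.2.19 IF-MIB::ifTable

In this case even the non-bulk style of polling was enough to upset the switch.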

Really poor showing from D-Link to be honest. Three different firmware versions tried over the last few months and lots of back and forth with support and it still can’t cope with being polled by LibreNMS without SNMP crashing. (On older firmware versions it took down the web interface as well)

I would say definitely avoid the DGS-1210 series. I’m contacting D-Link again to see if there are any further unpublished firmware updates for these two models but I’m not holding my breath for a solution.

Just wanted to make the observation that disabling snmpbulkwalk has actually dramatically cut the poller time for some switch models.

In particular the DGS-1224T which used to take 140 seconds according to the poller logs now takes under 40 seconds!

Also I’ve just noticed on that particular model that while some data (name, model number etc.) was previously being gathered from the switch successfully, interface traffic graphs were not working with snmpbulkwalk - presumably the snmpbulkwalk session was timing out before it got to the per interface data.

With bulk transfers disabled all the DGS-1224Ts are now reporting interface traffic for the first time, so I think I’ll leave it set this way. Previously I’d assumed that very old model just didn’t support per interface statistics via SNMP!

So even though this setting didn’t help with the SNMP service crashing on another model of switch, it did help fix a problem I didn’t even know I had. :slight_smile:

Good news, I’ve found a solution to the problem with the DGS-1210-24P Rev D1 and have now been successfully polling both of these switches with SNMP for the last several days.

I was trying to get them working again recently and realised that it was the 6 hourly discovery scan (which finds port names, device model etc) which was crashing the SNMP service on the switch, but the 5 minute poller scan was not crashing it.

Hence it would work for a few hours until the next scheduled discovery scan occurred. Since I first started this thread I’ve learnt about the “Capture” feature in the GUI - somehow I never noticed this feature until quite recently and it is extremely useful and helped me nail down the cause and solution quite quickly.

Once I realised it was the discovery scan causing the issue I manually ran a discovery scan by going to the device page, clicking on the 3 dot button on the far right, then going to Capture.

Then I left it on the default of Discovery and pressed Run - this immediately runs a discovery scan, showing the result in the window below, which you can copy and paste.

Every time I would run discovery manually the SNMP service on the switch would hang until after a reboot. Looking at the debug log capture I could see the last scan run before the SNMP hang was the IPv6 one.
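If you prefer the command line to the Capture page, I believe you can reproduce the same thing by running just that one discovery module against the device from the LibreNMS install directory - the module name here is my reading of the debug output, so double check it against your own capture:

./discovery.php -h 10.0.2.19 -d -m ipv6-addresses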

I’m not interested in IPv6 data from the switch (I’m not even sure if this switch supports it to be honest) so I simply disabled IPv6 in the discovery module for this specific device.

To do this go into the device page, click the settings icon then go to Modules. Down the right hand side in the discovery modules list set IPv6 to disabled.

After this it will no longer be polled for the specific data that is causing the hang. If the SNMP service is already hung you’ll need to reboot the switch and after that you should find you can manually perform a Discovery scan in the Capture Debug page without the SNMP service hanging and that polling now works normally.
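If you had a lot of affected switches and didn’t need IPv6 address discovery anywhere, I believe the same module can also be turned off globally in config.php rather than per device, with something along the lines of:

$config['discovery_modules']['ipv6-addresses'] = false;

but the per-device toggle is the safer option when only a couple of switches are affected.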

I’m still running the unreleased 4.22.B007 firmware (for Rev D1 hardware) which I was provided by D-Link support. I don’t know if disabling IPv6 is sufficient on the last public firmware release - if you try this workaround let us know. If you still see SNMP hanging after a few hours/days, contact D-Link support to get this firmware version in addition to disabling IPv6 polling.

I also recommend disabling the Safeguard engine on this model of switch. In my experience, if it is enabled and you poll the switch for the IPv6 SNMP data, not only does the SNMP service hang, the web/telnet interface hangs as well, making it impossible to remotely reboot the switch. (It still keeps switching traffic, however.) But with the Safeguard engine disabled, if you do cause the SNMP service to crash during your testing it does not affect other features like the web interface, so a remote reboot is still possible.

Good luck and let us know if a similar fix works for you, assuming your switches are also Rev D1.

The issue here is clearly a bug in the SNMP implementation in the switch, as when it receives a certain request it hangs forever. However, no more firmware versions are forthcoming from D-Link (I checked again with support) and due to the age of the switch I think this bug will likely never be fixed, so disabling the IPv6 scan seems like a reasonable workaround, especially when you can do it per device.

That seems to have solved it - I added the switch to my LibreNMS instance, quickly disabled IPv6 (don’t need it) and also disabled the Safeguard engine on the switch.
So far so good! (I am only using a DGS-1210-08P at the moment; I got rid of the 24 port switches.) This is a D2 hardware revision with the latest public firmware, 4.21.B008.

Fingers crossed. :slight_smile:

Did you manually run the discovery from the Capture debug page successfully without it crashing?

Otherwise I wouldn’t expect you to see a problem immediately, as the default cron schedule for discovery is only every 6 hours.
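From memory the relevant line in the distributed cron file looks something like this (the minute offset and exact script may differ depending on your install):

33 */6 * * * librenms /opt/librenms/discovery-wrapper.py 1

so unless you run discovery manually it only fires four times a day.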

In my case with IPv6 discovery enabled it would crash SNMP every time I manually ran discovery; with it disabled I can run discovery many times in a row with no crash, and touch wood it has been working fine since Friday…
