Device is up, however, Event log/Email Alerting shows - device status changed to Down from snmp check

I have LibreNMS monitoring one of our remote sites, and everything seems to be working well. However, one of our switches is constantly generating Up/Down email alerts. Example:

2018-04-02 04:00:46 up 10.199.0.25 Device status changed to Up from snmp check.
2018-04-02 03:55:52 down 10.199.0.25 Device status changed to Down from snmp check
2018-04-02 01:25:44 up 10.199.0.25 Device status changed to Up from snmp check.
2018-04-02 01:21:00 down 10.199.0.25 Device status changed to Down from snmp check.

This is a Cisco 3650 switch stack (4 physical switches), and I’m monitoring the loopback. This is a high-visibility site, so I know it is not down; otherwise we’d be getting calls. I even log into it when I get these alerts to verify.

Is there any way to troubleshoot this to find out why this one device is having these issues? I’m monitoring Cisco routers and HP switches, but this is the only device generating these issues.

I am monitoring the loopback interface, which is in a vrf, if that matters.

Any help would be appreciated.

In LibreNMS, a device is marked down if it cannot be pinged or if SNMP is not responding. You need to troubleshoot why SNMP keeps timing out.

Is the firmware on the device up to date? Has it been restarted?
What is the polling time on the device?
Are you seeing any timeouts if you run a poller debug on that device? ./poller.php -h HOSTNAME -d
Have you looked at the performance doc? https://docs.librenms.org/#Support/Performance/
Is this device at a remote location? Is there a lot of latency?
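
Before changing anything, it can help to measure how long each "down" period lasts. The sketch below is not part of LibreNMS; it is a small standalone Python script, and the function name `outage_durations` is made up for illustration. It pairs each "down" event with the next "up" event, using the four event-log lines from the original post as input.

```python
from datetime import datetime

# Sample event-log lines copied from the original post (newest first).
EVENT_LOG = """\
2018-04-02 04:00:46 up 10.199.0.25 Device status changed to Up from snmp check.
2018-04-02 03:55:52 down 10.199.0.25 Device status changed to Down from snmp check
2018-04-02 01:25:44 up 10.199.0.25 Device status changed to Up from snmp check.
2018-04-02 01:21:00 down 10.199.0.25 Device status changed to Down from snmp check.
"""


def outage_durations(log_text):
    """Return (down_time, seconds_until_up) pairs, oldest first.

    Hypothetical helper: assumes each line starts with
    'YYYY-MM-DD HH:MM:SS <up|down> ...' as in the sample above.
    """
    events = []
    for line in log_text.splitlines():
        parts = line.split(maxsplit=3)
        if len(parts) < 3 or parts[2] not in ("up", "down"):
            continue  # skip anything that is not a status-change line
        when = datetime.strptime(parts[0] + " " + parts[1], "%Y-%m-%d %H:%M:%S")
        events.append((when, parts[2]))

    events.sort()  # the log is newest first; process it oldest first
    outages = []
    down_since = None
    for when, state in events:
        if state == "down":
            down_since = when
        elif down_since is not None:
            outages.append((down_since, int((when - down_since).total_seconds())))
            down_since = None
    return outages


durations = outage_durations(EVENT_LOG)
for started, seconds in durations:
    print(started, seconds, "seconds")
```

For the sample lines this gives 284 and 294 seconds. If the poller runs every 5 minutes (an assumption about this install, not something stated in the thread), each flap lasts about one polling cycle, which would point to a single missed SNMP check rather than a longer outage.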

We are running Denali 16.3.5, and the stack has been up for 40+ days. This is a newer install. I left the polling time at the default. However, I just changed the timeout from 1 to 3 to see if that helps. I’ve been running the poller, but it doesn’t hang at any particular module. It comes back fine.

We use HP’s NNMi to monitor the site remotely, and the device responds fine to SNMP for that product. The LibreNMS VS is local to the site. Just odd. I’ll see if the timeout change helps, or keep trying the poller.


A device is marked down if either ping or SNMP fails. I don't think this is reasonable.
It would be more reasonable to change the code so that a device is marked down only when both ping and SNMP fail.
As it stands, when a device responds to ping but SNMP is unreachable for some reason, it is marked down.
And when ping fails but SNMP is reachable, it is also marked down.
Maybe this should be fixed in the next version?
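
To make the difference between the two rules concrete, here is a minimal Python sketch. It is not LibreNMS code, and both function names are hypothetical. The first function follows the behaviour as described in this thread; the second follows the change suggested in this post.

```python
def down_if_either_fails(ping_ok, snmp_ok):
    """Behaviour described in this thread: one failed check marks the device down."""
    return not (ping_ok and snmp_ok)


def down_if_both_fail(ping_ok, snmp_ok):
    """Suggested behaviour: mark the device down only when ping and SNMP both fail."""
    return (not ping_ok) and (not snmp_ok)


# Print every combination of check results under both rules.
for ping_ok in (True, False):
    for snmp_ok in (True, False):
        print(
            "ping_ok=%s snmp_ok=%s either_rule_down=%s both_rule_down=%s"
            % (
                ping_ok,
                snmp_ok,
                down_if_either_fails(ping_ok, snmp_ok),
                down_if_both_fail(ping_ok, snmp_ok),
            )
        )
```

The two rules only disagree when exactly one check fails, which is the case in this thread: ping works but SNMP times out. The trade-off is that the second rule would also hide a device whose SNMP agent has really stopped responding.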