Issues with devices showing offline even though they're not

LD9K · 8 June 2025 01:19

I’m running LibreNMS on docker using compose. My environment is docker running on Ubuntu.

I just migrated from Observium to LibreNMS (by “migrated” I mean I used addhost.php to manually add around 60 devices; I lost the historical data, but I’m not worried about that). Some of my devices seem to go “offline” at random (some are Windows devices on SNMP v2c, some are Linux machines polled over SNMP v3 using udp6), and LibreNMS attributes this to a failed ICMP check.

But when I step into the container using docker exec -it librenms /bin/bash and then run fping <hostname>, it reports that the device is alive. If I execute the command LibreNMS uses to check whether the host is up, i.e.

'/usr/sbin/fping6' '-e' '-q' '-c' '3' '-p' '500' '-t' '500' '-O' '0' '<hostname>'

I get an output stating it is up:

<hostname> : xmt/rcv/%loss = 3/3/0%, min/avg/max = 15.1/22.7/37.5

Additionally, normal ping (or ping6) also works, so I’m not sure why LibreNMS thinks my device is offline.

Strangely, if I run lnms device:poll all the devices come online, then after a minute or two I see the devices start dropping offline one by one, with Device status changed to Down from icmp check. in the log.

Also worth mentioning: the poller wasn’t working properly (according to validate.php). I googled it and found a post on the LibreNMS forums which said the fix was to create these two files and map them to the librenms service, which fixed that issue. I’m not sure if this is related, but I thought I’d mention it here. More details are in the post.

Also, my Observium setup (which was working OK) was running on the same hardware, so I don’t think this is a networking / routing / firewalling / hardware issue. The only difference is that Observium was running on a VM by itself, whereas LibreNMS is deployed via Docker.

Let me know if I can provide any further info to help diagnose this.

Output of ./validate.php:

$ php validate.php 
===========================================
Component | Version
--------- | -------
LibreNMS  | 25.5.0 (2025-05-17T09:23:44+12:00)
DB Schema | 2025_05_03_152418_remove_invalid_sensor_classes (338)
PHP       | 8.3.19
Python    | 3.12.10
Database  | MariaDB 10.11.13-MariaDB-ubu2204
RRDTool   | 1.9.0
SNMP      | 5.9.4
===========================================

[OK]    Installed from the official Docker image; no Composer required
[OK]    Database Connected
[OK]    Database Schema is current
[OK]    SQL Server meets minimum requirements
[OK]    lower_case_table_names is enabled
[OK]    MySQL engine is optimal
[OK]    Database and column collations are correct
[OK]    Database schema correct
[OK]    MySQL and PHP time match
[OK]    Active pollers found
[OK]    Dispatcher Service is enabled
[OK]    Locks are functional
[OK]    No python wrapper pollers found
[OK]    Redis is functional
[OK]    rrd_dir is writable
[OK]    rrdtool version ok
[WARN]  Updates are managed through the official Docker image

Output of discovery / poller.php for a host with this issue (In this case, a Fortigate firewall): librennms-diagnostics · GitHub

(output of poller starts on line 4466)

laf · 8 June 2025 15:40

Are you sure they are going down because of ICMP rather than SNMP?

In general, we just run fping, so if that shows the device is not pingable at the time, there’s nothing more we can do to check. I’d check the host and Docker resources to make sure things aren’t overloaded, as that seems the only obvious cause if you are 100% sure these aren’t genuine transient issues.
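
One way to test for transient loss is to repeat the same fping probe at a fixed interval from inside the container, log each result with a timestamp, and compare the log against the times LibreNMS marked the device down. This is only a minimal sketch, assuming a POSIX shell is available in the container. The function names probe_once and probe_loop, the FPING variable and the 30-second default interval are made up for illustration; the fping flags are the ones quoted earlier in this thread.

```shell
#!/bin/sh
# Binary to probe with. Override with FPING=/usr/sbin/fping6 to match the
# command quoted earlier. (FPING is a hypothetical knob, not a LibreNMS setting.)
FPING="${FPING:-fping}"

# Run one probe with the flags quoted in the thread and print
# "<UTC timestamp> <host> up|down" based on fping's exit status.
probe_once() {
    host="$1"
    if "$FPING" -e -q -c 3 -p 500 -t 500 -O 0 "$host" >/dev/null 2>&1; then
        status=up
    else
        status=down
    fi
    printf '%s %s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$host" "$status"
}

# Repeat the probe forever, sleeping <interval> seconds between runs
# (30 seconds if no interval is given).
probe_loop() {
    host="$1"
    interval="${2:-30}"
    while :; do
        probe_once "$host"
        sleep "$interval"
    done
}
```

After sourcing the file, something like probe_loop <hostname> 30 >> /tmp/fping-probe.log would build up a log; any "down" lines that line up with the "Down from icmp check" events would suggest the loss is real at that moment rather than a LibreNMS artifact.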

LD9K · 9 June 2025 01:09

The device is most definitely not going down. As I mentioned, I migrated from Observium, so if the devices were going down I’d definitely have known about it by now.

I rebuilt a fresh Ubuntu 24.04 VM to run LibreNMS on (no docker), and installed everything according to the guide. It went relatively smoothly. I then imported all my devices again using addhost.php. Most went fine, and are showing online until I hit this issue:

I traced this to fping defaulting to IPv6

If I run fping <router>, I get unreachable.

If I run fping <router> -4, it is reachable.
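
To compare both address families side by side for a given host, a small wrapper along these lines could be used. It is a sketch, assuming an fping build that accepts the -4 and -6 options (as the -4 run above does); check_families and the FPING variable are hypothetical names, and the probe count and timing values are simply borrowed from the command LibreNMS was seen running.

```shell
#!/bin/sh
# Binary to probe with (hypothetical override knob, not a LibreNMS setting).
FPING="${FPING:-fping}"

# Probe one host over IPv4 and then IPv6, printing one line per family.
# fping exits 0 only when the host answered, so the exit status is enough.
check_families() {
    host="$1"
    for fam in 4 6; do
        if "$FPING" "-$fam" -q -c 3 -p 500 -t 500 "$host" >/dev/null 2>&1; then
            echo "IPv$fam: reachable"
        else
            echo "IPv$fam: unreachable"
        fi
    done
}
```

Running check_families <router> after sourcing the file shows at a glance whether only one family is failing, which is the pattern described above.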

So I guess there are two questions at this stage:

1. Is it possible to make fping use IPv4 or IPv6, ideally based on the transport type passed in to lnms add (i.e. if it’s udp6 or tcp6 use fping -6, otherwise use fping -4)?
2. How do I troubleshoot my Docker issue where half my hosts are offline for no good reason? I checked the CPU/memory usage and it looks OK to me. Are there any known issues with this? I did pore over the documentation but didn’t really find anything.
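
The transport-to-family mapping I have in mind for the first question would look roughly like the sketch below. It only illustrates the behavior being asked for and is not how LibreNMS actually picks the address family. fping_family_flag is a hypothetical helper, and the plain udp and tcp names are assumed to be the IPv4 counterparts of the udp6 and tcp6 transports mentioned above.

```shell
#!/bin/sh
# Map an SNMP transport name to the fping address-family flag.
# Hypothetical helper for illustration only.
fping_family_flag() {
    case "$1" in
        udp6|tcp6) echo "-6" ;;   # IPv6 transports
        udp|tcp)   echo "-4" ;;   # assumed IPv4 transports
        *)
            echo "unknown transport: $1" >&2
            return 1
            ;;
    esac
}
```

With that in place, fping "$(fping_family_flag udp6)" <hostname> would probe over IPv6, and the same call with udp would probe over IPv4.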

LD9K · 28 June 2025 03:30

Gave up on running this in a container. I installed a Debian VM and installed LibreNMS on that, and it is working perfectly now.