Device types/groups list showing blank, affecting alerts

Uncle_Heavy · 1 April 2025 19:22

When I navigate to Devices > All Devices, or to Device Groups > Network or Servers, the page comes up blank except for the “No results found!” message. It started with just the Network group, then spread to the Server group, and now it also happens on the Global Settings page.

Holding Ctrl and hitting F5 a bunch eventually makes the page spit out actual content, but it can take up to 20 or so refreshes.
All devices were and are still accessible via Availability Map or the search box in the upper right.

Earlier this month, I started seeing these errors regarding alerts in the Recent Events area:

Could not issue recovery for rule 'ICMP Response Failure: Server' to transport 'mail' Error: Transport delivery failed with 0 for Server Down: No ICMP Response: Message body empty

And

#0 /opt/librenms/LibreNMS/Alert/RunAlerts.php(623): LibreNMS\Alert\Transport\Mail->deliverAlert()
#1 /opt/librenms/LibreNMS/Alert/RunAlerts.php(265): LibreNMS\Alert\RunAlerts->extTransports()
#2 /opt/librenms/LibreNMS/Alert/RunAlerts.php(574): LibreNMS\Alert\RunAlerts->issueAlert()
#3 /opt/librenms/alerts.php(61): LibreNMS\Alert\RunAlerts->runAlerts() #4 {main}
Transport delivery failed with 0 for Server Down: No ICMP Response: Message body empty

“Server Down: No ICMP Response” is the title of this particular alert transport.

The Workstation group is a single VM that I use to test alerts, so I started moving that from group to group, bringing it down and back up, to see what happened. Alerts went out fine when this test VM was in Workstation group, but the above errors appeared when it was placed in the Server or Network groups, and I think I see a correlation, though I don’t know exactly what it implies.

Going to Devices > whichever > Workstation will, every time, reliably, list the one single member device of that group; with that, I realize that this blank listing error began when we started having more than…fifty? or so? devices in a group.

The errors above state ‘Message body empty’, which I think may be caused by the same issue that causes the device list pages to show blank: initially, the alert rules matched on the group, e.g. Servers; I changed that to include devices.type=Server rather than matching a group, and the alert failed with the same error.
I think this behavior echoes the failure of the device listing page to display correct content, but I can’t find any reference to anyone else having this issue to confirm.

Also just right now: “No results found!” in the Eventlog widget on my dashboard.

Git says there are no issues.
validate.php currently shows, and always has shown, all good:

===========================================
Component | Version
--------- | -------
LibreNMS  | 25.3.0-76-gb94876a39 (2025-03-31T18:55:29-04:00)
DB Schema | 2025_03_19_205700_fix_ospfv3_ports_table (331)
PHP       | 8.3.6
Python    | 3.12.3
Database  | MariaDB 10.11.11-MariaDB-0ubuntu0.24.04.2
RRDTool   | 1.7.2
SNMP      | 5.9.4.pre2
===========================================

[OK]    Composer Version: 2.8.6
[OK]    Dependencies up-to-date.
[OK]    Database connection successful
[OK]    Database connection successful
[OK]    Database Schema is current
[OK]    SQL Server meets minimum requirements
[OK]    lower_case_table_names is enabled
[OK]    MySQL engine is optimal
[OK]    Database and column collations are correct
[OK]    Database schema correct
[OK]    MySQL and PHP time match
[OK]    Active pollers found
[OK]    Dispatcher Service not detected
[OK]    Locks are functional
[OK]    Python poller wrapper is polling
[OK]    Redis is unavailable
[OK]    rrd_dir is writable
[OK]    rrdtool version ok

Anyone have any insight on this?

laf · 2 April 2025 09:43

Check your web server logs, you might find you’re hitting max memory limits in php.

Uncle_Heavy · 2 April 2025 14:49

Web server logs. How did I not…nngh. Thank you.

From Nginx error log:
2025/04/02 08:53:55 [crit] 2948686#2948686: *197 open() "/var/lib/nginx/fastcgi/7/00/0000000007" failed (13: Permission denied) while reading upstream, client: xx.xx.xx.xx, server: librenms.host, request: "POST /ajax/table/device HTTP/1.1", upstream: "fastcgi://unix:/run/php/php-fpm-librenms.sock:", host: "librenms.host", referrer: "http://librenms.host/devices/type=network"

That line appears every time a page loads blank. So a permission issue. Straightforward enough:

someguy@librenms:~> ls -l /run/php/php-fpm-librenms.sock
srw-rw---- 1 librenms librenms 0 Apr  1 06:53 /run/php/php-fpm-librenms.sock

That makes sense, but

someguy@librenms:~> grep 'user' /etc/nginx/nginx.conf
user librenms;

So Nginx is running as librenms, and the file in question is owned by librenms:librenms, so who’s being denied, and why does it occasionally work correctly?

The default user for Nginx would be www-data otherwise, so I tried adding www-data to group librenms just to see, but that had no effect.

laf · 2 April 2025 15:15

Your php-fpm pool config is wrong; it should be listening as www-data, but the user is librenms.

root@librenms:~# grep -P 'user =|group =|listen.owner|listen.group' /etc/php/8.3/fpm/pool.d/librenms.conf 
user = librenms
group = librenms
listen.owner = www-data
listen.group = www-data
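
If it helps, here is a rough sketch for dumping just those four settings from a pool file so they are easy to compare against the output above. The function name pool_settings is made up for illustration, and it assumes plain "key = value" lines with no leading whitespace.

```shell
# Hypothetical helper: print the four ownership-related settings from a
# php-fpm pool file as key=value, skipping commented-out (";") lines.
pool_settings() {
    # $1: path to the pool config
    grep -E '^(user|group|listen\.owner|listen\.group)[[:space:]]*=' "$1" |
        sed -E 's/[[:space:]]*=[[:space:]]*/=/'
}
```

Run it as `pool_settings /etc/php/8.3/fpm/pool.d/librenms.conf`; adjust the path if your PHP version differs.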

Uncle_Heavy · 2 April 2025 16:02

Ah ok, I see.
Changed listen.owner and listen.group from librenms to www-data, reloaded PHP and Nginx. File php-fpm-librenms.sock now owned by www-data:www-data.

Unfortunately, same behavior and errors as before, and alerts still can’t send.

laf · 2 April 2025 16:23

Did you revert the user nginx is running as? That should also be www-data.

You’ve deviated from the install docs with this, it’s probably worth going back over them and double checking everything.

Uncle_Heavy · 2 April 2025 17:53

I haven’t; install docs say that user should be librenms.

Parameters listen.owner and listen.group both being librenms was my error – the www.conf file that librenms.conf is copied from has both those parameters as www-data and the instructions don’t mention changing those, so that one’s on me, no doubt. Changing it doesn’t seem to have made a difference, tho; the same pages show blank and the same error shows in the logs.

Alerts are sending reliably, but I’m uncertain, now, about that being related – I found out earlier that some “work” had been done on the alert configs and I’m fair certain that such was the cause of the alert failures, because I changed that and the alerts started sending immediately.

laf · 2 April 2025 19:10

www-data. You’ve changed your overall web server user in nginx.conf

Uncle_Heavy · 2 April 2025 21:07

Hm. I did not intend to do that.
Hang on, jfc I think I shot myself in the foot back when I installed this.

someguy@librenms:/etc/nginx> grep user nginx.conf
user librenms;

Install doesn’t say anything about making this change.
Ok, ok so I edited nginx.conf when I wasn’t supposed to. Must have made sense at the time. I guess. I’ve done worse.

Then the above should return:
user www-data;

Yes?

laf · 2 April 2025 21:22

Yup, then restart the web server
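
A minimal sketch of that change, in case editing by hand feels risky. The helper name set_nginx_user is hypothetical, and it assumes the user directive sits at the start of a line in the file, as in the grep output above. Take a backup of nginx.conf first.

```shell
# Hypothetical helper: rewrite the top-level "user" directive in an nginx
# config file. $1 is the config file, $2 is the account to run workers as.
set_nginx_user() {
    conf=$1
    new_user=$2
    tmp=$(mktemp) || return 1
    # Write the edited copy to a temp file, then copy the content back so
    # the original file keeps its owner and mode.
    if sed -E "s/^user[[:space:]]+[^;]+;/user ${new_user};/" "$conf" > "$tmp"; then
        cat "$tmp" > "$conf"
    fi
    rm -f "$tmp"
}
```

As root, `set_nginx_user /etc/nginx/nginx.conf www-data`, then `nginx -t` to confirm the config still parses, then `systemctl restart nginx`.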

Uncle_Heavy · 2 April 2025 21:55

Hot damn, there it is, every time.

Thank you much.

Any idea why the permission error was a ‘most of the time but not always’ thing?

laf · 2 April 2025 22:17

None of it should have worked but some caching somewhere I expect.
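
For anyone landing here with the same log line: the file nginx could not open is under /var/lib/nginx/fastcgi, not the php-fpm socket. My assumption, not confirmed in this thread, is that this is the directory nginx uses for FastCGI responses too large for its in-memory buffers, which would fit larger device lists failing while a one-device group always loads. A rough sketch for pulling the path out of the error line and comparing owners; the function name failed_open_path is made up.

```shell
# Hypothetical helper: read nginx error lines on stdin and print the path
# from any 'open() "..." failed' message.
failed_open_path() {
    sed -n -E 's/.*open\(\) "([^"]+)" failed.*/\1/p'
}

# Error line from earlier in the thread, trimmed after "while reading upstream".
line='2025/04/02 08:53:55 [crit] 2948686#2948686: *197 open() "/var/lib/nginx/fastcgi/7/00/0000000007" failed (13: Permission denied) while reading upstream'
path=$(printf '%s\n' "$line" | failed_open_path)
echo "$path"

# Compare the owner of that temp directory with the user nginx runs as.
# Both checks are skipped quietly if the directory or process is absent.
if [ -d /var/lib/nginx/fastcgi ]; then
    ls -ld /var/lib/nginx/fastcgi
fi
ps -o user=,comm= -C nginx 2>/dev/null || true
```

If the directory owner and the nginx worker user differ, that mismatch is the thing to fix, which is what reverting to www-data did here.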
