Rrd_fetch_r failed

jaysenlinux · 11 September 2021 21:18

I am having an issue with some of my graphs not being generated and was hoping to get some insight into what might be going on. I am running LibreNMS on an Ubuntu 20.04 LTS server with rrdcached. I’ve followed all of the documentation to set it up, and for the most part it all seems to be working well, with the exception of the following errors that I’m seeing under Poller > Performance (see attached screenshots).

I’ve run the ./daily.sh and ./validate.php scripts and receive no errors. The poller-perf.rrd file does exist and permissions seem to be correct, but it appears no data is being written. Could someone help me figure out what might be going on and point me in the right direction as to where to start looking? I’ve spent a few days trying to figure this out and I’m not having much success.

librenms@librenms:~$ ./validate.php

Component	Version
LibreNMS	21.8.0-50-g055895e4a
DB Schema	2021_25_01_0129_isis_adjacencies_nullable (217)
PHP	7.4.3
Python	3.8.10
MySQL	10.3.31-MariaDB-0ubuntu0.20.04.1
RRDTool	1.7.2
SNMP	NET-SNMP 5.8
====================================

[OK] Composer Version: 2.1.6
[OK] Dependencies up-to-date.
[OK] Database connection successful
[OK] Database schema correct
librenms@librenms:~$

root@librenms:/opt/librenms/rrd/216.21.15.135# ls -al
total 20484
drwxrwxr-x+   2 librenms librenms   4096 Sep  8 20:02 .
drwxrwxr-x+ 926 librenms librenms  36864 Sep 10 17:47 ..
-rw-r--r--    1 librenms librenms 171272 Sep 11 13:14 availability-2592000.rrd
-rw-r--r--    1 librenms librenms 171272 Sep 11 13:30 availability-31536000.rrd
-rw-r--r--    1 librenms librenms 171272 Sep 11 13:59 availability-604800.rrd
-rw-r--r--    1 librenms librenms 171272 Sep 11 13:59 availability-86400.rrd
-rw-r--r--    1 librenms librenms      0 Sep  8 19:31 netstats-snmp.rrd
-rw-r--r--    1 librenms librenms      0 Sep  8 19:31 ping-perf.rrd
-rw-r--r--    1 librenms librenms      0 Sep  8 19:31 poller-perf-applications.rrd
-rw-r--r--    1 librenms librenms      0 Sep  8 19:30 poller-perf-availability.rrd
-rw-r--r--    1 librenms librenms      0 Sep  8 19:31 poller-perf-bgp-peers.rrd
-rw-r--r--    1 librenms librenms      0 Sep  8 19:30 poller-perf-core.rrd
-rw-r--r--    1 librenms librenms      0 Sep  8 19:31 poller-perf-customoid.rrd
-rw-r--r--    1 librenms librenms      0 Sep  8 19:31 poller-perf-entity-physical.rrd
-rw-r--r--    1 librenms librenms      0 Sep  8 19:31 poller-perf-hr-mib.rrd
-rw-r--r--    1 librenms librenms      0 Sep  8 19:30 poller-perf-ipmi.rrd
-rw-r--r--    1 librenms librenms      0 Sep  8 19:31 poller-perf-ipSystemStats.rrd
-rw-r--r--    1 librenms librenms      0 Sep  8 19:30 poller-perf-mempools.rrd
-rw-r--r--    1 librenms librenms      0 Sep  8 19:31 poller-perf-mpls.rrd
-rw-r--r--    1 librenms librenms      0 Sep  8 19:31 poller-perf-netstats.rrd
-rw-r--r--    1 librenms librenms      0 Sep  8 19:31 poller-perf-ntp.rrd
-rw-r--r--    1 librenms librenms      0 Sep  8 19:31 poller-perf-ospf.rrd
-rw-r--r--    1 librenms librenms      0 Sep  8 19:30 poller-perf-os.rrd
-rw-r--r--    1 librenms librenms      0 Sep  8 19:31 poller-perf-ports.rrd
-rw-r--r--    1 librenms librenms      0 Sep  8 19:30 poller-perf-processors.rrd
-rw-r--r--    1 librenms librenms      0 Sep  8 19:31 poller-perf.rrd

When I look at the RRD command, I see the following error:

ERROR: rrdcached@unix:/run/rrdcached.sock: rrd_fetch_r failed: mmaping file '/opt/librenms/rrd/216.21.15.135/poller-perf.rrd': Invalid argument

Thanks in advance for the help.

murrant · 13 September 2021 00:51

Many of those files have 0 size.

Are you perhaps out of disk space? (or were you recently?)

If not, delete all the 0-sized files and let LibreNMS recreate them.
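A quick way to rule out a full filesystem is to check both free space and free inodes, since a filesystem can run out of inodes while still showing free space. The sketch below is a minimal helper, not something from the thread; the function name `check_fs` is made up for illustration, and the example path is the RRD directory shown in the first post.

```shell
# Report block usage and inode usage for the filesystem that holds a path.
# check_fs is a hypothetical helper name used only for this illustration.
check_fs() {
    df -h "$1"   # free space in human-readable units
    df -i "$1"   # free inodes; this can reach 100% while space remains
}

# Example (path taken from the first post):
#   check_fs /opt/librenms/rrd
```

Note that this only shows the current state. If the disk filled up briefly in the past, the output will look healthy even though the empty files were created back then.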

jaysenlinux · 13 September 2021 01:07

Hello!

Thanks for your response. Disk space doesn’t appear to be an issue. I’m not seeing any errors indicating that I am out of space. I’m curious what the invalid argument error at the end of the rrd command is all about.

I went ahead and removed all the 0 sized files per your suggestion. Watching to see if they are recreated.

Thank You!

jaysenlinux · 13 September 2021 02:13

After I deleted the 0-sized files, LibreNMS recreated them and is now writing to them. However, I still have the same error, now on another device. I suspect I have 0-sized files on all of my devices, and that removing them so they are recreated will resolve the problem. It’s looking like I have some work to do to clear them out. I have over 1000 devices. This is going to be fun.

jaysenlinux · 13 September 2021 03:26

I was able to use a find command to search for all the 0-sized files in my rrd directory and remove them quickly. The graphs are now showing under Total Poller Time, but now I get broken images under Total Poller Time Per Module.
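The exact find command is not shown in the thread. The following is a minimal sketch of one way to do it on a GNU/Linux system, split into a read-only listing step and a separate delete step so the list can be reviewed first. The function names are hypothetical. As background, on Linux `mmap` fails with EINVAL when asked to map a length of zero, which is consistent with the "Invalid argument" message rrdcached reported for the empty files.

```shell
# List zero-byte .rrd files under a directory (read-only; deletes nothing).
# Function names here are made up for illustration.
list_empty_rrds() {
    find "$1" -type f -name '*.rrd' -size 0 -print
}

# Delete the same set of files. Run this only after reviewing the list.
# -delete is supported by GNU and BSD find, but is not in POSIX.
delete_empty_rrds() {
    find "$1" -type f -name '*.rrd' -size 0 -delete
}

# Example (path taken from the thread), run as the librenms user:
#   list_empty_rrds /opt/librenms/rrd
#   delete_empty_rrds /opt/librenms/rrd
```

Restricting the match to `*.rrd` regular files keeps the delete step from touching anything else that happens to be empty in the tree.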

If I click on one of the broken images to see whether it gives me an error or a clue as to what’s going on, it spins its wheels for about a minute and then I get a gateway timeout error. I’m not sure what is causing the gateway timeout now. It wasn’t doing that before I deleted the 0-sized files.

I am running rrdcached, and when I ran systemctl status on it, it initially gave me an error about not being able to read an RRD file. I restarted rrdcached and that error is now gone; it seems to restart normally. I also restarted the nginx and php7.4-fpm services and they restart normally with no errors.

I ran ./daily.sh and ./validate.php and get no errors.

Could there be something in my cache that is causing it to hang up and give me gateway timeouts? Any thoughts?

murrant · 14 September 2021 16:54

Probably not all the rrd files exist for the total poller time graph. Give it time to get data back and it should be working again.

jaysenlinux · 15 September 2021 17:33

The total poller time graphs are showing up. It’s the Total Poller Time Per Module graphs below those graphs that are broken as shown in my screenshot. The other day I tweaked some php-fpm settings to increase the number of servers and I did get one of those graphs to show up but now it’s back to the broken images again. It’s been days and still nothing appears. I’m not sure how much time needs to pass before they will show up but it doesn’t seem normal to have broken images.

Thanks Again

jaysenlinux · 26 September 2021 04:45

Looking further at my LibreNMS error log, I am getting this:

tail /opt/librenms/logs/error_log
2021/09/25 20:20:02 [error] 966#966: *1 upstream timed out (110: Connection timed out) while reading response header from upstream, client: x.x.x.x, server: librenms.server.com, request: "GET /graph.php?type=global_poller_modules_perf&legend=yes&height=176&width=379&from=1632539700 HTTP/2.0", upstream: "fastcgi://unix:/run/php7.4-fpm-librenms.sock", host: "librenms.server.com", referrer: "https://librenms.server.com/poller/performance"

This tells me there is something going on with php-fpm. Over the last week I’ve tried various performance tweaks, including increasing the max children and max servers. I’ve thrown more CPU and RAM at it. I’ve even tried switching to static mode and increasing the max children and max requests to an insanely high number. It’ll work temporarily but then goes back to timing out again. I’ve also tried changing my nginx worker processes to match the number of CPU cores and increased the number of worker connections. I turned multi_accept on. I tried disabling access logging to free up disk I/O and memory. Nothing seems to fix this. I’m at a loss.

Each php process that spawns takes up around 52MB of memory.

I am running this on a Proxmox host and my VM setup is as follows:

Ubuntu 20.04 LTS
php7.4/php7.4-fpm with opcache enabled
rrdcached is also enabled
CPU: 16 total cores split between 2 sockets
32GB RAM
100GB disk space
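For reference, the figures above give a rough ceiling for the php-fpm max children setting: the RAM left after everything else on the host, divided by the memory per PHP process. The sketch below is only that arithmetic. The 32 GB total and roughly 52 MB per process come from this post, while the 8 GB reserved for MariaDB, rrdcached, the pollers and the OS is an assumed figure, and `max_children` is a hypothetical helper name.

```shell
# Rough upper bound for php-fpm workers:
#   (total RAM - RAM reserved for everything else) / memory per worker
# All arguments are in MB. max_children is a hypothetical helper name.
max_children() {
    echo $(( ($1 - $2) / $3 ))
}

# 32768 MB total and 52 MB per worker are from the post;
# the 8192 MB reserve is an assumption.
max_children 32768 8192 52   # prints 472
```

This only bounds memory use. It says nothing about how long a single graph request takes, so raising the worker count beyond what the request rate needs would not be expected to shorten a slow graph render.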

I’m polling around 1000 or so devices, mostly Ubiquiti, Mimosa, and Cambium devices.

Every other LibreNMS page seems to load very quickly. It’s just the poller performance pages that are an issue.

Any more ideas or things to try would be very helpful.

Jellyfrog · 26 September 2021 09:12

Increase your timeouts in nginx and php
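The thread does not say which settings to change. The directives that commonly govern this are `fastcgi_read_timeout` in nginx (60s by default, which matches the roughly one-minute wait before the gateway timeout described earlier), `max_execution_time` in php.ini, and `request_terminate_timeout` in the php-fpm pool. The sketch below only prints the values currently set, so they can be compared before and after editing. The function name and the example file paths are assumptions for a stock Ubuntu 20.04 / PHP 7.4 layout and may differ on a given install.

```shell
# Print active (non-commented) timeout directives from the given config
# files, prefixed with file name and line number.
# show_timeouts is a hypothetical helper name.
show_timeouts() {
    grep -HnE '^[[:space:]]*(fastcgi_read_timeout|max_execution_time|request_terminate_timeout)' "$@"
}

# Example (paths are assumptions; adjust to the actual install):
#   show_timeouts /etc/nginx/conf.d/librenms.conf \
#                 /etc/php/7.4/fpm/php.ini \
#                 /etc/php/7.4/fpm/pool.d/librenms.conf
# After editing, validate and reload:
#   sudo nginx -t && sudo systemctl reload nginx
#   sudo systemctl restart php7.4-fpm
```

If the helper prints nothing for a directive, that directive is not set explicitly in those files and the built-in default applies.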

jaysenlinux · 26 September 2021 18:12

Thank you! I increased the timeouts to 300s, and that did get the graphs to load and resolved the timeout issue. However, it still takes an incredibly long time to load them: between 2 and 4 minutes or more. This is only on the poller performance page. The rest of LibreNMS seems to load fairly quickly. Do you have any further suggestions on how to speed that up? With 16 cores and 32GB of RAM I would think performance would be better. Proxmox shows it’s only using around 23% - 26% of RAM and 22% - 32% of CPU. Not sure why this is so slow.
