Hi LibreNMS Community,
I currently have a weird issue where RRDCached has been randomly failing to connect for the past couple of weeks, and I’m not making much progress with my current troubleshooting.
It had been running solidly for over a year, and nothing was changed around the time this issue started.
Updates under daily.sh were disabled intentionally to comply with our internal change controls after a few previous updates broke our instance overnight, so we opted to update manually when we can.
Part of my troubleshooting was to upgrade our instance to the latest version to see if that resolves the issue. However, I’m having a lot of problems going from 1.69-22 to 21.8.0-55, so that’s a job for another day.
Has anyone had an issue where RRDCached encounters the following error (consistently at the same 23-24 hour mark after being restarted)?
/var/log/syslog
Sep 20 09:09:52 librenms rrdcached[13413]: listen_thread_main: accept(2) failed.
Sep 20 09:09:52 librenms rrdcached[13413]: message repeated 10 times: [ listen_thread_main: accept(2) failed.]
Sep 20 09:09:52 librenms rrdcached[13413]: listen_thread_main: accept(2) failed.
Sep 20 09:09:52 librenms rrdcached[13413]: message repeated 27 times: [ listen_thread_main: accept(2) failed.]
Sep 20 09:09:52 librenms rrdcached[13413]: listen_thread_main: accept(2) failed.
Sep 20 09:09:57 librenms rrdcached[13413]: message repeated 375087 times: [ listen_thread_main: accept(2) failed.]
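In case it helps anyone compare numbers, here is a minimal sketch for tallying these messages, including the lines that syslog collapsed into "message repeated N times". The function name count_accept_failures is hypothetical (it is not part of rrdcached or LibreNMS), and the patterns assume the exact log format shown above.

```shell
#!/bin/sh
# Tally rrdcached "accept(2) failed" events in a syslog file.
# count_accept_failures is a hypothetical helper name used for illustration only.
count_accept_failures() {
    # $1: path to the log file, e.g. /var/log/syslog
    awk '
        /rrdcached\[[0-9]+\]: message repeated [0-9]+ times: \[ listen_thread_main: accept\(2\) failed/ {
            # Summary line: add the number that follows the word "repeated".
            for (i = 1; i <= NF; i++)
                if ($i == "repeated") { total += $(i + 1); break }
            next
        }
        /rrdcached\[[0-9]+\]: listen_thread_main: accept\(2\) failed/ {
            # Plain line: counts as a single failure.
            total += 1
        }
        END { print total + 0 }
    ' "$1"
}

# Example (not run here):
#   count_accept_failures /var/log/syslog
```

Summing the repeat counts rather than counting lines gives a better picture of how fast the daemon is spinning on the failed accept call.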
I’m not able to find much via Google searching. I’ve tried changing the locking service for the clustered dispatcher pollers from SQL to Redis, per Scaling LibreNMS - LibreNMS Docs, as I thought SQL performance issues might have been causing it to crash.
/opt/librenms/.env - CACHE_DRIVER=redis
RRDCached is already the latest version for the OS. I could look at upgrading the OS to 20.04 LTS, which I believe would update RRDCached to 1.7.2.
~$ sudo apt-get install --only-upgrade rrdcached
Reading package lists… Done
Building dependency tree
Reading state information… Done
rrdcached is already the newest version (1.7.0-1build1).
RRDCached Configuration on Core Server - /etc/default/rrdcached
DAEMON=/usr/bin/rrdcached
DAEMON_USER=librenms
DAEMON_GROUP=librenms
WRITE_THREADS=4
WRITE_TIMEOUT=1800
WRITE_JITTER=1800
BASE_PATH=/opt/librenms/rrd/
JOURNAL_PATH=/var/lib/rrdcached/journal/
PIDFILE=/var/run/rrdcached.pid
SOCKFILE=/var/run/rrdcached.sock
SOCKGROUP=librenms
BASE_OPTIONS="-B -F -R"
NETWORK_OPTIONS="-L"
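For anyone who wants to compare against the running process (for example the output of ps), here is a rough sketch that prints the command line these settings would roughly correspond to. The helper name build_rrdcached_cmd is hypothetical, and the variable-to-flag mapping is my assumption rather than something taken from the packaged init script, so please check the script or unit file shipped with your package. DAEMON_USER and DAEMON_GROUP are left out on purpose because how they are applied depends on the init system.

```shell
#!/bin/sh
# build_rrdcached_cmd is a hypothetical helper: it sources a defaults file and
# prints an approximate rrdcached command line. The mapping of variables to
# flags below is an assumption, not copied from the packaged init script.
build_rrdcached_cmd() {
    # $1: path to a defaults file such as /etc/default/rrdcached
    (
        # Source in a subshell so the variables do not leak into the caller.
        . "$1" || exit 1
        # -s (socket group) is placed before the -l it should apply to.
        echo "${DAEMON} ${BASE_OPTIONS} ${NETWORK_OPTIONS}" \
             "-t ${WRITE_THREADS} -w ${WRITE_TIMEOUT} -z ${WRITE_JITTER}" \
             "-b ${BASE_PATH} -j ${JOURNAL_PATH} -p ${PIDFILE}" \
             "-s ${SOCKGROUP} -l unix:${SOCKFILE}"
    )
}

# Example (not run here):
#   build_rrdcached_cmd /etc/default/rrdcached
```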
I did note that my configuration has the path “SOCKFILE=/var/run/rrdcached.sock”, whereas the documentation has “SOCKFILE=/run/rrdcached.sock”.
When checking the path under the host OS it appears to be the same anyway.
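A minimal sketch of that check, assuming GNU readlink is available (it is on Ubuntu). The helper name same_target is hypothetical. It succeeds when both paths resolve to the same location, which is the case when /var/run is a symlink to /run.

```shell
#!/bin/sh
# same_target is a hypothetical helper: returns 0 when two paths resolve to the
# same canonical location, 1 when they differ, 2 when a path cannot be resolved.
same_target() {
    a=$(readlink -f -- "$1") || return 2
    b=$(readlink -f -- "$2") || return 2
    [ "$a" = "$b" ]
}

# Example (not run here):
#   same_target /var/run/rrdcached.sock /run/rrdcached.sock && echo "same socket"
```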
Our current LibreNMS setup, all running on Ubuntu 18.04 LTS (not including our Redis, InfluxDB, and Oxidized servers):
- 1 x Core Server (WebGUI, SQL DB, RRD, API)
- 3 x Core Dispatcher Pollers (Single poller group, managed via REDIS)
- 1 x DMZ Dispatcher Poller (Single poller group that is in a remote network)
Two of the core dispatchers are currently shut down as part of my troubleshooting.
I’ve restarted every single server in the stack, including our REDIS server.
Output from validate.php on my servers.
Core - WebGUI, SQL DB, RRD | 16vCPU, 24GB RAM, 450GB VMDK (RAID 6 - 10k drives)
~$ sudo su - librenms
$ ./validate.php
Component | Version
LibreNMS | 1.69-22-gcfd9dce62
DB Schema | 2020_11_02_164331_add_powerstate_enum_to_vminfo (190)
PHP | 7.4.23
Python | 3.6.9
MySQL | 10.1.48-MariaDB-0ubuntu0.18.04.1
RRDTool | 1.7.0
SNMP | NET-SNMP 5.7.3
OpenSSL |
====================================
[OK] Composer Version: 2.1.6
[OK] Dependencies up-to-date.
[OK] Database connection successful
[OK] Database schema correct
[WARN] Your install is over 24 hours out of date, last update: Wed, 11 Nov 2020 00:15:20 +0000
[FIX]:
Make sure your daily.sh cron is running and run ./daily.sh by hand to see if there are any errors.
Dispatcher Poller - In a cluster that uses a redis server | 8 vCPU, 8GB RAM, 30GB VMDK (RAID 6 - 10k drives)
~$ sudo su - librenms
$ ./validate.php
Component | Version
LibreNMS | 1.69-22-gcfd9dce62
DB Schema | 2020_11_02_164331_add_powerstate_enum_to_vminfo (190)
PHP | 7.4.12
Python | 3.6.9
MySQL | 10.1.48-MariaDB-0ubuntu0.18.04.1
RRDTool | 1.7.0
SNMP | NET-SNMP 5.7.3
OpenSSL |
====================================
[OK] Composer Version: 2.1.7
[OK] Dependencies up-to-date.
[OK] Database connection successful
[OK] Database schema correct
DMZ Dispatcher Poller - 6 vCPU, 4GB RAM, 30GB VMDK (RAID 6 - 10k drives)
~$ sudo su - librenms
$ ./validate.php
Component | Version
LibreNMS | 1.69-22-gcfd9dce62
DB Schema | 2020_11_02_164331_add_powerstate_enum_to_vminfo (190)
PHP | 7.4.12
Python | 3.6.9
MySQL | 10.1.48-MariaDB-0ubuntu0.18.04.1
RRDTool | 1.7.0
SNMP | NET-SNMP 5.7.3
OpenSSL |
====================================
[OK] Composer Version: 2.1.7
[OK] Dependencies up-to-date.
[OK] Database connection successful
[OK] Database schema correct
Any information you can provide is greatly appreciated and please let me know if you require any further information from me.
Thanks in advance.