Bug Report - Librenms service randomly stopping

Validate.php output:

====================================
Component | Version
--------- | -------
LibreNMS  | 1.49-33-g6ec97309a
DB Schema | 2019_02_10_220000_add_dates_to_fdb (132)
PHP       | 7.2.15-0ubuntu0.18.04.1
MySQL     | 10.0.38-MariaDB-0ubuntu0.16.04.1
RRDTool   | 1.7.0
SNMP      | NET-SNMP 5.7.3
====================================

[OK]    Composer Version: 1.8.4
[OK]    Dependencies up-to-date.
[OK]    Database connection successful
[OK]    Database schema correct
[WARN]  Your install is over 24 hours out of date, last update: Sun, 10 Mar 2019 13:38:46 +0000
	[FIX]: 
	Make sure your daily.sh cron is running and run ./daily.sh by hand to see if there are any errors.
[WARN]  Your local git contains modified files, this could prevent automatic updates.
	[FIX]: 
	You can fix this with ./scripts/github-remove
	Modified Files:
	 LibreNMS/service.py
	 syslog.php
[FAIL]  Some folders have incorrect file permissions, this may cause issues.
	[FIX]: 
	sudo chown -R librenms:librenms /opt/librenms
	sudo setfacl -d -m g::rwx /opt/librenms/rrd /opt/librenms/logs /opt/librenms/bootstrap/cache/ /opt/librenms/storage/
	sudo chmod -R ug=rwX /opt/librenms/rrd /opt/librenms/logs /opt/librenms/bootstrap/cache/ /opt/librenms/storage/
	Files:
	 /opt/librenms/bootstrap/cache/packages.php

The service.py is just edited to knock out the alert manager portion to prevent duplicate alerts, nothing more. syslog.php is edited to handle custom traps for our Meraki appliances.

I’m running the libre polling service on 4 different pollers. Occasionally, like once a week or two, the service silently dies and stops polling. The librenms.service shows as running:

sudo systemctl status librenms
â—Ź librenms.service - LibreNMS SNMP Poller Service
   Loaded: loaded (/etc/systemd/system/librenms.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2019-03-07 04:02:46 CST; 5 days ago
 Main PID: 1421 (python3)
    Tasks: 74 (limit: 4915)
   CGroup: /system.slice/librenms.service
           └─1421 /usr/bin/python3 /opt/librenms/librenms-service.py -v

Mar 12 08:56:00 librepoller librenms-service.py[1421]: Billing(INFO):Calculating billing complete
Mar 12 08:57:00 librepoller librenms-service.py[1421]: performance(INFO):Counting up time spent polling
Mar 12 08:57:00 librepoller librenms-service.py[1421]: Billing(INFO):Calculating billing
Mar 12 08:57:00 librepoller librenms-service.py[1421]: Billing(INFO):Calculating billing complete
Mar 12 08:58:00 librepoller librenms-service.py[1421]: performance(INFO):Counting up time spent polling
Mar 12 08:58:00 librepoller librenms-service.py[1421]: Billing(INFO):Calculating billing
Mar 12 08:58:00 librepoller librenms-service.py[1421]: Billing(INFO):Calculating billing complete
Mar 12 08:59:00 librepoller librenms-service.py[1421]: performance(INFO):Counting up time spent polling
Mar 12 08:59:00 librepoller librenms-service.py[1421]: Billing(INFO):Calculating billing
Mar 12 08:59:00 librepoller librenms-service.py[1421]: Billing(INFO):Calculating billing complete

The librenms.log just shows where it left off:

sudo tail -f logs/librenms.log
/opt/librenms/discovery.php 1991 2019-03-11 22:05:32 - 1 devices discovered in 3.822 secs
/opt/librenms/discovery.php 2015 2019-03-11 22:05:32 - 1 devices discovered in 2.587 secs
/opt/librenms/discovery.php 1993 2019-03-11 22:05:32 - 1 devices discovered in 3.991 secs
/opt/librenms/discovery.php 2028 2019-03-11 22:05:32 - 1 devices discovered in 4.304 secs
/opt/librenms/discovery.php 2025 2019-03-11 22:05:32 - 1 devices discovered in 2.858 secs
/opt/librenms/discovery.php 2030 2019-03-11 22:05:32 - 1 devices discovered in 4.327 secs
/opt/librenms/discovery.php 2023 2019-03-11 22:05:32 - 1 devices discovered in 3.017 secs
/opt/librenms/discovery.php 1992 2019-03-11 22:05:38 - 1 devices discovered in 30.53 secs
/opt/librenms/discovery.php 1986 2019-03-11 22:05:40 - 1 devices discovered in 12.84 secs
/opt/librenms/discovery.php 1912 2019-03-11 22:05:46 - 1 devices discovered in 22.84 secs

The redis server service shows as running, but had one complaint about the PID file about 4 days before the librenms-service.py had any issues:

sudo systemctl status redis-server.service 
â—Ź redis-server.service - Advanced key-value store
   Loaded: loaded (/lib/systemd/system/redis-server.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2019-03-07 04:03:01 CST; 5 days ago
     Docs: http://redis.io/documentation,
           man:redis-server(1)
 Main PID: 1760 (redis-server)
    Tasks: 4 (limit: 4915)
   CGroup: /system.slice/redis-server.service
           └─1760 /usr/bin/redis-server 0.0.0.0:6379

Mar 07 04:02:46 librepoller systemd[1]: Starting Advanced key-value store...
Mar 07 04:03:00 librepoller systemd[1]: redis-server.service: Can't open PID file /var/run/redis/redis-server.pid (yet?) after start: No such file or directory
Mar 07 04:03:01 librepoller systemd[1]: Started Advanced key-value store.

I noticed in /var/log/syslog the following however:

Mar 12 06:40:00 librepoller librenms-service.py[1421]: performance(INFO):Counting up time spent polling
Mar 12 06:40:00 librepoller librenms-service.py[1421]: Billing(INFO):Polling billing
Mar 12 06:40:00 librepoller librenms-service.py[1421]: Billing(INFO):Polling billing complete

a lot, then the notable part was:

Mar 12 08:59:02 librepoller systemd[1]: Starting resolvconf-pull-resolved.service...
Mar 12 08:59:02 librepoller systemd[1]: Started resolvconf-pull-resolved.service.
Mar 12 08:59:02 librepoller dbus-daemon[1278]: [system] Reloaded configuration
Mar 12 08:59:12 librepoller kernel: [446255.504388] print_req_error: I/O error, dev fd0, sector 0
Mar 12 08:59:12 librepoller kernel: [446255.504468] floppy: error 10 while reading block 0
Mar 12 09:00:00 librepoller librenms-service.py[1421]: Billing(INFO):Polling billing
Mar 12 09:00:00 librepoller librenms-service.py[1421]: Billing(INFO):Polling billing complete
Mar 12 09:00:00 librepoller librenms-service.py[1421]: performance(INFO):Counting up time spent polling
Mar 12 09:00:00 librepoller librenms-service.py[1421]: Billing(INFO):Polling billing
Mar 12 09:00:01 librepoller librenms-service.py[1421]: Billing(INFO):Polling billing complete
Mar 12 09:01:00 librepoller librenms-service.py[1421]: performance(INFO):Counting up time spent polling
Mar 12 09:01:00 librepoller librenms-service.py[1421]: Billing(INFO):Calculating billing
Mar 12 09:01:00 librepoller librenms-service.py[1421]: Billing(INFO):Calculating billing complete
Mar 12 09:02:00 librepoller librenms-service.py[1421]: performance(INFO):Counting up time spent polling
Mar 12 09:02:00 librepoller librenms-service.py[1421]: Billing(INFO):Calculating billing
Mar 12 09:02:01 librepoller librenms-service.py[1421]: Billing(INFO):Calculating billing complete
Mar 12 09:02:05 librepoller systemd[1]: Stopping Advanced key-value store...
Mar 12 09:02:06 librepoller librenms-service.py[1421]: Poller_0-33(ERROR):Poller poller exception! Error while reading from socket: ('Connection closed by server.',)
Mar 12 09:02:06 librepoller librenms-service.py[1421]: Discovery_0-14(ERROR):Discovery poller exception! Error while reading from socket: ('Connection closed by server.',)
Mar 12 09:02:06 librepoller librenms-service.py[1421]: Poller_0-14(ERROR):Poller poller exception! Error while reading from socket: ('Connection closed by server.',)
Mar 12 09:02:06 librepoller librenms-service.py[1421]: Poller_0-2(ERROR):Poller poller exception! Error while reading from socket: ('Connection closed by server.',)
Mar 12 09:02:06 librepoller librenms-service.py[1421]: Poller_0-35(ERROR):Poller poller exception! Error while reading from socket: ('Connection closed by server.',)
Mar 12 09:02:06 librepoller librenms-service.py[1421]: Poller_0-29(ERROR):Poller poller exception! Error while reading from socket: ('Connection closed by server.',)
Mar 12 09:02:06 librepoller systemd[1]: Stopped Advanced key-value store.
Mar 12 09:02:06 librepoller systemd[1]: Starting Advanced key-value store...
Mar 12 09:02:06 librepoller librenms-service.py[1421]: Poller_0-5(ERROR):Poller poller exception! Error while reading from socket: ('Connection closed by server.',)
Mar 12 09:02:06 librepoller librenms-service.py[1421]: Discovery_0-29(ERROR):Discovery poller exception! Error while reading from socket: ('Connection closed by server.',)
Mar 12 09:02:06 librepoller librenms-service.py[1421]: Discovery_0-24(ERROR):Discovery poller exception! Error while reading from socket: ('Connection closed by server.',)

After some repeats of that, I then saw:

Mar 12 09:04:00 librepoller librenms-service.py[1421]: Exception in thread calculate_billing_timer:
Mar 12 09:04:00 librepoller librenms-service.py[1421]: Traceback (most recent call last):
Mar 12 09:04:00 librepoller librenms-service.py[1421]:   File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 184, in _read_from_socket
Mar 12 09:04:00 librepoller librenms-service.py[1421]:     raise socket.error(SERVER_CLOSED_CONNECTION_ERROR)
Mar 12 09:04:00 librepoller librenms-service.py[1421]: OSError: Connection closed by server.
Mar 12 09:04:00 librepoller librenms-service.py[1421]: During handling of the above exception, another exception occurred:
Mar 12 09:04:00 librepoller librenms-service.py[1421]: Traceback (most recent call last):
Mar 12 09:04:00 librepoller librenms-service.py[1421]:   File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
Mar 12 09:04:00 librepoller librenms-service.py[1421]:     self.run()
Mar 12 09:04:00 librepoller librenms-service.py[1421]:   File "/usr/lib/python3.6/threading.py", line 864, in run
Mar 12 09:04:00 librepoller librenms-service.py[1421]:     self._target(*self._args, **self._kwargs)
Mar 12 09:04:00 librepoller librenms-service.py[1421]:   File "/opt/librenms/LibreNMS/__init__.py", line 99, in _loop
Mar 12 09:04:00 librepoller librenms-service.py[1421]:     self.target()
Mar 12 09:04:00 librepoller librenms-service.py[1421]:   File "/opt/librenms/LibreNMS/service.py", line 410, in dispatch_calculate_billing
Mar 12 09:04:00 librepoller librenms-service.py[1421]:     self.billing_manager.post_work('calculate', 0)
Mar 12 09:04:00 librepoller librenms-service.py[1421]:   File "/opt/librenms/LibreNMS/queuemanager.py", line 74, in post_work
Mar 12 09:04:00 librepoller librenms-service.py[1421]:     self.get_queue(queue_id).put(payload)
Mar 12 09:04:00 librepoller librenms-service.py[1421]:   File "/opt/librenms/LibreNMS/__init__.py", line 265, in put
Mar 12 09:04:00 librepoller librenms-service.py[1421]:     self._redis.rpush(self.key, item)
Mar 12 09:04:00 librepoller librenms-service.py[1421]:   File "/usr/local/lib/python3.6/dist-packages/redis/client.py", line 1666, in rpush
Mar 12 09:04:00 librepoller librenms-service.py[1421]:     return self.execute_command('RPUSH', name, *values)
Mar 12 09:04:00 librepoller librenms-service.py[1421]:   File "/usr/local/lib/python3.6/dist-packages/redis/client.py", line 775, in execute_command
Mar 12 09:04:00 librepoller librenms-service.py[1421]:     return self.parse_response(connection, command_name, **options)
Mar 12 09:04:00 librepoller librenms-service.py[1421]:   File "/usr/local/lib/python3.6/dist-packages/redis/client.py", line 789, in parse_response
Mar 12 09:04:00 librepoller librenms-service.py[1421]:     response = connection.read_response()
Mar 12 09:04:00 librepoller librenms-service.py[1421]:   File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 636, in read_response
Mar 12 09:04:00 librepoller librenms-service.py[1421]:     raise e
Mar 12 09:04:00 librepoller librenms-service.py[1421]:   File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 633, in read_response
Mar 12 09:04:00 librepoller librenms-service.py[1421]:     response = self._parser.read_response()
Mar 12 09:04:00 librepoller librenms-service.py[1421]:   File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 291, in read_response
Mar 12 09:04:00 librepoller librenms-service.py[1421]:     response = self._buffer.readline()
Mar 12 09:04:00 librepoller librenms-service.py[1421]:   File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 223, in readline
Mar 12 09:04:00 librepoller librenms-service.py[1421]:     self._read_from_socket()
Mar 12 09:04:00 librepoller librenms-service.py[1421]:   File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 198, in _read_from_socket
Mar 12 09:04:00 librepoller librenms-service.py[1421]:     (e.args,))
Mar 12 09:04:00 librepoller librenms-service.py[1421]: redis.exceptions.ConnectionError: Error while reading from socket: ('Connection closed by server.',)
Mar 12 09:04:00 librepoller librenms-service.py[1421]: performance(INFO):Counting up time spent polling
Mar 12 09:04:00 librepoller librenms-service.py[1421]: performance(ERROR):Unable to log performance statistics - is the database still online?
Mar 12 09:04:00 librepoller librenms-service.py[1421]: Traceback (most recent call last):
Mar 12 09:04:00 librepoller librenms-service.py[1421]:   File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 184, in _read_from_socket
Mar 12 09:04:00 librepoller librenms-service.py[1421]:     raise socket.error(SERVER_CLOSED_CONNECTION_ERROR)
Mar 12 09:04:00 librepoller librenms-service.py[1421]: OSError: Connection closed by server.
Mar 12 09:04:00 librepoller librenms-service.py[1421]: During handling of the above exception, another exception occurred:
Mar 12 09:04:00 librepoller librenms-service.py[1421]: Traceback (most recent call last):
Mar 12 09:04:00 librepoller librenms-service.py[1421]:   File "/opt/librenms/LibreNMS/service.py", line 721, in log_performance_stats
Mar 12 09:04:00 librepoller librenms-service.py[1421]:     sum([getattr(self, ''.join([worker_type, '_manager'])).get_queue(group).qsize() for group in self.config.group]),
Mar 12 09:04:00 librepoller librenms-service.py[1421]:   File "/opt/librenms/LibreNMS/service.py", line 721, in <listcomp>
Mar 12 09:04:00 librepoller librenms-service.py[1421]:     sum([getattr(self, ''.join([worker_type, '_manager'])).get_queue(group).qsize() for group in self.config.group]),
Mar 12 09:04:00 librepoller librenms-service.py[1421]:   File "/opt/librenms/LibreNMS/__init__.py", line 254, in qsize
Mar 12 09:04:00 librepoller librenms-service.py[1421]:     return self._redis.llen(self.key)
Mar 12 09:04:00 librepoller librenms-service.py[1421]:   File "/usr/local/lib/python3.6/dist-packages/redis/client.py", line 1603, in llen
Mar 12 09:04:00 librepoller librenms-service.py[1421]:     return self.execute_command('LLEN', name)
Mar 12 09:04:00 librepoller librenms-service.py[1421]:   File "/usr/local/lib/python3.6/dist-packages/redis/client.py", line 775, in execute_command
Mar 12 09:04:00 librepoller librenms-service.py[1421]:     return self.parse_response(connection, command_name, **options)
Mar 12 09:04:00 librepoller librenms-service.py[1421]:   File "/usr/local/lib/python3.6/dist-packages/redis/client.py", line 789, in parse_response
Mar 12 09:04:00 librepoller librenms-service.py[1421]:     response = connection.read_response()
Mar 12 09:04:00 librepoller librenms-service.py[1421]:   File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 636, in read_response
Mar 12 09:04:00 librepoller librenms-service.py[1421]:     raise e
Mar 12 09:04:00 librepoller librenms-service.py[1421]:   File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 633, in read_response
Mar 12 09:04:00 librepoller librenms-service.py[1421]:     response = self._parser.read_response()
Mar 12 09:04:00 librepoller librenms-service.py[1421]:   File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 291, in read_response
Mar 12 09:04:00 librepoller librenms-service.py[1421]:     response = self._buffer.readline()
Mar 12 09:04:00 librepoller librenms-service.py[1421]:   File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 223, in readline
Mar 12 09:04:00 librepoller librenms-service.py[1421]:     self._read_from_socket()
Mar 12 09:04:00 librepoller librenms-service.py[1421]:   File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 198, in _read_from_socket
Mar 12 09:04:00 librepoller librenms-service.py[1421]:     (e.args,))
Mar 12 09:04:00 librepoller librenms-service.py[1421]: redis.exceptions.ConnectionError: Error while reading from socket: ('Connection closed by server.',)

Restarting via sudo systemctl restart librenms on ONE of the pollers successfully kicks polling off on all four servers. Nothing else needs to be touched to get polling to kick back off successfully.

I’m only running one redis instance, and it isn’t clustered. I’ll be trying that later to see if that helps provide more stability. Other than that, any ideas as to what may be causing my problem?

Thanks much-

** Edit - Here is the portion of the logs from /var/log/syslog around the failure time:

Mar 11 22:05:32 librepoller librenms-service.py[1421]: Discovery_0-19(INFO):Discovery complete 2025
Mar 11 22:05:32 librepoller librenms-service.py[1421]: Discovery_0-13(INFO):Discovery complete 2030
Mar 11 22:05:32 librepoller librenms-service.py[1421]: Discovery_0-16(INFO):Discovery complete 2023
Mar 11 22:05:38 librepoller librenms-service.py[1421]: Discovery_0-21(INFO):Discovery complete 1992
Mar 11 22:05:41 librepoller librenms-service.py[1421]: Discovery_0-23(INFO):Discovery complete 1986
Mar 11 22:05:46 librepoller librenms-service.py[1421]: Discovery_0-15(INFO):Discovery complete 1912
Mar 11 22:06:00 librepoller librenms-service.py[1421]: Billing(INFO):Calculating billing
Mar 11 22:06:00 librepoller librenms-service.py[1421]: Billing(INFO):Calculating billing complete
Mar 11 22:06:00 librepoller librenms-service.py[1421]: performance(INFO):Counting up time spent polling
Mar 11 22:07:00 librepoller librenms-service.py[1421]: Billing(INFO):Calculating billing
Mar 11 22:07:00 librepoller librenms-service.py[1421]: Billing(INFO):Calculating billing complete
Mar 11 22:07:00 librepoller librenms-service.py[1421]: performance(INFO):Counting up time spent polling
Mar 11 22:08:00 librepoller librenms-service.py[1421]: Billing(INFO):Calculating billing
Mar 11 22:08:00 librepoller librenms-service.py[1421]: Billing(INFO):Calculating billing complete
Mar 11 22:08:00 librepoller librenms-service.py[1421]: performance(INFO):Counting up time spent polling
Mar 11 22:09:00 librepoller librenms-service.py[1421]: Billing(INFO):Calculating billing
Mar 11 22:09:00 librepoller librenms-service.py[1421]: Billing(INFO):Calculating billing complete
Mar 11 22:09:00 librepoller librenms-service.py[1421]: performance(INFO):Counting up time spent polling
Mar 11 22:09:01 librepoller CRON[31489]: (root) CMD (  [ -x /usr/lib/php/sessionclean ] && if [ ! -d /run/systemd/system ]; then /usr/lib/php/sessionclean; fi)
Mar 11 22:09:03 librepoller systemd[1]: Starting Clean php session files...
Mar 11 22:09:03 librepoller systemd[1]: Started Clean php session files.
Mar 11 22:10:00 librepoller librenms-service.py[1421]: Billing(INFO):Calculating billing
Mar 11 22:10:00 librepoller librenms-service.py[1421]: Billing(INFO):Calculating billing complete
Mar 11 22:10:00 librepoller librenms-service.py[1421]: Billing(INFO):Polling billing
Repeat every minute until...
Mar 11 22:59:00 librepoller librenms-service.py[1421]: performance(INFO):Counting up time spent polling
Mar 11 23:00:00 librepoller librenms-service.py[1421]: Exception in thread Thread-6:
Mar 11 23:00:00 librepoller librenms-service.py[1421]: Traceback (most recent call last):
Mar 11 23:00:00 librepoller librenms-service.py[1421]:   File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
Mar 11 23:00:00 librepoller librenms-service.py[1421]:     self.run()
Mar 11 23:00:00 librepoller librenms-service.py[1421]:   File "/usr/lib/python3.6/threading.py", line 864, in run
Mar 11 23:00:00 librepoller librenms-service.py[1421]:     self._target(*self._args, **self._kwargs)
Mar 11 23:00:00 librepoller librenms-service.py[1421]:   File "/opt/librenms/LibreNMS/__init__.py", line 99, in _loop
Mar 11 23:00:00 librepoller librenms-service.py[1421]:     self.target()
Mar 11 23:00:00 librepoller librenms-service.py[1421]:   File "/opt/librenms/LibreNMS/service.py", line 348, in dispatch_discovery
Mar 11 23:00:00 librepoller librenms-service.py[1421]:     devices = self.fetch_device_list()
Mar 11 23:00:00 librepoller librenms-service.py[1421]:   File "/opt/librenms/LibreNMS/service.py", line 466, in fetch_device_list
Mar 11 23:00:00 librepoller librenms-service.py[1421]:     return self._discovery_db.query("SELECT `device_id`, `poller_group` FROM `device$
Mar 11 23:00:00 librepoller librenms-service.py[1421]:   File "/opt/librenms/LibreNMS/__init__.py", line 82, in query
Mar 11 23:00:00 librepoller librenms-service.py[1421]:     cursor.execute(query, args)
Mar 11 23:00:00 librepoller librenms-service.py[1421]:   File "/usr/local/lib/python3.6/dist-packages/pymysql/cursors.py", line 170, in exe$
Mar 11 23:00:00 librepoller librenms-service.py[1421]:     result = self._query(query)
Mar 11 23:00:00 librepoller librenms-service.py[1421]:   File "/usr/local/lib/python3.6/dist-packages/pymysql/cursors.py", line 328, in _qu$
Mar 11 23:00:00 librepoller librenms-service.py[1421]:     conn.query(q)
Mar 11 23:00:00 librepoller librenms-service.py[1421]:   File "/usr/local/lib/python3.6/dist-packages/pymysql/connections.py", line 517, in$
Mar 11 23:00:00 librepoller librenms-service.py[1421]:     self._affected_rows = self._read_query_result(unbuffered=unbuffered)
Mar 11 23:00:00 librepoller librenms-service.py[1421]:   File "/usr/local/lib/python3.6/dist-packages/pymysql/connections.py", line 732, in$
Mar 11 23:00:00 librepoller librenms-service.py[1421]:     result.read()
Mar 11 23:00:00 librepoller librenms-service.py[1421]:   File "/usr/local/lib/python3.6/dist-packages/pymysql/connections.py", line 1075, i$
Mar 11 23:00:00 librepoller librenms-service.py[1421]:     first_packet = self.connection._read_packet()
Mar 11 23:00:00 librepoller librenms-service.py[1421]:   File "/usr/local/lib/python3.6/dist-packages/pymysql/connections.py", line 657, in$
Mar 11 23:00:00 librepoller librenms-service.py[1421]:     packet_header = self._read_bytes(4)
Mar 11 23:00:00 librepoller librenms-service.py[1421]:   File "/usr/local/lib/python3.6/dist-packages/pymysql/connections.py", line 707, in$
Mar 11 23:00:00 librepoller librenms-service.py[1421]:     CR.CR_SERVER_LOST, "Lost connection to MySQL server during query")
Mar 11 23:00:00 librepoller librenms-service.py[1421]: pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')
Mar 11 23:00:00 librepoller librenms-service.py[1421]: Billing(INFO):Polling billing
Mar 11 23:00:00 librepoller librenms-service.py[1421]: Billing(INFO):Polling billing complete

Could it be related to this ? https://github.com/andymccurdy/redis-py/issues/1127

I think librenms uses redis-py for distributed polling, so you could check what version you are on apparently something is fixed in 3.2.0.

You could also experiment with some of the retry_on_timeout and timeout suggestions.

Thanks Chas. I’m upgrading from 3.1.0 to 3.2.0 on my pollers. I’ll see if that helps!

No luck with 3.2.0. Again the pollers all died with the same messages. I’ve decided to try a less elegant solution.
This line is constantly repeated in the logs when she hits a failure state. I’m going to try simply tossing an exit() after it. Theoretically the systemd unit for the service should then automagically restart the service on its own.

If you haven’t heard back from me for two weeks, you can probably safely assume this workaround was successful for me.

2 Likes

For anyone that comes across this, I still haven’t gotten a working solution. exit() didn’t work, and neither did self.restart(). I’ve been trying some redis timeout setting tweaks, but so far haven’t found a winning combo to resolve the issue.

Any other suggestions are welcome.

I havent had any luck with tweaking redis timeouts, or letting the systemd unit restart the script after forcing it exit on a fail condition… What I did have was luck with writing my own script to handle it. For those curious:

mkdir libreservice && cd libreservice
sudo apt-get -y install python3 python3-venv python3-dev git
python3 -m venv venv
source venv/bin/activate
pip3 install requests python-dotenv
Optional - nano .env #If you want notifications sent to slack. Otherwise delete all mentions of it in the script.

SLACKHOOK=<your slack webhook url>
SLACKUSERNAME=<slack username you prefer to use>
SLACKCHANNEL=<slack channel you want notifications sent to>
SLACKTITLEURL=<url/of/your/libre/webfrontend>/pollers/tab=log/filter=unpolled/

nano restartlibre.py and populate with the following:

#/path/to/your/dir/libreservice/venv/bin/python3

import subprocess, os, time
import requests, json
from time import sleep
from datetime import datetime
from dotenv import load_dotenv

fpath = os.getcwd()
load_dotenv(os.path.join(fpath, '.env'))

class bcolors:
    # Used purely to change colors of printed messages.
    HEADER = '\033[95m' #purple
    OKBLUE = '\033[94m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m' #yellow
    FAIL = '\033[91m' #red
    ENDC = '\033[0m' #clears back to normal
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'


def check_if_running():
    check = subprocess.Popen("ps -eo pcpu,args | grep librenms-service.py | sort -k1 -r -n | head -1",
        shell=True, stdout=subprocess.PIPE).stdout
    output = check.read()
    clean = output.decode().strip()
    split = int(float(clean.split(' ')[0]))
    if split is not None and split > 0:
        return True
    else:
        return False


def restart_librenms_service():
    status = check_if_running()
    tries = 0
    while status == False:
        restart = subprocess.Popen("systemctl restart librenms",
            shell=True, stdout=subprocess.PIPE).stdout
        sleep(180)
        status = check_if_running()
        tries += 1
        if tries > 3:
            break
    if status == False:
        print(bcolors.FAIL+"Tried 4x to restart the service to no avail."+bcolors.ENDC)
        pretext = "The librenms service failed to be restarted."
        title = "Click here to see the unpolled devices log."
        message = "Tried to 4x to restart the service via script with no luck. Will require manual intervention."
        sendtoslack(pretext, title, message, 'bad') #remove if not using slack
    else:
        print(bcolors.OKGREEN+"Successfully restarted the service after "+str(tries)+ " tries."+bcolors.ENDC)
        sendtoslack("Librenms service successfully restarted.", "Poller Log", "Script successfully restarted the service", 'good') # remove if not using slack


def sendtoslack(pretext, title, slackmessage, status): #skip if you dont want slack notifications
    if status == 'good':
        color = "#41a636"
    elif status == 'bad':
        color = "#a63636"
    else:
        color = "#7a7a7a"
    webhook_url = os.environ.get('SLACKHOOK')
    iconurl = 'https://avatars2.githubusercontent.com/u/5791783?s=400&v=4'
    payload = {
        "channel": os.environ.get('SLACKCHANNEL'),
        "username": os.environ.get('SLACKUSERNAME'),
        "icon_url": iconurl,
        "attachments": [
            {
                "fallback": slackmessage,
                "color": color,
                "pretext": pretext,
                "author_name": os.environ.get('SLACKUSERNAME'),
                "author_link": os.environ.get('SLACKTITLEURL'),
                "author_icon": iconurl,
                "title": title,
                "title_link": os.environ.get('SLACKTITLEURL'),
                "text": slackmessage,
                "fields": [
                    {
                        "title": "Priority",
                        "value": "High",
                        "short": 'false'
                    }
                ],
                "thumb_url": iconurl,
                "footer": os.environ.get('SLACKUSERNAME'),
                "footer_icon": iconurl
            }
        ]
    }
    response = requests.post(
        webhook_url, data=json.dumps(payload),
        headers={'Content-Type': 'application/application/json'}
    )
    if response.status_code != 200:
        raise ValueError(
            'Request to slack returned an error %s, the response is:\n%s'
            % (response.status_code, response.text)
        )
    return response


def main():
    try:
        scriptstarttime = time.time()
        runstatus = check_if_running()
        if runstatus == True:
            print(bcolors.OKGREEN+"Librenms-service is running. Doing Nothing."+bcolors.ENDC)
        else:
            print(bcolors.WARNING+"Librenms-service isn't running. Attempting to restart it."+bcolors.ENDC)
            restart_librenms_service()
        scriptendtime = time.time()
        print(bcolors.OKBLUE+"[!] Script completed in : %.2f seconds" % float(scriptendtime-scriptstarttime)+bcolors.ENDC)
        exit()
    except KeyboardInterrupt:
        print(bcolors.FAIL+"\n[!!!] Ctrl + C Detected!"+bcolors.ENDC)
        print(bcolors.FAIL+"[XXX] Exiting script now.."+bcolors.ENDC)
        exit()

if __name__ == '__main__':
    try:
        main()
    except KeyboardInterrupt:
        print(bcolors.FAIL+"\n[!!!] Ctrl + C Detected!"+bcolors.ENDC)
        print(bcolors.FAIL+"[XXX] Exiting script now.."+bcolors.ENDC)
        exit()

Then create the cronjob for it.
sudo crontab -e
*/5 * * * * /path/to/dir/libreservice/venv/bin/python3 /path/to/dir/libreservice/restartlibre.py
Having this on just one of my pollers has been working to keep all four pollers up and running and not hitting the connection failed state. Hope this helps someone else.

- example slack notification output.