Librenms-service.py does not try to reconnect to Redis Sentinel after Redis cache failure

willhseitz · 5 March 2020 23:42

I recently configured Redis Sentinel to support distributed Redis and failover. However, I noticed an issue with the librenms-service.py polling service.

When a polling service is connected to a Redis cache whose address it obtained from Redis Sentinel, and that cache then becomes unavailable (network partition, host shutdown, etc.), the polling service does not seem to “re-ask” Redis Sentinel for another Redis cache. Instead, it keeps trying to connect to the failed server.

The polling service eventually times out with a Python “redis connection failed” error in syslog and stops polling devices. Here is a paste of the errors: https://p.libren.ms/view/8c7c6029. Restarting the librenms polling service does resolve the issue, as the service “asks” Redis Sentinel for another cache, and Sentinel sends it a healthy cache instance.

I’m curious how much extra overhead it would add to ask Sentinel for a fresh Redis instance before every job (polling, discovery, alerts, etc.). If that is too expensive, maybe set up a watchdog that checks the health (and master/slave status) of the connected Redis server and, if the check fails, re-asks Sentinel for a new instance.
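
To make the second idea concrete, here is a rough sketch of a "re-ask on failure" wrapper. It is not taken from librenms-service.py; `FailoverClient`, `resolve`, `connect`, and `retry_on` are hypothetical names made up for illustration. The sketch uses only the standard library: `resolve` stands in for whatever asks Sentinel for the current master address (with redis-py that could be something like `Sentinel.discover_master`), and `connect` stands in for opening a client to that address. With a real Redis client you would probably need to pass the library's own connection exception class in `retry_on`, since it may not derive from Python's built-in `ConnectionError`.

```python
from typing import Callable, Optional, Tuple, Type, TypeVar

Address = Tuple[str, int]
T = TypeVar("T")


class FailoverClient:
    """Hypothetical wrapper: re-resolve the cache address when a call fails."""

    def __init__(
        self,
        resolve: Callable[[], Address],          # assumed: asks Sentinel for the master
        connect: Callable[[Address], object],    # assumed: opens a client to that address
        max_attempts: int = 3,
        retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    ) -> None:
        self._resolve = resolve
        self._connect = connect
        self._max_attempts = max_attempts
        self._retry_on = retry_on
        self._conn: Optional[object] = None
        self.address: Optional[Address] = None

    def _reconnect(self) -> None:
        # Ask for a fresh address instead of reusing the one that just failed.
        self.address = self._resolve()
        self._conn = self._connect(self.address)

    def call(self, operation: Callable[[object], T]) -> T:
        """Run operation(conn); on a connection error, re-resolve and retry."""
        last_error: Optional[BaseException] = None
        for _ in range(self._max_attempts):
            try:
                if self._conn is None:
                    self._reconnect()
                return operation(self._conn)
            except self._retry_on as exc:
                last_error = exc
                self._conn = None  # drop the dead connection, force a re-resolve
        raise ConnectionError(
            "no healthy cache after %d attempts" % self._max_attempts
        ) from last_error
```

Because the address is only looked up again after a failure, the healthy path costs nothing extra per job, which is the trade-off compared with asking Sentinel before every poll.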