Issue with polled services not populating graph data

I have an interesting problem. I have a three-node LibreNMS distributed poller setup with the dispatcher service, with the nodes pointing to a Galera database backend (the latter could be relevant). We have a functioning Redis instance keeping it all coordinated as per the distributed poller documentation, validate.php says things are great, and everything seems to be great. We have different poller groups, and devices are only being polled by their designated pollers, so we know it's resilient and we can rely on it. We are not seeing any logging indicating lock issues or anything that could be Redis-related, nor anything database-related (to date).

The problem appears to be with services. We only just recently enabled the services functionality in the nav bar and specified the path to the Nagios plugins. On the face of it, when adding a service to a device, e.g. a simple curl check, the check eventually wakes up and goes green. If we stop the service being checked, it goes red, so we know the service is actually being actively polled; we cross-verified this with a packet dump. It even matches alert rules. The issue is that under the "details" view for the service, I see empty graphs.

I have a very similar replica environment that has just one poller at the moment; I haven't quite got to the stage of adding a second poller to perform some more testing. That environment also has a Galera backend and has distributed polling enabled, it's just not exercised since it's on its own. Services added to that stack work just like in our main environment, but they DO have populating graphs.

I have done quite a bit of troubleshooting today but I can't put my finger on what the issue might be. I've checked that it's not a trivial rrdcached issue; other "regular" SNMP stats are being graphed with no issue. It's just the services.

If we attempt to run service-wrapper.py manually (bearing in mind we are using the dispatcher, so since it handles services I would have thought we don't need a cron entry), it performs a poll and also updates the graphs.
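For reference, a sketch of the manual run described above (the path, the worker-count argument, and the exact script name are assumptions from a standard /opt/librenms install; current releases ship it as services-wrapper.py rather than service-wrapper.py):

```shell
# Path and worker-count argument are assumptions for illustration.
WRAPPER=/opt/librenms/services-wrapper.py

if [ -x "$WRAPPER" ]; then
    # Run one service-polling pass as the librenms user;
    # the numeric argument is the worker/thread count.
    sudo -u librenms "$WRAPPER" 8
else
    echo "wrapper not found at $WRAPPER"
fi
```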

Reaching out to the community in case anyone has pointers on how to troubleshoot this.

Are these set? (note mine is set to ‘false’)

```shell
$ lnms config:get service_services_enabled
false
$ lnms config:get service_services_workers
8
$ lnms config:get service_services_frequency
300
```

Also, if you change these settings you need to restart the dispatcher service for them to take effect.
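A minimal sketch of setting those keys and restarting, to be run on each poller (the config keys are the ones queried above; the systemd unit name librenms.service is an assumption, so check what your install uses):

```shell
# Enable service checks and set worker/frequency values, then restart
# the dispatcher so the new settings are picked up.
lnms config:set service_services_enabled true
lnms config:set service_services_workers 8
lnms config:set service_services_frequency 300
systemctl restart librenms.service   # unit name assumed
```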

Thanks for the pointers. I have checked using those commands (on all three pollers) and I see true, 8 and 300, which appear to be the defaults.

I have done some more troubleshooting. It appears the services are being checked, and they are also updating the RRD files (e.g. /opt/librenms/rrd/devicename/services-nn.rrd): we can tail the rrdcached journal and see update commands, we see rrdcached performing writes to the file, and rrdtool dump shows the inserts at their epoch times. Yet the values are all NaNs, apart from the times we manually run check_services.php (when real data does get added).
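One quick way to quantify this is to grep the rrdtool dump output for NaN rows versus real values. A minimal sketch with a sample excerpt inlined for illustration; in practice you would pipe `rrdtool dump /opt/librenms/rrd/devicename/services-nn.rrd` into the greps (the sample timestamps and values are made up):

```shell
# Sample dump rows inlined for illustration; real runs pipe `rrdtool dump` in.
dump_excerpt='<!-- 1704067500 --> <row><v>NaN</v></row>
<!-- 1704067800 --> <row><v>NaN</v></row>
<!-- 1704068100 --> <row><v>4.2000000000e-02</v></row>'

# Count rows where the stored value is NaN vs. a real number.
nan_rows=$(printf '%s\n' "$dump_excerpt" | grep -c 'NaN')
real_rows=$(printf '%s\n' "$dump_excerpt" | grep -c '<v>[0-9]')
echo "NaN=$nan_rows real=$real_rows"   # → NaN=2 real=1
```

In our case nearly every row was NaN, with real values only at the timestamps where check_services.php had been run by hand.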

Still troubleshooting, but I've completely run out of time today, so I'll post an update when I know more.

Another symptom that differs between the working lab and this problem environment: under "actioned" in the distributed poller stats, we see 0 for the services field. The lab shows the correct number of services. This environment temporarily shows the right number, but only after manually restarting librenms-service.

So, after some more troubleshooting, I stopped librenms-service on two of the pollers; they both went red, and the remaining one started to populate the RRD properly with real data. Re-enabling the additional pollers didn't regress anything. I'm not too sure what that fixed if I'm honest, and I wish I had tried that step beforehand. Anyway, all sorted now.
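For anyone hitting the same thing, the isolation test amounted to roughly this (the unit name librenms.service is an assumption; the post refers to it as librenms-service, so adjust to whatever your dispatcher unit is called):

```shell
# On all but one poller: stop the dispatcher so a single node runs services.
systemctl stop librenms.service     # run on poller2 and poller3

# Watch the surviving poller write real (non-NaN) values into the service RRDs,
# e.g. by tailing the rrdcached journal, then bring the others back one at a time.
systemctl start librenms.service    # run on poller2, then poller3
```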