Issues with service checks (using Nagios plugins) - service flapping/multiple events per polling cycle

I recently enabled Services (using Nagios plugins) to be able to monitor Cisco Smart Licensing, since it’s not natively supported in LibreNMS. I’m currently running LibreNMS 21.12.1 with patches from PRs 13663 and 14532 applied (I realize it’s old), in a distributed setup with rrdcached and redis, using the dispatcher service. I’m using the check_nwc_health plugin, installed on the master server and on all pollers currently being used for testing.

If I run the check manually (as the librenms user), everything works as expected from both pollers:

[librenms@distributedpoller01 ~]$ /usr/lib64/nagios/plugins/check_nwc_health -H <removed> --mode check-licenses --community <removed> --warning 30 --critical 15; echo $?       
OK - smart licensing is not enabled
0
[librenms@distributedpoller02 ~]$ /usr/lib64/nagios/plugins/check_nwc_health -H <removed> --mode check-licenses --community <removed> --warning 30 --critical 15; echo $?       
OK - smart licensing is not enabled
0

However, when I configure the service on a host, I’m constantly seeing the service state flapping, and multiple events per polling cycle:

2024-07-22 02:20:03	service	<removed>	Service 'nwc_health' changed status from OK to Critical - Check the installed licences/keys -
2024-07-22 02:20:03	service	<removed>	Service 'nwc_health' changed status from OK to Critical - Check the installed licences/keys -
2024-07-22 02:20:03	service	<removed>	Service 'nwc_health' changed status from OK to Critical - Check the installed licences/keys -
2024-07-22 02:20:03	service	<removed>	Service 'nwc_health' changed status from OK to Critical - Check the installed licences/keys -
2024-07-22 02:20:03	service	<removed>	Service 'nwc_health' changed status from OK to Critical - Check the installed licences/keys -
2024-07-22 02:20:02	service	<removed>	Service 'nwc_health' changed status from Critical to OK - Check the installed licences/keys - OK - smart licensing is not enabled

Every polling cycle seems to start with one successful poll, followed by one or more failed “polls” in the event log.

If I run ./check-services.php -d, it appears to work correctly (apart from a couple of warnings about undefined array keys, which seem to be related to the fact that the plugin returns no performance data):

[librenms@distributedpoller0 ~]$ ./check-services.php -d
DEBUG!
Starting service polling run:

SQL[SELECT D.*,S.*,attrib_value  FROM `devices` AS D INNER JOIN `services` AS S ON S.device_id = D.device_id AND D.disabled = 0  LEFT JOIN `devices_attribs` as A ON D.device_id 
= A.device_id AND A.attrib_type = "override_icmp_disable" ORDER by D.device_id DESC; [] 8.76ms] 


Nagios Service - 3
Request:  '/usr/lib64/nagios/plugins/check_nwc_health' '-H' '<removed>' '--mode' 'check-licenses' '--community' '<removed>' '--warning' '30' '--critical' '15' '-t' '60'

Warning: Undefined array key 1 in /opt/librenms/includes/services.inc.php on line 242

Warning: Undefined array key 1 in /opt/librenms/includes/services.inc.php on line 253
Perf Data - None.
Response: OK - smart licensing is not enabled
./check-services.php 2024-07-23 05:14:07 - 1 services polled in 1.094 secs

In an attempt to debug this, I wrote a quick wrapper around the Nagios plugin to get a tiny bit of extra logging.

check_licenses_wrapper.sh:

#!/bin/bash

# Write a per-invocation debug file (millisecond timestamp in the name).
TIMESTAMP=$(date +%s%N | cut -b1-13)
FILENAME="/tmp/check_licenses-$TIMESTAMP.debug"

# Run the real plugin, passing all arguments through unchanged.
OUTPUT=$(/usr/lib64/nagios/plugins/check_nwc_health "$@" 2>&1)
RETURN=$?

echo "ARGS: $*" >> "$FILENAME"
echo "OUTPUT: $OUTPUT" >> "$FILENAME"
echo "RETURN: $RETURN" >> "$FILENAME"

# Pass the plugin's output and exit code back to LibreNMS.
echo "$OUTPUT"
exit $RETURN
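
For reference, “deploying” the wrapper just means dropping it into the same nagios_plugins directory as check_nwc_health (on the master and on both pollers in the poller group) and making it executable; something along these lines:

# One way to install the wrapper next to the plugin; the directory matches the plugin path used above.
install -m 755 check_licenses_wrapper.sh /usr/lib64/nagios/plugins/check_licenses_wrapper.sh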

I then removed the old service, added a new one using the wrapper, and waited to see which files were created (the wrapper creates a new “debug” file each time it’s executed):

[librenms@distributedpoller01 ~]$ ls /tmp/check_licenses-*
ls: cannot access '/tmp/check_licenses-*': No such file or directory
[librenms@distributedpoller02 ~]$ ls -alh /tmp/check_licenses-*
-rw-r--r--. 1 librenms librenms 153 Jul 22 07:35 /tmp/check_licenses-1721633701255.debug

Checking the contents of the single debug file:

[librenms@distributedpoller02 ~]$ cat /tmp/check_licenses-1721633701255.debug
ARGS: -H <removed> --mode check-licenses --community <removed> --warning 30 --critical 15 -t 60
OUTPUT: OK - smart licensing is not enabled
RETURN: 0

So the check is running only once per polling cycle (and it’s successful), but I’m getting multiple events per polling cycle.

Does anyone know how to go about troubleshooting this? I’m nearly at my wits’ end trying to figure this out.

My ./validate.php:

./validate.php
====================================
Component | Version
--------- | -------
LibreNMS  | 21.12.1
DB Schema | 2021_02_09_122930_migrate_to_utf8mb4 (233)
PHP       | 8.0.30
Python    | 3.9.18
MySQL     | 10.11.5-MariaDB
RRDTool   | 1.7.2
SNMP      | 5.9.1
====================================

[OK]    Composer Version: 2.7.7
[OK]    Dependencies up-to-date.
[OK]    Database connection successful
[OK]    Database schema correct
[INFO]  Detected Dispatcher Service

One thing I found that’s interesting is:

MariaDB [librenms]> SELECT service_status, count(service_status) as count FROM services WHERE device_id = 929;
+----------------+-------+
| service_status | count |
+----------------+-------+
|            127 |     1 |
+----------------+-------+
1 row in set (0.001 sec)

If that’s an exit code, I believe it indicates “command not found”. That would explain why we only see one debug file from the wrapper script (we only see the file when the check succeeds), but it doesn’t explain where it’s trying to run the script from, or why it can’t find it.
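
For comparison, that’s exactly the status bash reports when the command itself can’t be found (the path below is deliberately a nonexistent one):

[librenms@distributedpoller01 ~]$ /usr/lib64/nagios/plugins/does_not_exist; echo $?
-bash: /usr/lib64/nagios/plugins/does_not_exist: No such file or directory
127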

I’ve never seen this particular plug-in before. Can you help us understand what this should look like? It seems odd that we’re running a plug-in to check licensing but it returns OK for “smart licensing is not enabled”. Is that OK?

It doesn’t look like very good Nagios output. Services in LibreNMS typically look for perf-type data from the plug-in, usually formatted something like:

STATUS: [value] [description]|'label'=value[UOM];[warn];[crit];[min];[max]

Example:

>check_tcp.exe -H somehost -p someport -w 100 -c 200
TCP OK - 0.000 second response time on port 22 |time=0.000483s;100;200;;

The fact that your polling output shows "Perf Data - None." leads me to believe there isn’t enough data being output there for LibreNMS to parse for charts.

The output from the plugin is perfectly legal and correct. Nagios plugins don’t have to include performance data (the part after the |), and as we see from my debug output, the return code (0) is correct as well.

The issue is the subsequent lines:

2024-07-22 02:20:03	service	<removed>	Service 'nwc_health' changed status from OK to Critical - Check the installed licences/keys -
2024-07-22 02:20:03	service	<removed>	Service 'nwc_health' changed status from OK to Critical - Check the installed licences/keys -
2024-07-22 02:20:03	service	<removed>	Service 'nwc_health' changed status from OK to Critical - Check the installed licences/keys -
2024-07-22 02:20:03	service	<removed>	Service 'nwc_health' changed status from OK to Critical - Check the installed licences/keys -
2024-07-22 02:20:03	service	<removed>	Service 'nwc_health' changed status from OK to Critical - Check the installed licences/keys -

where there is no text output. If the status code in the database (127) is to be believed, it would be because LibreNMS can’t find the script. That being said, I have no idea where those events are coming from. I double-checked that the host is correctly matched to this poller group, and there are only two pollers in the poller group.

To clarify, the plugin checks for a number of licensing issues (unable to register, authorization expired, trial expired, etc.). Smart Licensing not being enabled is indeed an OK condition; I’m just testing it against this router before I roll it out to a few thousand routers.

Okay, so your --warning and --critical arguments don’t do anything?

I like the idea of using the wrapper to troubleshoot, but I don’t quite follow the logic here:

> That would explain why we only see one debug file from the wrapper script (we only see the file when the check succeeds)

Why would we not see a log written by the wrapper every 5 minutes (or whatever your polling interval is), pass or fail? Has the wrapper been created and placed in the nagios_plugins directory on both the main WebUI server and the two pollers? Is the nagios_plugins directory the same everywhere?
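
Something as simple as this on the WebUI host and on each of the two pollers should confirm both scripts are present and executable (filenames taken from your posts; adjust the directory if your nagios_plugins path differs):

ls -l /usr/lib64/nagios/plugins/check_nwc_health /usr/lib64/nagios/plugins/check_licenses_wrapper.sh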

> Okay, so your --warning and --critical arguments don’t do anything?

Not in this particular case, no.

> Why would we not see a log written by the wrapper every 5 minutes (or whatever your polling interval is), pass or fail?

The thing is, I am seeing one (and only one) log from the wrapper every 5 minutes, but two or more events are generated every 5 minutes (six in the case above: one where the wrapper actually writes output, and five more “phantom” events where no output is passed at all). The problem is those phantom events within the same 5-minute polling cycle: I have no idea where they come from, and they definitely don’t call the wrapper (otherwise we would have a /tmp/check_licenses-<timestamp>.debug file for each of them).

> Has the wrapper been created and placed in the nagios_plugins directory on both the main WebUI server and the two pollers?

Yes. We have about 40 pollers, but only two in this particular poller group. Both the wrapper and the check_nwc_health plugin are present on both pollers, as well as on the main WebUI box.

What if you temporarily changed this logging line on your two expected pollers?

Make it something like…

"Service '{$service['service_type']}' (poller01) changed status from $old_status_text to $new_status_text - {$service['service_desc']} - $msg",

It might be slightly different on yours since you have the older patched version, but it might help track down the phantom events. You’d at least know whether one of those two pollers is running extra checks, or, if the added text isn’t there, that the events are coming from somewhere else.

I had actually done something similar, but took a different approach.

I put:

    $myfile = fopen("/tmp/newfile.txt", "a") or die("Unable to open file!");
    fwrite($myfile, date("h:i:sa") . " OLD STATUS: " . $old_status . "\n");
    fwrite($myfile, date("h:i:sa") . " NEW STATUS: " . $new_status . "\n");
    fclose($myfile);

right above the:

    if ($old_status != $new_status) {    

But NEW STATUS is always 0, indicating that polls from this poller were always successful. I will try your approach as well, though, since I’m completely out of ideas at this point and don’t know the LibreNMS architecture well enough to hunt this down further.

@slashdoom,

So, based on your suggestion, I went and modified services.inc.php on all my pollers. I was able to narrow the issue down to six legacy pollers that were never reconfigured when the setup was moved to the dispatcher service, so they were still polling via cron jobs.

This seemed to cause them to respond to service checks for all poller groups. I suspect the 127 exit code I was seeing in the DB was because they didn’t have the Nagios plugin installed.

I migrated them all to use the dispatcher service, and the problems with the service check seem to have gone away. I’ll keep an eye on it for a bit, but it looks like the problem is solved. It seems mixing dispatcher pollers and cron pollers is a Really Bad Idea™.
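
For anyone else who hits this: the fix on each legacy poller was essentially to remove the cron-based polling and enable the dispatcher service instead, roughly like this (the cron file and service unit names may differ depending on how your pollers were originally set up):

# Stop cron-based polling and switch this poller to the dispatcher service.
sudo rm /etc/cron.d/librenms
sudo systemctl enable --now librenms.service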

Thank you very much for your help with this, I’ve been bashing my head against this for a while and was quite frankly losing hope.

