Group of False Positives, Reachable but Won’t Transport Recovery

I recently had a bunch of monitored devices alert for various reasons, such as GPS Sync Drop and ICMP drop. However, I can ping these devices just fine and GPS sync is good. I can even ping these devices from the server with no issues.

SNMP walks work as well. Same with the Poller and Discovery debug. I’ve tried rebooting the server and the pollers and this has made no difference.

I was wondering if anyone has run into this and what I can do to clear it. I’m worried that if these ACK’d nodes actually go down I will not be notified.

What alert rules are you using? Post them, and also post the output of validate.php.

I posted the alert configs etc. but found out that they are unrelated to my issue, so I removed them from the thread. I had a more experienced tech look at this, and it appears that ever since an automatic update LibreNMS is not seeing some of our modules, specifically the ones for GPS timing and AP traffic monitoring. We confirmed that the modules are still located in the directories.

The validate output still helps.

====================================

Component Version
LibreNMS 1.52-70-gf3ba8947f
DB Schema 2019_05_30_225937_device_groups_rewrite (135)
PHP 7.2.14-1+0~20190205200805.15+stretch~1.gbpd83c69
MySQL 10.1.26-MariaDB-0+deb9u1
RRDTool 1.6.0
SNMP NET-SNMP 5.7.3

====================================

[OK] Composer Version: 1.8.6
[OK] Dependencies up-to-date.
[OK] Database connection successful
[FAIL] Database: extra table (vw_alertlog_updown)
[FAIL] We have detected that your database schema may be wrong, please report the following to us on Discord (https://t.libren.ms/discord) or the community site (https://t.libren.ms/5gscd):
[FIX]:
Run the following SQL statements to fix.
SQL Statements:
DROP TABLE vw_alertlog_updown;
[FAIL] The poller (cerento012) has not completed within the last 5 minutes, check the cron job.
[FAIL] Discovery has not completed in the last 24 hours.
[FIX]:
Check the cron job to make sure it is running and using discovery-wrapper.py
[WARN] Your local git contains modified files, this could prevent automatic updates.
[FIX]:
You can fix this with ./scripts/github-remove
Modified Files:
includes/definitions/discovery/pmp.yaml

It looks like LibreNMS is unable to find the files placed in the custom plugin directory. The permissions on the files are set to read/execute and the files are intact. This started happening after an update on 6/23/2019.

librenms@cerento010:~$ ./check-services.php -d
DEBUG!
Starting service polling run:

SQL[SELECT D.*,S.*,attrib_value FROM devices AS D INNER JOIN services AS S ON S.device_id = D.device_id AND D.disabled = 0 LEFT JOIN devices_attribs as A ON D.device_id = A.device_id AND A.attrib_type = "override_icmp_disable" ORDER by D.device_id DESC; [] 5.68ms]

Nagios Service - 96
Request: ‘/usr/lib/nagios/plugins/check_drobo’ ‘-H’ ‘172.16.4.55’
Can’t exec “curl”: No such file or directory at /usr/lib/nagios/plugins/check_drobo line 11.
Use of uninitialized value $content in split at /usr/lib/nagios/plugins/check_drobo line 15.
Use of uninitialized value $status in concatenation (.) or string at /usr/lib/nagios/plugins/check_drobo line 49.
Perf Data - None.
Perf Data - DS: status, Value: , UOM:
Perf Data - DS: bad, Value: 0, UOM:
Response:
Service DS: {
“status”: “”,
“bad”: “”
}
RRD[last 172.16.4.55/services-96.rrd --daemon libresql.cerento.com:42217]
RRD[update 172.16.4.55/services-96.rrd N:U:0 --daemon libresql.cerento.com:42217]
SQL[SELECT devices.*, location, lat, lng FROM devices LEFT JOIN locations ON devices.location_id=locations.id WHERE device_id = ? [3906] 1.56ms]

SQL[SELECT * FROM devices_attribs WHERE device_id = ? [3906] 0.86ms]

SQL[SELECT * FROM vrf_lite_cisco WHERE device_id = ? [3906] 1.21ms]

SQL[INSERT IGNORE INTO eventlog (device_id,reference,type,datetime,severity,message,username) VALUES (:device_id,:reference,:type,:datetime,:severity,:message,:username) {“device_id”:3906,“reference”:96,“type”:“service”,“datetime”:“2019-06-24 12:44:34”,“severity”:4,“message”:“Service ‘drobo’ changed status from Critical to OK - - “,“username”:””} 2.63ms]

SQL[UPDATE services set service_changed=?,service_status=?,service_message=? WHERE service_id=? [1561401874,0,"",96] 1.99ms]

Nagios Service - 92
Request: ‘/check_timing_status_365’ ‘10.10.190.226’
sh: 1: /check_timing_status_365: not found
Perf Data - None.
Response:

Nagios Service - 93
Request: ‘/check_pmp450_cpu’ ‘10.10.190.226’
sh: 1: /check_pmp450_cpu: not found
Perf Data - None.
Response:

Should I start a new topic with the title “Custom services broke after latest update”?

I followed the steps on this URL: Broken auto-updater (Manual intervention required)

I was hoping that the issue was related, but I still encounter the problem.

Is there anything else that you need from me? I’m still experiencing this issue.

I’ve rebooted the server and the polling servers again and this made no difference. When I run the following command, the output states that the plugins are not found when they are in fact there. I confirmed that they have execute permissions. Here are just a few, but basically every nagios plugin is listed.

su - librenms
librenms@cerento010:~$ ./check-services.php -D
Starting service polling run:

Can’t exec “curl”: No such file or directory at /usr/lib/nagios/plugins/check_drobo line 11.
Use of uninitialized value $content in split at /usr/lib/nagios/plugins/check_drobo line 15.
Use of uninitialized value $status in concatenation (.) or string at /usr/lib/nagios/plugins/check_drobo line 49.
sh: 1: /check_timing_status_365: not found
sh: 1: /check_pmp450_cpu: not found
sh: 1: /check_timing_status: not found
sh: 1: /check_pmp450_cpu: not found
sh: 1: /check_timing_status: not found
sh: 1: /check_pmp450_cpu: not found
sh: 1: /check_timing_status_365: not found
sh: 1: /check_pmp450_cpu: not found
sh: 1: /check_pmp450_cpu: not found
sh: 1: /check_pmp450_cpu: not found

I would wager it’s something to do with that nagios plug-in.

But these plugins have all been working for months. They only decided to break after a LibreNMS auto-update.

How do you know for sure it was an update that broke it? And if so what update?

All the custom services we had setup with the nagios-style scripts had been working for close to a year. Just after midnight on 6-23-19 they paged out and we started seeing these errors. Our daily.sh runs at midnight and the pages came in shortly thereafter all at the same time. We confirmed GPS sync was good on the devices themselves, which is how we know the update is what broke it.

Yes I get that but what update?
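
The local git history should be able to answer that. A rough sketch, assuming the install is a git checkout (the validate output mentions a local git) and that it lives in /opt/librenms, which is only a guess at your install directory; the function name is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: list when HEAD moved in a git checkout, with timestamps,
# so the pull done by daily.sh around a given date can be identified.
show_head_moves() {
    repo_dir=$1
    git -C "$repo_dir" reflog --date=iso
}

# /opt/librenms is an assumed install path; replace it with the real one.
if [ -d /opt/librenms/.git ]; then
    # Keep only the entries from the night the pages came in.
    show_head_moves /opt/librenms | grep '2019-06-2[23]' || true
fi
```

With the two commit hashes on either side of that pull, `git log --oneline OLD..NEW` would then list what actually came in.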

Past that I couldn’t tell you. Whatever daily.sh pulled in for changes at midnight on 6-23-19 would be the culprit.

You should probably start by searching for the issues described in the errors. This looks more like a PATH or shell issue.
For instance, ‘curl’ is not found. So either curl is missing and you should re-install it, or the PATH variable is broken (and this is probably a shell issue in the LibreNMS user’s home directory).
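
A quick way to tell those two cases apart, as a minimal sketch. The function name is made up for illustration, and it should be run as the librenms user (for example after `su - librenms`), since that is the environment the checks run in:

```shell
#!/bin/sh
# Hypothetical helper: report whether a command is visible on the current PATH.
check_on_path() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $(command -v "$1")"
    else
        echo "missing: $1 is not on PATH ($PATH)"
        return 1
    fi
}

# "missing" here means curl is either not installed or not reachable
# through this user's PATH.
check_on_path curl || true
```

If curl shows up for root but not for the librenms user, the PATH is the likelier suspect; if it shows up for neither, the package is probably gone.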

I will track that one down for sure with the one plugin, my concern is with the other plugins:

sh: 1: /check_timing_status_365: not found
sh: 1: /check_pmp450_cpu: not found
sh: 1: /check_timing_status: not found
sh: 1: /check_pmp450_cpu: not found
sh: 1: /check_timing_status: not found
sh: 1: /check_pmp450_cpu: not found
sh: 1: /check_timing_status_365: not found
sh: 1: /check_pmp450_cpu: not found
sh: 1: /check_pmp450_cpu: not found
sh: 1: /check_pmp450_cpu: not found

While the check status says these files are not found, I assure you they are there and have proper permissions. We made NO changes to the OS on the server when this occurred, so again I firmly believe it was the updates done on 6-23-19 by daily.sh.

As it seems that it is not a general issue, you’ll have to dig a little bit more to understand what’s going on.
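
One thing worth noticing in your debug output: check_drobo is requested with its full path under /usr/lib/nagios/plugins, while the failing checks are requested as /check_timing_status_365 and /check_pmp450_cpu, with nothing in front of the slash. Below is a minimal sketch of how an empty directory value would produce exactly that. It assumes the command is built as directory + "/check_" + service type, which is a guess at how the path is assembled and not something confirmed in this thread; the function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical illustration: join a plugin directory and a service type the
# way the Request lines in the debug output appear to be assembled.
build_check_path() {
    plugin_dir=$1
    service_type=$2
    printf '%s/check_%s\n' "$plugin_dir" "$service_type"
}

# Report whether the assembled path points at an executable file.
describe_check() {
    path=$(build_check_path "$1" "$2")
    if [ -x "$path" ]; then
        echo "ok: $path"
    else
        echo "not found or not executable: $path"
    fi
}

build_check_path "/usr/lib/nagios/plugins" "drobo"   # prints /usr/lib/nagios/plugins/check_drobo
build_check_path "" "pmp450_cpu"                     # prints /check_pmp450_cpu
describe_check "" "pmp450_cpu"
```

If that is what is happening, the files can be present with correct permissions and still be reported as not found, because the path being executed is a different one from where they live.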

Given that we made NO other changes to the servers I have a hard time swallowing that, but I will check.