Group of False Positives, Reachable but won’t Transport Recovery

Curt_Cooper · 24 June 2019 01:11

I recently had a bunch of monitored devices alert for various reasons: GPS Sync Drop, ICMP drop. However, I can ping these devices just fine and GPS sync is good. I can even ping these devices from the server with no issues.

SNMP walks work as well. Same with the Poller and Discovery debug. I’ve tried rebooting the server and the pollers and this has made no difference.

I was wondering if anyone has run into this and what I can do to clear it. I’m worried that if these ACK’d nodes actually go down I will not be notified.

Kevin_Krumm · 24 June 2019 02:35

What alert rules are you using? Post them, and also post your validation output from running validate.php.

Curt_Cooper · 24 June 2019 15:21

I posted the alert configs etc. but found out that these are unrelated to my issue, so I removed them from the thread. I had a more experienced tech look at this, and it would appear that ever since an automatic update LibreNMS is not seeing some of our modules, specifically the ones for GPS timing and AP traffic monitoring. We confirmed that the modules are still located in the directories.

Kevin_Krumm · 24 June 2019 15:42

The validate output still helps.

Curt_Cooper · 24 June 2019 16:30

====================================

Component	Version
LibreNMS	1.52-70-gf3ba8947f
DB Schema	2019_05_30_225937_device_groups_rewrite (135)
PHP	7.2.14-1+0~20190205200805.15+stretch~1.gbpd83c69
MySQL	10.1.26-MariaDB-0+deb9u1
RRDTool	1.6.0
SNMP	NET-SNMP 5.7.3

====================================

[OK] Composer Version: 1.8.6
[OK] Dependencies up-to-date.
[OK] Database connection successful
[FAIL] Database: extra table (vw_alertlog_updown)
[FAIL] We have detected that your database schema may be wrong, please report the following to us on Discord (https://t.libren.ms/discord) or the community site (https://t.libren.ms/5gscd):
[FIX]:
Run the following SQL statements to fix.
SQL Statements:
DROP TABLE vw_alertlog_updown;
[FAIL] The poller (cerento012) has not completed within the last 5 minutes, check the cron job.
[FAIL] Discovery has not completed in the last 24 hours.
[FIX]:
Check the cron job to make sure it is running and using discovery-wrapper.py
[WARN] Your local git contains modified files, this could prevent automatic updates.
[FIX]:
You can fix this with ./scripts/github-remove
Modified Files:
includes/definitions/discovery/pmp.yaml

Curt_Cooper · 24 June 2019 18:47

It looks like LibreNMS is unable to find the files placed in the custom plugin directory. The permissions on the files are set to read/execute and the files are intact. This started happening after an update on 6/23/2019.

librenms@cerento010:~$ ./check-services.php -d
DEBUG!
Starting service polling run:

SQL[SELECT D.*,S.*,attrib_value FROM devices AS D INNER JOIN services AS S ON S.device_id = D.device_id AND D.disabled = 0 LEFT JOIN devices_attribs as A ON D.device_id = A.device_id AND A.attrib_type = "override_icmp_disable" ORDER by D.device_id DESC; [] 5.68ms]

Nagios Service - 96
Request: ‘/usr/lib/nagios/plugins/check_drobo’ ‘-H’ ‘172.16.4.55’
Can’t exec “curl”: No such file or directory at /usr/lib/nagios/plugins/check_drobo line 11.
Use of uninitialized value $content in split at /usr/lib/nagios/plugins/check_drobo line 15.
Use of uninitialized value $status in concatenation (.) or string at /usr/lib/nagios/plugins/check_drobo line 49.
Perf Data - None.
Perf Data - DS: status, Value: , UOM:
Perf Data - DS: bad, Value: 0, UOM:
Response:
Service DS: {
“status”: “”,
“bad”: “”
}
RRD[last 172.16.4.55/services-96.rrd --daemon libresql.cerento.com:42217]
RRD[update 172.16.4.55/services-96.rrd N:U:0 --daemon libresql.cerento.com:42217]
SQL[SELECT devices.*, location, lat, lng FROM devices LEFT JOIN locations ON devices.location_id=locations.id WHERE device_id = ? [3906] 1.56ms]

SQL[SELECT * FROM devices_attribs WHERE device_id = ? [3906] 0.86ms]

SQL[SELECT * FROM vrf_lite_cisco WHERE device_id = ? [3906] 1.21ms]

SQL[INSERT IGNORE INTO eventlog (device_id,reference,type,datetime,severity,message,username) VALUES (:device_id,:reference,:type,:datetime,:severity,:message,:username) {“device_id”:3906,“reference”:96,“type”:“service”,“datetime”:“2019-06-24 12:44:34”,“severity”:4,“message”:“Service ‘drobo’ changed status from Critical to OK - - “,“username”:””} 2.63ms]

SQL[UPDATE services set service_changed=?,service_status=?,service_message=? WHERE service_id=? [1561401874,0,"",96] 1.99ms]

Nagios Service - 92
Request: ‘/check_timing_status_365’ ‘10.10.190.226’
sh: 1: /check_timing_status_365: not found
Perf Data - None.
Response:

Nagios Service - 93
Request: ‘/check_pmp450_cpu’ ‘10.10.190.226’
sh: 1: /check_pmp450_cpu: not found
Perf Data - None.
Response:

Curt_Cooper · 25 June 2019 18:37

Should I start a new topic with the title “Custom services broke after latest update”?

Curt_Cooper · 27 June 2019 18:04

I followed the steps on this URL: Broken auto-updater (Manual intervention required)

I was hoping that the issue was related, but I still encounter the problem.

Curt_Cooper · 1 July 2019 17:21

Is there anything else that you need from me? I’m still experiencing this issue.

Curt_Cooper · 3 July 2019 16:47

I’ve rebooted the server and the polling servers again and this made no difference. When I run the following command, the output states that the files are not found when they are in fact there. I confirmed that they have execute permissions. Here are just a few, but basically every nagios plugin is listed.

su - librenms
librenms@cerento010:~$ ./check-services.php -D
Starting service polling run:

Can’t exec “curl”: No such file or directory at /usr/lib/nagios/plugins/check_drobo line 11.
Use of uninitialized value $content in split at /usr/lib/nagios/plugins/check_drobo line 15.
Use of uninitialized value $status in concatenation (.) or string at /usr/lib/nagios/plugins/check_drobo line 49.
sh: 1: /check_timing_status_365: not found
sh: 1: /check_pmp450_cpu: not found
sh: 1: /check_timing_status: not found
sh: 1: /check_pmp450_cpu: not found
sh: 1: /check_timing_status: not found
sh: 1: /check_pmp450_cpu: not found
sh: 1: /check_timing_status_365: not found
sh: 1: /check_pmp450_cpu: not found
sh: 1: /check_pmp450_cpu: not found
sh: 1: /check_pmp450_cpu: not found

Kevin_Krumm · 3 July 2019 20:27

I would wager it’s something to do with that nagios plug-in.

Curt_Cooper · 3 July 2019 21:49

But these plugins have all been working for months. They only decided to break after a LibreNMS auto-update.

Kevin_Krumm · 4 July 2019 00:09

How do you know for sure it was an update that broke it? And if so what update?

Brandon_Shiers · 5 July 2019 14:33

All the custom services we had set up with the nagios-style scripts had been working for close to a year. Just after midnight on 6-23-19 they paged out and we started seeing these errors. Our daily.sh runs at midnight and the pages came in shortly thereafter, all at the same time. We confirmed GPS sync was good on the devices themselves, which is how we know the update is what broke it.

Kevin_Krumm · 5 July 2019 15:23

Yes I get that but what update?

Brandon_Shiers · 5 July 2019 16:43

Past that I couldn’t tell you. Whatever daily.sh pulled in for changes at midnight on 6-23-19 would be the culprit.
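
One way to narrow down what the midnight run pulled in, sketched under assumptions: the install is a git checkout (validate.php reports modified git files, which suggests it is), and git's reflog still holds the entries from that night. `LIBRENMS_DIR` and `show_head_moves` are hypothetical names, and the `/opt/librenms` default is an assumption about the install path.

```shell
#!/bin/sh
# Print when the local HEAD moved, with timestamps, so that an update
# pulled by a nightly job can be matched to specific commits.
# show_head_moves is a hypothetical helper name, not part of LibreNMS.
show_head_moves() {
    git -C "$1" reflog --date=iso
}

# Assumed install location; override by exporting LIBRENMS_DIR.
LIBRENMS_DIR="${LIBRENMS_DIR:-/opt/librenms}"

if [ -d "$LIBRENMS_DIR/.git" ]; then
    show_head_moves "$LIBRENMS_DIR"
else
    echo "no git checkout at $LIBRENMS_DIR"
fi
```

Entries stamped shortly after midnight on the night in question would give the before and after commit hashes; `git log` or `git diff` between those two hashes then lists what changed.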

PipoCanaja · 6 July 2019 15:31

You should probably start by searching for the issues that are described in the error. This seems more like a PATH or shell issue.
For instance, ‘curl’ is not found. So either curl is missing and you should re-install it, or the PATH variable is broken (and this is probably a shell issue in the LibreNMS user’s homedir).
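
A minimal sketch of the check suggested here, assuming it is run as the librenms user (for example after `su - librenms`). The function name `check_cmd_on_path` is hypothetical; `command -v` is the standard POSIX way to ask the shell where a command resolves.

```shell
#!/bin/sh
# Report whether a command resolves on the current PATH.
# check_cmd_on_path is a hypothetical helper name, not part of LibreNMS.
check_cmd_on_path() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at: $(command -v "$1")"
        return 0
    fi
    echo "$1 NOT found on PATH"
    return 1
}

# Show the PATH this shell actually sees, then test for curl.
echo "PATH=$PATH"
check_cmd_on_path curl || echo "install curl or fix PATH for this user"
```

Cron jobs often run with a shorter PATH than a login shell, so a result from an interactive session may differ from what the poller sees.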

Brandon_Shiers · 7 July 2019 20:49

I will track that one down for sure with the one plugin, my concern is with the other plugins:

sh: 1: /check_timing_status_365: not found
sh: 1: /check_pmp450_cpu: not found
sh: 1: /check_timing_status: not found
sh: 1: /check_pmp450_cpu: not found
sh: 1: /check_timing_status: not found
sh: 1: /check_pmp450_cpu: not found
sh: 1: /check_timing_status_365: not found
sh: 1: /check_pmp450_cpu: not found
sh: 1: /check_pmp450_cpu: not found
sh: 1: /check_pmp450_cpu: not found

While the check status says these files are not found, I assure you they are there and have proper permissions. We made NO changes to the OS on the server when this occurred, so again I firmly believe it was the updates done on 6-23-19 by daily.sh.

PipoCanaja · 7 July 2019 21:22

As it seems that it is not a general issue, you’ll have to dig a little bit more to understand what’s going on.

Could you check the service documentation and validate that you don’t have anything missing?
– https://docs.librenms.org/Extensions/Services/
PATH being wrong, as already suggested, is a potential explanation.
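
One detail in the earlier debug output may be worth checking: the failing requests are for `/check_pmp450_cpu` and `/check_timing_status_365`, with no directory in front, while the check_drobo request carries the full `/usr/lib/nagios/plugins/` prefix. A path shaped like that is what an empty directory prefix joined to a file name produces. The sketch below only illustrates that effect; `build_check_path` and `PLUGIN_DIR` are hypothetical names, and whether the custom checks build their command this way is an assumption, not something the thread confirms.

```shell
#!/bin/sh
# Join a plugin directory and a check name the way a plain
# "dir + / + name" concatenation would. Hypothetical helper.
build_check_path() {
    printf '%s/%s\n' "$1" "$2"
}

# With the directory set, the path points into the plugin folder.
PLUGIN_DIR="/usr/lib/nagios/plugins"
build_check_path "$PLUGIN_DIR" "check_pmp450_cpu"

# With the directory empty, the result starts at the filesystem root,
# which matches the shape of the "not found" errors in the thread.
PLUGIN_DIR=""
build_check_path "$PLUGIN_DIR" "check_pmp450_cpu"
```

If that is what is happening, the files can be present and executable in the plugin directory and still be reported as not found, because the shell is looking for them in `/`.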

Brandon_Shiers · 8 July 2019 12:35

Given that we made NO other changes to the servers I have a hard time swallowing that, but I will check.