LibreNMS stopped pushing alerts via Transport


Hi Team,

I am using a distributed poller setup with two pollers, and the database servers run in a Galera Cluster.

./validate.php

Component | Version
--------- | -------
LibreNMS | 1.69-1-gbc02ab3f6
DB Schema | 2020_07_27_00522_alter_devices_snmp_algo_columns (188)
PHP | 7.2.24-0ubuntu0.18.04.6
Python | 3.6.9
MySQL | 10.1.47-MariaDB-0ubuntu0.18.04.1
RRDTool | 1.7.0
SNMP | NET-SNMP 5.7.3
OpenSSL |

====================================

[OK] Composer Version: 1.10.22
[OK] Dependencies up-to-date.
[OK] Database connection successful
[OK] Database schema correct
[WARN] PHP version 7.3 is the minimum supported version as of November, 2020. We recommend you update PHP to a supported version (7.4 suggested) to continue to receive updates. If you do not update PHP, LibreNMS will continue to function but stop receiving bug fixes and updates.
[WARN] Your install is over 24 hours out of date, last update: Tue, 03 Nov 2020 01:56:49 +0000
[FIX]:
Make sure your daily.sh cron is running and run ./daily.sh by hand to see if there are any errors.
[WARN] Your local git branch is not master, this will prevent automatic updates.
[FIX]:
You can switch back to master with git checkout master


Issue

Since yesterday, our LibreNMS has stopped sending alerts via Slack. I have tested the Alert Transports and I do receive the test alerts. I have also tried running ./alerts.php, but it produces no output. When I add the debug flag, it only shows the following:

DEBUG!
SQL[update cache_locks set owner = ?, expiration = ? where key = ? and (owner = ? or expiration <= ?) ["BNcin1MMPQPHO9AE",1619730704,"laravel_cachealerts","BNcin1MMPQPHO9AE",1619644304] 0.48ms]
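
The two numbers bound in that query look like Unix timestamps, and decoding them shows how long the `laravel_cachealerts` lock would be held. The following is a minimal sketch, not LibreNMS code: the helper names are hypothetical, and reading the last parameter as "current time" is an assumption based on the `expiration <= ?` comparison. The two values are 86400 seconds apart, i.e. 24 hours.

```python
from datetime import datetime, timezone

# Bound parameters copied from the debug line above, in placeholder order:
# owner, expiration, key, owner, (assumed) current time.
EXPIRATION = 1619730704
CHECKED_AT = 1619644304


def lock_window_seconds(expiration_ts, now_ts):
    """Seconds between the assumed current time and the lock expiration."""
    return expiration_ts - now_ts


def as_utc(ts):
    """Render a Unix timestamp as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()


window = lock_window_seconds(EXPIRATION, CHECKED_AT)
print(f"lock window: {window} s ({window / 3600:.0f} h)")
print(f"checked at : {as_utc(CHECKED_AT)}")
print(f"expires at : {as_utc(EXPIRATION)}")
```

If another process already owns that lock and its expiration is still in the future, this update would match no rows, which may be why the script exits without doing anything else.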

======================

I have also run test-alert.php, and when I do, I receive the alerts in my Slack channels.

In addition, I ran test-template.php; the following is the output of that command:

SQL[select * from devices where hostname = ? limit 1 ["10.1.96.196"] 0.75ms]

SQL[SELECT alerts.id, alerts.alerted, alerts.device_id, alerts.rule_id, alerts.state, alerts.note, alerts.info FROM alerts WHERE alerts.device_id=358 && alerts.rule_id=76 [] 0.49ms]

SQL[SELECT alert_log.id,alert_log.rule_id,alert_log.device_id,alert_log.state,alert_log.details,alert_log.time_logged,alert_rules.rule,alert_rules.severity,alert_rules.extra,alert_rules.name,alert_rules.query,alert_rules.builder,alert_rules.proc FROM alert_log,alert_rules WHERE alert_log.rule_id = alert_rules.id && alert_log.device_id = ? && alert_log.rule_id = ? && alert_rules.disabled = 0 ORDER BY alert_log.id DESC LIMIT 1 [358,76] 0.46ms]

SQL[SELECT DISTINCT a.* FROM alert_rules a
LEFT JOIN alert_device_map d ON a.id=d.rule_id AND (a.invert_map = 0 OR a.invert_map = 1 AND d.device_id = ?)
LEFT JOIN alert_group_map g ON a.id=g.rule_id AND (a.invert_map = 0 OR a.invert_map = 1 AND g.group_id IN (SELECT DISTINCT device_group_id FROM device_group_device WHERE device_id = ?))
LEFT JOIN alert_location_map l ON a.id=l.rule_id AND (a.invert_map = 0 OR a.invert_map = 1 AND l.location_id IN (SELECT DISTINCT location_id FROM devices WHERE device_id = ?))
LEFT JOIN device_group_device dg ON g.group_id=dg.device_group_id AND dg.device_id = ?
WHERE a.disabled = 0 AND (
(d.device_id IS NULL AND g.group_id IS NULL)
OR (a.invert_map = 0 AND (d.device_id=? OR dg.device_id=?))
OR (a.invert_map = 1 AND (d.device_id != ? OR d.device_id IS NULL) AND (dg.device_id != ? OR dg.device_id IS NULL))
) [358,358,358,358,358,358,358,358] 0.63ms]

SQL[SELECT hostname, sysName, sysDescr, sysContact, os, type, ip, hardware, version, purpose, notes, uptime, status, status_reason, locations.location FROM devices LEFT JOIN locations ON locations.id = devices.location_id WHERE device_id = ? [358] 0.44ms]

SQL[select * from devices_attribs where devices_attribs.device_id = ? and devices_attribs.device_id is not null [358] 0.38ms]

SQL[select * from device_perf where device_id = ? order by timestamp desc limit 1 [358] 0.41ms]

SQL[select * from alert_templates where exists (select * from alert_template_map where alert_templates.id = alert_template_map.alert_templates_id and alert_rule_id = ?) limit 1 [76] 0.46ms]

SQL[select * from alert_templates where name = ? limit 1 ["Default Alert Template"] 0.44ms]

Array
(
[hostname] => 10.1.96.196
[sysName] => vik-jumpbox
[sysDescr] => Hardware: Intel64 Family 6 Model 37 Stepping 1 AT/AT COMPATIBLE - Software: Windows Version 6.3 (Build 19042 Multiprocessor Free)
[sysContact] =>
[os] => windows
[type] => server
[ip] =>
[hardware] => Intel x64
[version] => 10 (NT 6.3)
[serial] =>
[features] =>
[location] =>
[uptime] => 795
[uptime_short] => 13m 15s
[uptime_long] => 13 minutes 15 seconds
[description] =>
[notes] =>
[alert_notes] =>
[device_id] => 358
[rule_id] => 76
[id] => 35149
[proc] =>
[status] => 0
[status_reason] => icmp
[ping_timestamp] =>
[ping_loss] => 100
[ping_min] => 0
[ping_max] => 0
[ping_avg] => 0
[debug] => Array
(
)

[title] => Alert for device 10.1.96.196 - P3_ASE_Servers_Device Down! Due to no ICMP response.
[faults] => Array
    (
        [1] => Array
            (
                [device_id] => 358
                [inserted] => 2021-04-28 20:30:34
                [hostname] => 10.1.96.196
                [sysName] => vik-jumpbox
                [ip] => 
                [overwrite_ip] => 
                [community] => public
                [authlevel] => 
                [authname] => 
                [authpass] => 
                [authalgo] => 
                [cryptopass] => 
                [cryptoalgo] => 
                [snmpver] => v2c
                [port] => 161
                [transport] => udp
                [timeout] => 
                [retries] => 
                [snmp_disable] => 0
                [bgpLocalAs] => 
                [sysObjectID] => .1.3.6.1.4.1.311.1.1.3.1.1
                [sysDescr] => Hardware: Intel64 Family 6 Model 37 Stepping 1 AT/AT COMPATIBLE - Software: Windows Version 6.3 (Build 19042 Multiprocessor Free)
                [sysContact] => 
                [version] => 10 (NT 6.3)
                [hardware] => Intel x64
                [features] => Multiprocessor
                [location_id] => 
                [os] => windows
                [status] => 0
                [status_reason] => icmp
                [ignore] => 0
                [disabled] => 0
                [uptime] => 795
                [agent_uptime] => 0
                [last_polled] => 2021-04-29 06:55:26
                [last_poll_attempted] => 
                [last_polled_timetaken] => 2.7
                [last_discovered_timetaken] => 3.06
                [last_discovered] => 2021-04-29 06:53:10
                [last_ping] => 2021-04-29 06:55:26
                [last_ping_timetaken] => 0.95
                [purpose] => 
                [type] => server
                [serial] => 
                [icon] => windows.svg
                [poller_group] => 6
                [override_sysLocation] => 0
                [notes] => 
                [port_association_mode] => 1
                [max_depth] => 0
                [disable_notify] => 0
                [string] => sysObjectID = .1.3.6.1.4.1.311.1.1.3.1.1; sysDescr = Hardware: Intel64 Family 6 Model 37 Stepping 1 AT/AT COMPATIBLE - Software: Windows Version 6.3 (Build 19042 Multiprocessor Free); 
            )

    )

[elapsed] => 1m 56s
[builder] => {"condition":"AND","rules":[{"id":"macros.device_down","field":"macros.device_down","type":"integer","input":"radio","operator":"equal","value":"1"},{"id":"devices.status_reason","field":"devices.status_reason","type":"string","input":"text","operator":"equal","value":"icmp"}],"valid":true}
[uid] => 35149
[alert_id] => 4558
[severity] => critical
[rule] => 
[name] => P3_ASE_Servers_Device Down! Due to no ICMP response.
[timestamp] => 2021-04-29 07:00:26
[contacts] => Array
    (
    )

[state] => 1
[alerted] => 0
[transport] => slack
[msg] => Alert for device 10.1.96.196 - P3_ASE_Servers_Device Down! Due to no ICMP response.

Severity: critical
Timestamp: 2021-04-29 07:00:26
Unique-ID: 35149
Rule: P3_ASE_Servers_Device Down! Due to no ICMP response. Faults:
1: sysObjectID = .1.3.6.1.4.1.311.1.1.3.1.1; sysDescr = Hardware: Intel64 Family 6 Model 37 Stepping 1 AT/AT COMPATIBLE - Software: Windows Version 6.3 (Build 19042 Multiprocessor Free);
Alert sent to:

)

/etc/cron.d/librenms ----> Master poller

# Using this cron file requires an additional user on your system, please see install docs.

33 */6 * * * librenms /opt/librenms/cronic /opt/librenms/discovery-wrapper.py 1
*/5 * * * * librenms /opt/librenms/discovery.php -h new >> /dev/null 2>&1
*/5 * * * * librenms /opt/librenms/cronic /opt/librenms/poller-wrapper.py 16
* * * * * librenms /opt/librenms/alerts.php >> /dev/null 2>&1
*/5 * * * * librenms /opt/librenms/poll-billing.php >> /dev/null 2>&1
01 * * * * librenms /opt/librenms/billing-calculate.php >> /dev/null 2>&1
*/5 * * * * librenms /opt/librenms/check-services.php >> /dev/null 2>&1

# Daily maintenance script. DO NOT DISABLE!
# If you want to modify updates:
#  Switch to monthly stable release: https://docs.librenms.org/General/Releases/
#  Disable updates: https://docs.librenms.org/General/Updating/
15 0 * * * librenms /opt/librenms/daily.sh >> /dev/null 2>&1

/etc/cron.d/librenms ----> Second poller

----> While troubleshooting, I commented out the alerts cron job on the second poller

# Using this cron file requires an additional user on your system, please see install docs.

33 */6 * * * librenms /opt/librenms/cronic /opt/librenms/discovery-wrapper.py 1
*/5 * * * * librenms /opt/librenms/discovery.php -h new >> /dev/null 2>&1
*/5 * * * * librenms /opt/librenms/cronic /opt/librenms/poller-wrapper.py 16
#* * * * * librenms /opt/librenms/alerts.php >> /dev/null 2>&1
*/5 * * * * librenms /opt/librenms/poll-billing.php >> /dev/null 2>&1
01 * * * * librenms /opt/librenms/billing-calculate.php >> /dev/null 2>&1
*/5 * * * * librenms /opt/librenms/check-services.php >> /dev/null 2>&1

# Daily maintenance script. DO NOT DISABLE!
# If you want to modify updates:
#  Switch to monthly stable release: https://docs.librenms.org/General/Releases/
#  Disable updates: https://docs.librenms.org/General/Updating/
15 0 * * * librenms /opt/librenms/daily.sh >> /dev/null 2>&1

Regards,

Vik

I also wanted to add one more thing that I just noticed. I was checking the Recent Events of the device and I was not seeing the Transport event that normally shows up when an alert is triggered. It looks like LibreNMS is not able to start the transport process.

I have read multiple articles but have not been able to locate the root cause of this issue. I am attaching "Recent Events" snapshots of both the existing and the expected behaviour as supporting evidence.

Existing (screenshot: 2021-04-29 07_27_52-Window)

Expected - Issued to transport Slack

Are you using a proxy for your server?
What are you using for the distributed pollers: memcached, redis or mysql?

Hi @paulierco,

I am not using any proxy server.
I have two distributed pollers, one storage server running the RRDCached and Memcached services, and three database servers in a Galera Cluster.

The funny thing is, after I tried all the steps mentioned above, including disabling alerts.php in the cron job on the non-master poller, it started working after 24 hours, but it spat out a large number of alerts. It looks like the alerts waiting for transport were stuck in a queue somewhere.

You should always start by updating to the latest version.