Duplicate alert notifications

angryp · 25 September 2018 11:52

Hello!

We started to experience alert notifications to be sent out two times consecutively upon a single polling time, but it dies not happen every polling interval. The main problem here is that I cannot find a way to troubleshoot this behavior properly, as it is completely unpredictable.

From what we have found so far:

test-alert.php script does not send multiple alerts.
poller.php does send multiple alerts, but approximately once per two polling intervals. Polling time does not affect this behavior - despite the time it takes to poll a device, alerts can be sent multiple times.
The problem seems to have emerged after this commit: https://github.com/librenms/librenms/pull/9136
The problem persists with multiple alert rules and multiple devices, this is not related.
I think the problem is not related to transport, since multiple notification sending is logged in event log.
If we revert to older commit, the problem perishes: https://github.com/librenms/librenms/pull/9221

Event log sample: http://prntscr.com/kyhgjc

LibreNMS central validate output:

====================================
Component | Version
--------- | -------
LibreNMS  | 1.43-110-gd9b4fb5
DB Schema | 268
PHP       | 5.6.37-0+deb8u1
MySQL     | 5.7.20-log
RRDTool   | 1.6.0
SNMP      | NET-SNMP 5.7.2.1
====================================

LibreNMS poller validate output (with problem):

====================================
Component | Version
--------- | -------
LibreNMS  | 1.43-102-gd9b4fb5
DB Schema | 268
PHP       | 5.6.37-0+deb8u1
MySQL     | 5.7.20-log
RRDTool   | 1.6.0
SNMP      | NET-SNMP 5.7.2.1
====================================

LibreNMS poller validate output (without problem, same poller):

====================================
Component | Version
--------- | -------
LibreNMS  | 1.43-102-g2757232
DB Schema | 268
PHP       | 5.6.37-0+deb8u1
MySQL     | 5.7.20-log
RRDTool   | 1.6.0
SNMP      | NET-SNMP 5.7.2.1
====================================

Does anybody know how can I troubleshoot this behavior?

angryp · 26 September 2018 05:38

Update: false information given in the description, source version 1.43-102-g2757232 is also affected by the issue - duplicate alerts indeed are sent.

angryp · 26 September 2018 08:44

Uh-huh, I think we found it. We had multiple pollers running as cluster masters (new poller services, the ones with Redis) and after inspecting service.py it came clear they schedule alerts.php execution on master nodes. Multiple masters - multiple alerts.
Will keep an eye on this, but theory seems stable enough.

angryp · 26 September 2018 13:18

And another update: alerts are still duplicated even with a single cluster master, so there is no connection.
Redis server log also does not contain any errors. I guess we can switch to verbose logging just in case…

murrant · 26 September 2018 13:40

Odd you shouldn’t be able to have multiple masters.

angryp · 27 September 2018 13:49

@murrant, I can make some troubleshooting if I ever see this behavior? If only I knew what could help. Will strace be enough?

So here is what we had found today: even if you have only one master in a cluster, alerts are being launched on a portion of pollers. Two pollers are not executing alerts, master node also remains silent, while a lot of others are executing alerts.php once per minute on different machines. And there is no overlapping between pollers: alerts.php is never executed twice at the same minute.

Can it be that we see duplicate alerts when poller.php and alerts.php are being executed at the same time? Possible, will try to track down when polling finishes for device_id and alerts are executed.

Poller #1:

2018-09-27 13:31:00,192 Alerting(INFO):Checking alerts

Poller#2:

2018-09-27 13:17:00,175 Alerting(INFO):Checking alerts
2018-09-27 13:18:00,172 Alerting(INFO):Checking alerts
2018-09-27 13:21:00,181 Alerting(INFO):Checking alerts
2018-09-27 13:23:00,190 Alerting(INFO):Checking alerts
2018-09-27 13:27:00,185 Alerting(INFO):Checking alerts
2018-09-27 13:28:00,183 Alerting(INFO):Checking alerts
2018-09-27 13:33:00,182 Alerting(INFO):Checking alerts

Poller #3:

2018-09-27 13:14:00,278 Alerting(INFO):Checking alerts
2018-09-27 13:25:00,283 Alerting(INFO):Checking alerts

Poller #4:

2018-09-27 13:09:00,121 Alerting(INFO):Checking alerts
2018-09-27 13:24:00,134 Alerting(INFO):Checking alerts
2018-09-27 13:30:00,130 Alerting(INFO):Checking alerts
2018-09-27 13:35:00,138 Alerting(INFO):Checking alerts
2018-09-27 13:37:00,134 Alerting(INFO):Checking alerts

Poller #5:

2018-09-27 13:19:00,131 Alerting(INFO):Checking alerts

Poller #6:

2018-09-27 13:12:00,165 Alerting(INFO):Checking alerts
2018-09-27 13:22:00,171 Alerting(INFO):Checking alerts
2018-09-27 13:34:00,232 Alerting(INFO):Checking alerts
2018-09-27 13:39:00,239 Alerting(INFO):Checking alerts

Poller #7:

2018-09-27 13:15:13,299 Alerting(INFO):Checking alerts
2018-09-27 13:40:13,378 Alerting(INFO):Checking alerts

Poller #8:

Sep 27 13:16:00 librenms-service.py[20817]: Alerting(INFO):Checking alerts
Sep 27 13:20:00 librenms-service.py[20817]: Alerting(INFO):Checking alerts
Sep 27 13:38:00 librenms-service.py[20817]: Alerting(INFO):Checking alerts

Poller #9:

Sep 27 13:29:00 librenms-service.py[26280]: Alerting(INFO):Checking alerts
Sep 27 13:32:00 librenms-service.py[26280]: Alerting(INFO):Checking alerts
Sep 27 13:36:00 librenms-service.py[26280]: Alerting(INFO):Checking alerts

angryp · 28 September 2018 09:51

Yes, duplicate alerts are only sent out at the beginning of a minute, within first 20 seconds, which leads to think they are indeed sourced from alert manager in service.py.
With previous poller-service.py we had alerts.php cron job scheduled only on master instance - then there were no duplicates sent.

I had commented out alert manager calls in service.py in a test environment and service works as expected without executing any alerts.php calls. Interestingly enough, if we change settings of alerting from alerting = PollerConfig(1, 60) to alerting = PollerConfig(0, 60), alert manager will still be executed. Same applies to frequency - if alter it from 60 to something like 60000, it will still execute poll_alerting function every 1 minute.

For now, we stripped out every dispatch_alerting and alerting_manager calls and judging from debug output they are not executed. I guess on Monday we can carry out a test in production environment to see if alerts will still be duplicated. I will post more results when the time comes.

What can be told so far - yes, new poller service is much more efficient. Polling times are reduced by a good amount and MySQL connection count and operation count have been reduced almost 3 times!

angryp · 3 October 2018 15:02

And the final update, maybe it will help anyone in the future.
First off, we had commented out all alert manager calls on all the pollers except master node and duplicate alerts have vanished. Together with all alerting, except for those, which were generated from poller.php execution. This is when we had discovered that master executor floats between all the pollers all the time, despite having a single master node in cluster.
Knowing this, we commented out alert manager on central LibreNMS node as well and enabled alerts.php execution the old-fashioned way, via cron job.

This eliminated duplicate alerts and proved to be working for 3 days already - seems stable enough.

murrant · 9 October 2018 02:49

I don’t think the master should float. Have you tried inspecting redis values while it is running?

I’m pretty curious as the new python service seems pretty good overall, but would like to work out any small bugs like this.

angryp · 11 October 2018 08:42

Redis value sample by querying “KEYS *”: https://gist.github.com/angryp/c75ba088a480fca8f42f3b68a32dd69c
Redis log sample contents (continuing for a long time):

31501:M 11 Oct 08:28:35.076 * 10 changes in 300 seconds. Saving...
31501:M 11 Oct 08:28:35.080 * Background saving started by pid 6700
6700:C 11 Oct 08:28:35.095 * DB saved on disk
6700:C 11 Oct 08:28:35.098 * RDB: 1 MB of memory used by copy-on-write
31501:M 11 Oct 08:28:35.184 * Background saving terminated with success

Pollers section screenshot: http://prntscr.com/l4rgvj

All pollers are running with INFO log level. One of the hosts that declared himself a Cluster master has the following entry in the logs:

2018-10-10 18:26:57,534 Poller_18-158(ERROR):Poller poller exception! [Errno 24] Too many open files
Traceback (most recent call last):
  File "/add_disk/librenms/LibreNMS/queuemanager.py", line 58, in _service_worker
    work_func(device_id)
  File "/add_disk/librenms/LibreNMS/service.py", line 445, in poll_device
    self.call_script('poller.php', ('-h', device_id))
  File "/add_disk/librenms/LibreNMS/service.py", line 571, in call_script
    return subprocess.check_output(cmd, stderr=subprocess.STDOUT, preexec_fn=os.setsid, close_fds=True).decode()
  File "/usr/lib/python3.5/subprocess.py", line 316, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.5/subprocess.py", line 383, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.5/subprocess.py", line 676, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.5/subprocess.py", line 1185, in _execute_child
    errpipe_read, errpipe_write = os.pipe()
OSError: [Errno 24] Too many open files

Interesting, this may be the reason. I will update ulimit and restart pollers - will see how it will go.

Here is how switching happens:

2018-10-01 10:28:23,799 poller(INFO):poller is now the master dispatcher
2018-10-01 10:29:12,521 poller(INFO):Creating queue poller:4
2018-10-01 10:29:12,931 poller(INFO):Creating queue discovery:4
2018-10-01 10:29:13,843 poller(INFO):Creating queue poller:8
2018-10-01 10:29:16,791 poller(INFO):Creating queue poller:10
2018-10-01 10:29:24,056 poller(INFO):Creating queue poller:2
2018-10-01 10:29:58,844 poller(INFO):Creating queue poller:14
2018-10-01 10:30:00,384 performance(INFO):Counting up time spent polling
2018-10-01 10:30:00,801 Thread-5(INFO):Using pure python SQL client
2018-10-01 10:30:03,833 poller(INFO):Creating queue poller:11
2018-10-01 10:30:04,351 poller(INFO):Creating queue discovery:11
2018-10-01 10:30:04,868 poller(INFO):Creating queue poller:3
2018-10-01 10:30:10,686 poller(INFO):Creating queue discovery:2
2018-10-01 10:30:11,415 poller(INFO):Creating queue poller:1
2018-10-01 10:30:12,659 poller(INFO):Creating queue discovery:1
2018-10-01 10:30:13,376 poller(INFO):Creating queue discovery:8
2018-10-01 10:30:16,156 poller(INFO):Creating queue poller:12
2018-10-01 10:30:16,666 poller(INFO):Creating queue discovery:12
2018-10-01 10:30:26,956 poller(INFO):Creating queue poller:13
2018-10-01 10:30:27,476 poller(INFO):Creating queue discovery:13
2018-10-01 10:30:42,948 poller(INFO):Creating queue poller:22
2018-10-01 10:30:47,608 poller(INFO):Creating queue poller:19
2018-10-01 10:30:49,767 poller(INFO):Creating queue poller:5
2018-10-01 10:30:50,177 poller(INFO):Creating queue discovery:5
2018-10-01 10:31:02,101 poller(INFO):poller is no longer the master dispatcher

GitStoph · 14 February 2019 15:22

@angryp - Did you find a solution to this other than commenting out the Alert Manager? I’m seeing the cluster master shift between my multiple pollers as well. pollers/tab=pollers/ is showing as ALL of my pollers believing they’re the master. The poller service is running otherwise as expected and efficiently… I’ve also increased my open file limits and let it sit overnight, and the issue reoccurred. If i restart the service on all of my pollers, only one will take the master role, but over time they all claim master status.

What log info would be useful in cleaning this up? Thanks-

murrant · 15 February 2019 05:00

The too many open files is a bug with older versions of python 3. Try to run 3.6 or 3.7 with the LibreNMS service.

angryp · 21 February 2019 05:16

Hello @GitStoph, sorry for a late reply.

I did not find a solution aside from commenting out alerting manager, because without it and the old-style alert processing LibreNMS works like a charm. I had not seen any issues in performance or processing when “multiple masters” emerge over time, so I ignore it for now.

Compared to the older version of poller process this one is a Heaven’s blessing from stability POV.

GitStoph · 22 February 2019 14:55

Agreed. I still get multiple masters, and the process has only crashed once across my four pollers thus far. Interestingly, when the service would fail with the old poller, it’d only be on that one server. When the service crashes now, it happens on all four VMs.

Other than that minor discrepancy, it’s been good. Commenting out the alert manager part of the script has worked fine thus far, thanks for catching that!

jasonpfowler · 25 February 2019 14:41

Any status on this issue?

I have the main system and 2 pollers I have at least 2 cluster masters. This setup may be incorrect where the server used for GUI access has “distributed_poller = false” while the other two are “true”. This how I had things set up with the old cron jobs and never changed it after moving to the service.

This is interesting: I have one active Alert Rule that sends an email and Slack post. Only the SMTP email gets sent twice.

murrant · 4 March 2019 19:59

There is a known bug that the email alerting cannot de-duplicate emails so if it gets added via two different things to the email recipient list, it gets sent twice. I started to try to fix this, but it is a lot more complex to fix than it first seems.