Ability to recheck services before sending an alert


#1

It would be great if we could configure the ability to recheck a service (or device) before an alert is sent to prevent false positives.

According to the Help forums, the current workaround is to delay alert notifications beyond the poller interval (5 minutes by default). It would be better to have a mechanism where, when a check fails, it is retried after 2 seconds; if that fails, it is retried after 5 seconds; and if that also fails, the alert is sent. We would then get a more timely alert for a down service instead of waiting another 5 minutes with the service down.
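
To make the request concrete, here is a rough Python sketch of the recheck logic I have in mind. It is only an illustration: `check_with_rechecks`, its parameters, and the 2 and 5 second delays are hypothetical and are not an existing poller option.

```python
import time
from typing import Callable, Sequence


def check_with_rechecks(
    check: Callable[[], bool],
    retry_delays: Sequence[float] = (2.0, 5.0),
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the check passes on the first try or on any recheck.

    `check` is a hypothetical callable that runs one service check and
    returns True when the service is up. Each retry starts a fresh
    attempt after the given delay instead of waiting longer on the
    failed one.
    """
    if check():
        return True
    for delay in retry_delays:
        sleep(delay)  # wait before starting a new attempt
        if check():
            return True
    return False  # every attempt failed, so the caller should alert
```

With the defaults, a service that is really down is confirmed after roughly 7 seconds of waiting plus the time the checks themselves take, rather than after another full poller interval.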

In practice, I was able to reduce the false positives for devices via the fping settings (which also include retries), but I still have issues with Nagios plugins for service checks. The closest equivalent would be the timeout option, but it typically defaults to 10 seconds according to the plugin documentation. It would be nicer to start a second connection than to wait longer on the plugin's single attempt.


#2

Using the current implementation, you can run the service checks more often than every 5 minutes and add some delay on alerting. That way, even if you get a false failure, the alert won’t be emitted right away and you’ll get a chance to have a second (or further) run.
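
To show the idea, here is a minimal Python sketch of the delay logic. It is not the actual alerting code: the `DelayedAlert` class, its `record` method, and the 60 second check interval with a 120 second delay are assumptions made up for the example.

```python
from typing import Optional


class DelayedAlert:
    """Fire an alert only once a failure has lasted `delay_seconds`."""

    def __init__(self, delay_seconds: float) -> None:
        self.delay_seconds = delay_seconds
        self.failing_since: Optional[float] = None

    def record(self, ok: bool, now: float) -> bool:
        """Record one check result; return True if an alert should fire."""
        if ok:
            self.failing_since = None  # recovery resets the timer
            return False
        if self.failing_since is None:
            self.failing_since = now  # first failure in this streak
        return now - self.failing_since >= self.delay_seconds


# Hypothetical timeline: checks every 60 seconds, alert delay of 120 seconds.
alert = DelayedAlert(delay_seconds=120)
timeline = [(0, False), (60, True), (120, False), (180, False), (240, False)]
fired = [alert.record(ok, now) for now, ok in timeline]
print(fired)  # [False, False, False, False, True]
```

The single failure at 0 seconds never alerts because the next run succeeds; only the failure streak that starts at 120 seconds and is still present at 240 seconds does.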