Simple warning from HA state change

sjohnson · 4 June 2018 18:16

Hi,

I had started to work on some HA monitoring for the Netscalers, which included a warning for an HA failover in the form of a state sensor, comparing the current value to the previous one. Here’s the alert in question:

devices.os = "Netscaler" AND sensors.sensor_type = "sysHighAvailabilityMode" AND sensors.sensor_current != `sensors.sensor_prev`

Originally, I thought that this would create an alert on the first “different” poll, after a state change, then recover the alert on the next poll, as the polled value would be the same as the previously polled value. A bit like what happens for new BGP sessions, but that counter checks if the session has been up for less than 5 minutes. However, it seems like the alert compares the “registered” values, and not the polled values, so the alarm stays up until the state actually reverts back.

Is there a way to make this alert auto-recover? The goal is to get a warning that there was a failover, both for debug and historical data, but not to actually keep this as an active alarm.

Thanks again!

murrant · 4 June 2018 18:31

FYI, in the alert collection, there is a rule to alert based on state sensor status.

sjohnson · 4 June 2018 19:00

In ./misc/alert_rules.json? Because those are the ones I had looked into to create the 3 Netscaler HA ones, but didn’t see one that compared the current and previous value, except for fixed values (like a number).

Just looked into the UI and didn’t see anything like that either, so I hope I’m not overlooking something obvious!

Kevin_Krumm · 4 June 2018 21:04

State sensors

sjohnson · 4 June 2018 21:21

OK, thanks. I had seen this alert, yes, but maybe I didn’t explain the issue well enough. The problem with my previous alert is that the state stays in “Warning” and never recovers.

What triggers the warning state is not any particular value, but just the fact that the value has changed.

The problem with using the above alert template is that the sensor does go into a warning state after an HA state change, but it never recovers until the HA state is back to the original value.

Does that make more sense? The goal really is just to get notified that there was a state change, nothing else. There’s another check for the actual “bad” states (other than primary or secondary, for example), and that one has hardcoded values that tell whether the state is OK, Warning, Critical or Unknown.

Thanks

Josh_Rabino · 7 June 2018 20:34

I run a similar alert for F5 load balancers. Had to dig into the database to figure out what was happening.

The problem is that the sensor_prev value only gets updated on a change (i.e. only if you fail back your Netscaler). To see this in action, you can run this on the db:

select sensor_id, sensor_current, sensor_prev, lastupdate from sensors
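
To illustrate why a plain "current != previous" rule never clears on its own, here is a rough Python sketch. It is not LibreNMS code: the Sensor class and the poll() logic are hypothetical and only model the behaviour described above, under the assumption that the previous value is overwritten only when a poll returns a different value.

```python
from dataclasses import dataclass


@dataclass
class Sensor:
    """Hypothetical stand-in for one row of the sensors table."""
    current: int
    prev: int

    def poll(self, value: int) -> None:
        # Assumption: prev is only overwritten when the polled value
        # differs from the stored one, as described in this thread.
        if value != self.current:
            self.prev = self.current
            self.current = value


def rule_matches(sensor: Sensor) -> bool:
    # Same comparison as the rule: sensor_current != sensor_prev
    return sensor.current != sensor.prev


sensor = Sensor(current=1, prev=1)
history = []
# One steady poll, then a failover (1 -> 2) followed by two steady polls.
for polled in [1, 2, 2, 2]:
    sensor.poll(polled)
    history.append(rule_matches(sensor))

print(history)
```

Under that assumption, the rule keeps matching on every poll after the failover, because the stored previous value still holds the pre-failover state.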

The fix I worked out is a bit janky but works as desired.

sensors.sensor_oid = ".1.3.6.1.4.1.3375.2.1.14.3.1.0" AND sensors.sensor_prev = 4 AND (sensors.sensor_current = 3 OR sensors.sensor_current = 2 OR sensors.sensor_current = 1) AND sensors.lastupdate >= DATE_SUB(NOW(),INTERVAL 15 MINUTE)
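
The time window is what makes the alert clear itself. A minimal Python sketch of the same condition follows; it is not how LibreNMS evaluates rules. The function name and arguments are made up for illustration, and it assumes lastupdate stays at the time of the state change rather than being refreshed on every poll.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=15)  # mirrors INTERVAL 15 MINUTE in the rule


def failover_rule_matches(prev: int, current: int,
                          lastupdate: datetime, now: datetime) -> bool:
    """Hypothetical equivalent of the rule, minus the OID filter."""
    changed_from_4 = prev == 4 and current in (1, 2, 3)
    recent = lastupdate >= now - WINDOW
    return changed_from_4 and recent


# Assumed timeline: the state changed from 4 to 3 at change_time.
change_time = datetime(2018, 6, 7, 12, 0, 0)

soon_after = failover_rule_matches(4, 3, change_time,
                                   change_time + timedelta(minutes=5))
much_later = failover_rule_matches(4, 3, change_time,
                                   change_time + timedelta(minutes=20))

print(soon_after, much_later)
```

With those assumptions the rule matches for the first 15 minutes after the change and then stops matching, so the alert recovers even though the stored current and previous values still differ.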

sjohnson · 11 June 2018 14:14

Excellent, thanks! I switched the sensors.lastupdate to >= for our needs, but that did it!