26.4.0 reset / corrupted sensor thresholds

Hi All,

Our system auto-updated to 26.4.0 and many of my sensor thresholds are wrong again, causing numerous temperature alerts to be active when they shouldn’t be.

This is very similar to what happened in version 25.8.0:

Is anyone else seeing this?

This seems to be particularly troublesome for sensors which normally have negative values, for example:

The negative readings are correct; it’s the low limit now being zero that is incorrect. Previously the lower limit would have been a large negative number.

All high and low limits seem to have been changed or reset from what they were before. (For example, in the screenshot above, Front Panel Temp previously had a manually set limit of 25, not 50.)

Furthermore, it appears that all sensors I had previously set to OFF (ignore and do not alert) are now ON again in the health view.

I haven’t attempted a fix yet. Last time I had to clear all the limits in bulk with a MySQL query, then run a full auto-detect to regenerate “default” limits for the sensors that were missing them (which repopulated some but not all of them), then manually tweak the ones that still weren’t suitable and manually disable the sensors I don’t want monitored.

I don’t think anything in 26.4.0 would have affected this, especially the OFF setting. I expect your sensors have been removed and re-added. Possibly the device was failing SNMP queries at the time for some reason?

Hi,

If the sensors have been removed and re-added, then that was done by the update. The last time this happened was way back when version 25.8.0 came out, and a similar thing happened then.

I’m not sure I understand what you’re suggesting - are you saying that any SNMP devices that are not responding during the version upgrade process will have all their limit values reset to automatically generated defaults?

If so, why? That doesn’t make any sense. I would have thought that limits should only be set or reset during initial discovery of the device.

I know the device in the screenshot has not had any downtime recently. It is a server that has not been rebooted for weeks, and the temperature values actually come from IPMI data from the BMC on the server hardware, so even if it had been rebooted, IPMI would still have responded.

All of the limit changes correspond exactly with the date the auto-upgrade happened.

I’ve done a manual browse across many devices and most of them seem to have their limits reset to very generic defaults, like minimum 0, maximum 120 for most temperature sensors.
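For anyone wanting to script that browse rather than click through devices, here is a minimal sketch of the check I was doing by eye. It assumes you have already exported the sensors into a list yourself; the `Sensor` class, its field names and the sample rows are all made up for illustration and are not the application's schema. The 0 / 120 pair is just the generic default I observed on temperature sensors.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sensor:
    # Hypothetical record layout, not the application's real schema.
    name: str
    current: float
    low: Optional[float]
    high: Optional[float]


def out_of_range(s: Sensor) -> bool:
    """True if the current reading already violates a configured limit."""
    below = s.low is not None and s.current < s.low
    above = s.high is not None and s.current > s.high
    return below or above


def looks_generic(s: Sensor, low: float = 0.0, high: float = 120.0) -> bool:
    """True if both limits match the generic default pair (0 / 120 here)."""
    return s.low == low and s.high == high


def suspects(sensors: List[Sensor]) -> List[str]:
    """Names of sensors that alert immediately or carry generic limits."""
    return [s.name for s in sensors if out_of_range(s) or looks_generic(s)]


# Made-up sample rows, purely to exercise the functions.
sample = [
    Sensor("example-negative", -12.0, 0.0, 120.0),  # reading below low limit
    Sensor("example-custom", 21.0, 5.0, 50.0),      # custom limits, in range
]
print(suspects(sample))
```

This only narrows down which sensors to look at; it does not change anything in the database.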

This is strange because in the past when a sensor was detected upper and lower limits were usually calculated as a range on either side of the current value - but here many limits were set that immediately cause the sensor to be in an error state.
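To illustrate what I mean by "a range on either side of the current value", here is a rough sketch. I don't know the actual formula the software uses, so the 20% margin and the minimum spread of 5 are pure assumptions on my part, and `limits_around` is a hypothetical name. The point is only that limits derived this way bracket the reading, including for negative values, so a freshly detected sensor never starts out in an error state.

```python
from typing import Tuple


def limits_around(current: float, margin: float = 0.2,
                  min_spread: float = 5.0) -> Tuple[float, float]:
    """Return (low, high) limits bracketing the current reading.

    margin and min_spread are assumed values for illustration only.
    abs() keeps the spread positive so negative readings also get a
    low limit below them and a high limit above them.
    """
    spread = max(abs(current) * margin, min_spread)
    return current - spread, current + spread


low, high = limits_around(-40.0)
print(low, high)  # low is below -40 and high is above it, both negative
```

Compare that with a fixed low limit of 0, which puts any sensor that normally reads below zero straight into alert.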

I guess I’ll need to wipe all limits again and force a rediscovery, as it’s too labour intensive to go through fixing them individually.

Just to add another data point - it looks like the indexing of sensors on some devices has changed again like it did last time.

I have some temperature sensors linked directly on a dashboard (using the web widget deep linking to the sensor graph), and some of these now point at the wrong sensor on the same device because the sensors have been renumbered.
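If you happen to have a list of sensor IDs and descriptions from before the upgrade (from a backup, say), a comparison like the sketch below would show which dashboard links need fixing. The ID-to-description mappings and the `moved_sensors` name are hypothetical; how you export them is up to you, and the sample values are invented.

```python
from typing import Dict, Optional, Tuple


def moved_sensors(before: Dict[int, str],
                  after: Dict[int, str]) -> Dict[int, Tuple[str, Optional[str]]]:
    """Map each sensor ID whose description changed to (old, new).

    new is None when the ID no longer exists after the upgrade.
    """
    moved = {}
    for sensor_id, old_descr in before.items():
        new_descr = after.get(sensor_id)
        if new_descr != old_descr:
            moved[sensor_id] = (old_descr, new_descr)
    return moved


# Invented example: two sensors swapped IDs, one is unchanged.
before = {101: "Inlet", 102: "Exhaust", 103: "CPU"}
after = {101: "Exhaust", 102: "Inlet", 103: "CPU"}
print(moved_sensors(before, after))
```

Any ID that shows up in the result is one where a deep link by ID now lands on a different sensor.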