OK, it seems like I’m going crazy, but this has happened too many times for that. I manually change some sensor / health limits to what I want (e.g. Vcore for my CPU) … and then a day or two later, they are reset (automatically, it seems). So I change them again, and the cycle repeats.
Is there some setting that is enabling this override of my manual settings?
It does not happen in general, but sometimes, with certain devices/OSes, the sensors are deleted and created again. In that case you will lose any manual settings.
Could you check whether your sensors are being deleted/recreated?
If yes, the next step is to understand why and fix it.
Yes, I do think that’s it - I say that because the (Ubuntu Linux) server that this is running on … well, the other day I happened to notice that it had “lost” all its drives (in LibreNMS), except for two NFS mounts. For example, even the root ("/") partition was “gone”. Not sure why, but I do see devices and sensors dropping on this machine - and it’s the LibreNMS server, so clearly up.
Do you have any CPU overloading on your LibreNMS server? Or on the device being monitored? Any high latency?
I have this kind of behaviour with an old Mac Mini running Debian, where the sensors are changing their ID randomly during reboot. So LibreNMS discovers new sensors after 50% of reboots.
I don’t think so (could be wrong of course). The server and machine being monitored are the same (though I have seen this with other “clients” as well). As for overload, here is top output,
Yep, agreed. As this happens very infrequently (i.e. can go a few weeks between occurrences) - I need to figure out how to get logging beefed up, to help debug.
Arrgh - happened again today … and no OS updates, not even a reboot. Only “change” is the daily update to LibreNMS - I don’t see sensors going away / being re-added, but custom settings are reset again.
Is there a way to have debug output captured for all polls (for a single device)? Just asking because it’s very random, really need to capture it all to debug.
I have a Linux MacMini which shows a similar pattern. After reboot, thermal sensors and fans will change OID, flapping between 2 different values. So the sensor gets recreated with new default min/max values, and even more fun, the RRDs are kept, so I have one RRD for OID1 and one RRD for OID2, and the sum of both covers 100% of the time …
Don’t know how and why the SNMP agent behaves like this.
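One way to confirm this kind of renumbering is to save the text output of an snmpwalk before and after a reboot and compare which index each sensor name ended up on. Below is a rough sketch of that comparison, not anything LibreNMS ships. It assumes the walk prints lines in the usual `COLUMN.index = STRING: name` form; the two sample captures and the helper names `parse_walk` and `renumbered` are made up for illustration.

```python
import re

# Matches lines such as: LM-SENSORS-MIB::lmTempSensorsDevice.3 = STRING: Core 0
LINE_RE = re.compile(
    r'^(?P<column>\S+)\.(?P<index>\d+) = STRING: "?(?P<name>.*?)"?\s*$'
)


def parse_walk(text):
    """Map sensor name -> numeric index from snmpwalk-style text output."""
    names = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            names[match.group("name")] = int(match.group("index"))
    return names


def renumbered(before, after):
    """Return {name: (old_index, new_index)} for sensors whose index changed."""
    old, new = parse_walk(before), parse_walk(after)
    return {
        name: (old[name], new[name])
        for name in old
        if name in new and old[name] != new[name]
    }


# Hypothetical captures, invented for this example (not from a real device).
BEFORE = """
LM-SENSORS-MIB::lmTempSensorsDevice.1 = STRING: Core 0
LM-SENSORS-MIB::lmTempSensorsDevice.2 = STRING: Core 1
LM-SENSORS-MIB::lmTempSensorsDevice.3 = STRING: temp1
"""

AFTER = """
LM-SENSORS-MIB::lmTempSensorsDevice.1 = STRING: Core 0
LM-SENSORS-MIB::lmTempSensorsDevice.5 = STRING: Core 1
LM-SENSORS-MIB::lmTempSensorsDevice.6 = STRING: temp1
"""

if __name__ == "__main__":
    for name, (old_idx, new_idx) in renumbered(BEFORE, AFTER).items():
        print(f"{name}: index {old_idx} -> {new_idx}")
```

If the same sensor name shows up under a different index between the two captures, the agent is renumbering and the sensor will look like a brand new one to the NMS.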
Unfortunately, it seems to be an issue in the device itself, not in LNMS. LNMS only receives the SNMP data and cannot change it.
I did not find much description on how lm-sensors defines the IDs…
But it’s not just lm-sensors … my (custom) threshold for SSD (storage) usage is also being reset. Or is it that when lm-sensors changes, all limits are being changed / reset?
OK, found the trigger I think! Not good detective work on my part, I sort of tripped over it.
I happened to run some OS upgrades, and found that there was a misalignment between reported storage usage and reality. So I triggered a rediscover from the UI. It worked, but it also reset all of my thresholds! Or at least for storage, the one I was looking at.
So it seems - on reboot (and perhaps kernel update, or some other trigger?), rediscover is being run - and in the process, changing the thresholds.
Thresholds are set only when the sensor/storage/entity is created. But if for some reason the sensor/storage/entity changes ID, meaning it is deleted and created again right after, then you end up with reset thresholds.
Yep, that is a behaviour I see with at least 1 MacMini running Debian, net-snmp with lm-sensors. Don’t know why NetSNMP keeps renumbering the sensors, but each time they change ID, they get deleted/rediscovered and all thresholds are reset. There is nothing really LibreNMS can do about it, because the ID is the only way to identify a sensor from one poll/discovery to another.
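To make the mechanism concrete, here is a toy model of index-based matching. It is only an illustration of the behaviour described above, not LibreNMS code: the function `rediscover`, the `DEFAULT_LIMIT` value and the data layout are all invented for the example.

```python
DEFAULT_LIMIT = 80.0  # hypothetical default threshold applied when a sensor is created


def rediscover(known, found_indexes):
    """Toy model of discovery that identifies sensors purely by index.

    known: dict mapping index -> current limit (possibly user-customised)
    found_indexes: indexes reported by the agent in this discovery run
    """
    result = {}
    for idx in found_indexes:
        if idx in known:
            # Same index as before: treated as the same sensor, limit is kept.
            result[idx] = known[idx]
        else:
            # Unknown index: treated as a new sensor, created with the default.
            result[idx] = DEFAULT_LIMIT
    # Indexes that were known but not found are dropped (sensor "deleted").
    return result


sensors = {"1": 95.0}                  # user raised the limit from 80 to 95
same = rediscover(sensors, ["1"])      # agent still reports index 1
moved = rediscover(sensors, ["2"])     # agent renumbered the sensor to index 2

if __name__ == "__main__":
    print("index unchanged:", same)
    print("index changed:  ", moved)
```

In the second case the physical sensor is the same, but since only the index is compared, the old entry is dropped along with its custom limit and the new entry starts from the default.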