I’m provisioning some new Ubuntu 24.04 VMs, and naturally I need to reboot them a few times. I’ve already added them to LibreNMS via SNMP, but when they get rebooted (at least so far) they suddenly lose their CPUs in LibreNMS until I tell it to rediscover the damned things.
The Event Log for each host shows entries along these lines:

Processor Removed: hr 196610 QEMU Virtual version 2.5+

Each one appears within about 8 seconds of the preceding “Device rebooted after…” entry.
The hypervisor the VMs run on is Proxmox VE (Linux KVM), and their CPU type is set to “x86-64-v2-AES”.
I really don’t know why this is happening and I’m not sure what to do about it.
Validate doesn’t show any errors, and as far as I know no other hosts monitored in the same LibreNMS instance are doing this.
I can only assume that the index is changing for each processor we detect, which shouldn’t happen. Run:

/usr/bin/snmpbulkwalk -Cr10 -v2c -c COMMUNITY -OQUs -m HOST-RESOURCES-MIB -M /opt/librenms/mibs HOSTNAME hrProcessorLoad

Replace COMMUNITY and HOSTNAME as appropriate. Then reboot the VM, re-run it, and compare the index values you see after the “.” and before the “=”.
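Hypothetically, the comparison might look something like this (made-up load values; what matters is whether the indexes after the dot change across the reboot):

# before reboot
hrProcessorLoad.196608 = 3
hrProcessorLoad.196609 = 1

# after reboot
hrProcessorLoad.196610 = 0
hrProcessorLoad.196611 = 2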
I have v2c disabled (as it should be). Apart from changing the protocol declaration and adding the auth parameters, is there anything else I should adjust for that?
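For the record, the v3 equivalent I’d run looks something like this (security level, protocols, and credentials are placeholders; match them to your snmpd config):

/usr/bin/snmpbulkwalk -Cr10 -v3 -l authPriv -u SNMPUSER -a SHA -A AUTHPASS -x AES -X PRIVPASS -OQUs -m HOST-RESOURCES-MIB -M /opt/librenms/mibs HOSTNAME hrProcessorLoad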
Also, this happened again for three more VMs I just provisioned with generally the same CPU type settings, etc.; after rebooting them, the CPUs were lost from their device pages in the LibreNMS web GUI.
Actually, one minute: the metrics come back after the snmpd daemon has been running for at least one minute, whether that’s from a host reboot or a daemon restart.
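A quick loop like this makes the window easy to measure (same COMMUNITY/HOSTNAME placeholders as earlier, shown with v2c for brevity; I’d swap in the v3 flags):

# run right after restarting snmpd on the VM
t0=$(date +%s)
# an instance line looks like "hrProcessorLoad.196608 = 3", so match the trailing dot
while ! /usr/bin/snmpbulkwalk -Cr10 -v2c -c COMMUNITY -OQUs \
    -m HOST-RESOURCES-MIB -M /opt/librenms/mibs HOSTNAME hrProcessorLoad \
    | grep -q 'hrProcessorLoad\.'; do
  sleep 5
done
echo "rows appeared after $(( $(date +%s) - t0 )) seconds"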
I did some rough digging, and this might be related to Linux kernel v6.7 or greater (these particular systems are running 6.8.0-48-generic Ubuntu SMP).
Considering these results, this doesn’t seem to be a LibreNMS matter. But if you have any ideas/recommendations for stop-gap solutions, I would LOVE to hear them!
A touch annoying, since this is an LTS release of Ubuntu Server and I seem to yet again be the first one to find an obscure issue lol. (I have a habit of doing that.)
Actually I don’t think this is related to Linux Kernel v6.7 or greater.
I just had this happen for a host running Linux Kernel 5.15.
I’d consider this still an open problem, and I can’t determine the cause. After rebooting systems, I generally have to manually check them and re-run discovery to get CPU metrics back.
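For reference, the manual re-discovery I fall back on is roughly this (assuming a standard /opt/librenms install; -m limits it to the processors module):

sudo -u librenms /opt/librenms/discovery.php -h HOSTNAME -m processors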
It’s definitely not a regression on our side if SNMP just isn’t returning the data, as you’ve shown earlier.
Discovery happens every 6 hours. I didn’t think we removed information on polling, but it’s been a while since I looked at that part of the code. If we don’t, then it’s odd that they are getting removed so quickly.
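With the stock cron setup, that interval comes from an entry like the one below (IIRC the dispatcher service schedules discovery on its own timer instead):

33 */6 * * * librenms /opt/librenms/discovery.php -h all >> /dev/null 2>&1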
Sure, and I hear you on that, but this is a very new and recent issue. The only thing that has changed in this scenario is that LibreNMS was updated.
I’ve rebooted my systems many times over the years I’ve been using LibreNMS and retained processor metrics afterward. So that raises the question: why is LibreNMS suddenly ditching the CPUs the moment it gets such results, and not re-enabling them the moment it starts getting CPU metrics again?
I’m trying to dig up whether there’s anything I can do to make the queried OID return results immediately after a daemon restart (or system reboot), but so far I’m not coming up with any solutions in that regard. And to me it’s logical to conclude that the variable here is LibreNMS itself.
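One more diagnostic that comes to mind (same placeholders as earlier, v2c shown for brevity): check right after a restart whether the processor rows are missing from the hrDevice table entirely, or whether only the load column is empty:

/usr/bin/snmpbulkwalk -Cr10 -v2c -c COMMUNITY -OQUs -m HOST-RESOURCES-MIB -M /opt/librenms/mibs HOSTNAME hrDeviceDescr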
I even checked the snmpd versions across the affected systems and they differ, so at this point I’m less convinced this is a regression in snmpd itself.
Honestly, this is expected behavior on LibreNMS’ part. The device says no processors exist.
Possibly what changed is that LibreNMS is now faster at rediscovering down devices when they come back up, and discovers them before snmpd on the device reports any processors. LibreNMS could only work around the issue by adding a delay before discovering.
I don’t agree that it’s expected at all. I’ve been using LibreNMS for over 5 years now, and this is the first time this has ever happened. Completely UNexpected.
Why would it suddenly start happening to all the systems I have, across very different versions of Linux? This is a very recent effect.
I’m not saying it is desirable, just that it is expected. Device says no processors exist, then as far as LibreNMS knows, they don’t exist. To fix the issue, we have to somehow make LibreNMS not believe the device in this specific instance.
IMO, it is a timing issue, which is why it wasn’t occurring before. But unfortunately, I don’t have time to dig in and see exactly what is happening. I suspect you are using the dispatcher service, is that correct?
I have the LibreNMS SNMP Poller Service running; I’m not sure if that constitutes the “Dispatcher Service”.
I’ve tried to periodically keep on top of the best practices in the LibreNMS documentation, so I think I’m pretty close (if not ideal?) to what the docs recommend. But I’m all for doing a better job on my end.
As for a timing issue, that may be the case; however, nothing comes to mind as a significant environmental change that overlaps with when this issue started happening. For instance, I haven’t upgraded the underlying hardware it runs on or anything like that.
I am trying to stay as objective as I can here, so sorry if I stray from that path at all (feel free to point out any areas where I might be straying). The “tea leaves” (evidence?) I have in hand don’t add up: “expectations” vs. “reality” don’t match.
If there is more I can do on my end to help you folks help me, I’ll do what I can. Info gathering? Config tuning? Sharing deep environment/architecture details? I’ll help, so long as it’s not sensitive data (which I know you’re not asking for here).
In this particular environment I am the sole admin and architect, so I can go as deep or shallow as warranted.
Hmm, there was a poller service that was deprecated and removed long ago; I hope that is not what you are running.
If you are using the dispatcher service and an offline device misses a discovery, I think it will be discovered right away when it comes back up. If it doesn’t miss a discovery, it will just be discovered on the normal interval. laf was trying to test whether the poller was causing this issue, but I’m not sure he came to a satisfactory conclusion. Personally, I think discovery is causing it, but I could be wrong.
If we can properly characterize the issue, it will be easier to come up with a workaround. Right now, with the current info, I don’t see many options to work around this on the LibreNMS side.