I’m provisioning some new Ubuntu 24.04 VMs, and naturally I need to reboot them a few times. I’ve already added them to LibreNMS via SNMP, but when they get rebooted (at least so far) they suddenly lose their CPUs in LibreNMS until I tell it to rediscover the damned things.
The Event Log for each host shows entries along these lines:

Processor Removed: hr 196610 QEMU Virtual version 2.5+

Each one appears within about 8 seconds of the preceding “Device rebooted after…” entry.
The hypervisor the VMs run on is Proxmox VE (Linux KVM), and their CPU type is set to “x86-64-v2-AES”.
I really don’t know why this is happening and I’m not sure what to do about it.
Validate doesn’t show any errors, and as far as I know no other hosts monitored in the same LibreNMS instance are doing this.
I can only assume that the index is changing for each processor we detect, which shouldn’t happen. Run:

/usr/bin/snmpbulkwalk -Cr10 -v2c -c COMMUNITY -OQUs -m HOST-RESOURCES-MIB -M /opt/librenms/mibs HOSTNAME hrProcessorLoad

Replace COMMUNITY and HOSTNAME as appropriate. Then reboot the VM, re-run it, and compare the index values you see after the “.” and before the “=”.
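Hypothetically, the comparison might look something like this (made-up load values; what matters is whether the indexes after the dot change across the reboot):

# before reboot
hrProcessorLoad.196608 = 3
hrProcessorLoad.196609 = 1

# after reboot
hrProcessorLoad.196610 = 0
hrProcessorLoad.196611 = 2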
I have v2c disabled (as it should be). Apart from changing the protocol declaration and adding the auth parameters, is there anything else I should adjust for that?
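For the record, the v3 equivalent I’d run looks something like this (security level, protocols, and credentials are placeholders; match them to your snmpd config):

/usr/bin/snmpbulkwalk -Cr10 -v3 -l authPriv -u SNMPUSER -a SHA -A AUTHPASS -x AES -X PRIVPASS -OQUs -m HOST-RESOURCES-MIB -M /opt/librenms/mibs HOSTNAME hrProcessorLoad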
Also, this happened again for three more VMs I just provisioned with generally the same CPU type settings, etc.; after rebooting them, the CPUs were lost from their device pages in the LibreNMS web GUI.
Actually, one minute: the metrics come back after the snmpd daemon has been running for at least one minute, whether that’s from a host reboot or a daemon restart.
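A quick loop like this makes the window easy to measure (same COMMUNITY/HOSTNAME placeholders as earlier, shown with v2c for brevity; I’d swap in the v3 flags):

# run right after restarting snmpd on the VM
t0=$(date +%s)
# an instance line looks like "hrProcessorLoad.196608 = 3", so match the trailing dot
while ! /usr/bin/snmpbulkwalk -Cr10 -v2c -c COMMUNITY -OQUs \
    -m HOST-RESOURCES-MIB -M /opt/librenms/mibs HOSTNAME hrProcessorLoad \
    | grep -q 'hrProcessorLoad\.'; do
  sleep 5
done
echo "rows appeared after $(( $(date +%s) - t0 )) seconds"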
I did some rough digging, and this might be related to Linux kernel v6.7 or greater (these particular systems are running 6.8.0-48-generic Ubuntu SMP).
Considering these results, this doesn’t seem to be a LibreNMS matter. But if you have any ideas/recommendations for stop-gap solutions, I would LOVE to hear them!
A touch annoying, since this is an LTS release of Ubuntu Server and I seem to yet again be the first one to find an obscure issue lol. (I have a habit of doing that.)
Actually I don’t think this is related to Linux Kernel v6.7 or greater.
I just had this happen for a host running Linux Kernel 5.15.
I’d consider this still an open problem, and I can’t determine the cause. After rebooting systems, I generally have to manually check them and re-run discovery to get CPU metrics back.
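For reference, the manual re-discovery I fall back on is roughly this (assuming a standard /opt/librenms install; -m limits it to the processors module):

sudo -u librenms /opt/librenms/discovery.php -h HOSTNAME -m processors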
It’s definitely not a regression on our side if SNMP just isn’t returning the data, as you’ve shown earlier.
Discovery happens every 6 hours. I didn’t think we removed information on polling, but it’s been a while since I looked at that part of the code. If we don’t, then it’s odd that they are getting removed so quickly.
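With the stock cron setup, that interval comes from an entry like the one below (IIRC the dispatcher service schedules discovery on its own timer instead):

33 */6 * * * librenms /opt/librenms/discovery.php -h all >> /dev/null 2>&1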
Sure, and I hear you on that, but this is a very new and recent issue. The only thing that has changed in this scenario is that LibreNMS was updated.
I’ve rebooted my systems many times over the years I’ve been using LibreNMS and retained processor metrics afterward. So that raises the question: why is LibreNMS suddenly ditching the CPUs the moment it gets such results, and not re-enabling them the moment it starts getting CPU metrics again?
I’m trying to dig up whether there’s anything I can do to make the queried OID return results immediately after a daemon restart (or system reboot), but so far I’m not coming up with any solutions in that regard. And to me it’s logical to conclude that the variable here is LibreNMS itself.
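One more diagnostic that comes to mind (same placeholders as earlier, v2c shown for brevity): check right after a restart whether the processor rows are missing from the hrDevice table entirely, or whether only the load column is empty:

/usr/bin/snmpbulkwalk -Cr10 -v2c -c COMMUNITY -OQUs -m HOST-RESOURCES-MIB -M /opt/librenms/mibs HOSTNAME hrDeviceDescr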
I even checked the snmpd versions across the affected systems and they differ, so at this point I’m less convinced this is a regression in snmpd itself.
Honestly, this is expected behavior on LibreNMS’ part. The device says no processors exist.
Possibly what changed is that LibreNMS is now faster at rediscovering down devices when they come back up, and discovers them before snmpd on the device reports any processors. LibreNMS could only work around the issue by adding a delay before discovering.
I don’t agree that it’s expected at all. I’ve been using LibreNMS for over 5 years now, and this is the first time this has ever happened. Completely UNexpected.
Why would it suddenly start happening to all the systems I have, across very different versions of Linux? This is a very recent effect.
I’m not saying it is desirable, just that it is expected. Device says no processors exist, then as far as LibreNMS knows, they don’t exist. To fix the issue, we have to somehow make LibreNMS not believe the device in this specific instance.
IMO, it is a timing issue, which is why it wasn’t occurring before. But unfortunately, I don’t have time to dig in and see exactly what is happening. I suspect you are using the dispatcher service, is that correct?
I have the LibreNMS SNMP Poller Service running; I’m not sure if that constitutes the “Dispatcher Service”.
I’ve tried to periodically keep on top of the best practices in the LibreNMS documentation, so I think I’m pretty close (if not ideal?) to what the docs recommend. But I’m all for doing a better job on my end.
As for a timing issue, that may be the case; however, nothing comes to mind as a significant environmental change that overlaps with when this issue started happening. For instance, I haven’t upgraded the underlying hardware it runs on or anything like that.
I am trying to stay as objective as I can here, so sorry if I stray from that path at all (feel free to point out any areas where I might be straying). The “tea leaves” (evidence?) I have in hand don’t add up: “expectations” vs. “reality” don’t match.
If there is more I can do on my end to help you folks help me, I’ll do what I can. Info gathering? Config tuning? Sharing deep environment/architecture details? I’ll help, so long as it’s not sensitive data (which I know you’re not asking for here).
In this particular environment I am the sole admin and architect, so I can go as deep or shallow as warranted.
Hmm, there was a poller service that was deprecated and removed long ago; I hope that is not what you are running.
If you are using the dispatcher service and an offline device misses a discovery, I think it will be discovered right away when it comes back up. If it doesn’t miss a discovery, it will just be discovered on the normal interval. laf was trying to test whether the poller was causing this issue, but I’m not sure he came to a satisfactory conclusion. Personally, I think discovery is causing it, but I could be wrong.
If we can properly characterize the issue, it will be easier to come up with a workaround. Right now, with the current info, I don’t see many options to work around this on the LibreNMS side.