LibreNMS 23.4.0 on CentOS 7: high CPU and long polling time for 1000 devices

Magd_nuh · 6 April 2023 19:47

After I updated to the latest version 23 release on CentOS 7, CPU usage went up. Running ./validate on the command line shows no errors, but the validate page in the web GUI shows this error: FAIL: No composer available, please install composer
Is there anything I can do to reduce the long poller time? Many of my graphs have gaps. My poller-wrapper is set to 24, and the server has 20 CPU cores and 45G of RAM. I have 1800 devices and my CPU is at 100%.

electrocret · 6 April 2023 20:53

If you’re running the daily update channel (Global Settings > System > Updates > Update Channel), there was a recent bug fix (PR#14894) that fixes Libre not respecting max_oid in SNMP queries. Since that fix was applied, I’ve noticed more CPU usage on my pollers (though with shorter poller time per device). On a system that is already very busy, like yours, the performance cost of Libre breaking up SNMP queries could be very detrimental.

The easy solution would be to increase the max_oid setting (Global Settings > Poller > SNMP > Max OIDs).

The better solution would be to investigate distributed polling for your Libre.

Magd_nuh · 6 April 2023 22:27

What is the recommended limit to use for the Max OIDs SNMP setting? Thanks.

electrocret · 8 April 2023 01:29

High enough that your Libre CPU returns to normal?

Libre hasn’t been limited on OIDs until the bug was fixed. So if you haven’t had issues polling devices until now you shouldn’t have an issue setting it to a high number. The OID limit is largely for lower end equipment, or devices where Libre gets a lot of data points. It’s to prevent overwhelming the device.

If I were in your position, I’d increment it by 10 and wait half an hour. If the CPU hasn’t gotten better, repeat until it does.

If you’re not concerned about overwhelming your equipment, just set the max_oid to a very high value.

Magd_nuh · 8 April 2023 02:27

Thanks a lot. I restored the old version of LibreNMS from before the update (version 21) and everything is back to normal. I use it to monitor my clients' devices, so I couldn't leave it down for long. There were so many gaps and dropped graphs that I decided to go back to the old version. It is now polling 1600 devices with a polling time of 480 seconds; after the update it was 900 to 1100.
Thanks for your help.

joeschmo · 10 April 2023 14:20

To clarify I think this is regarding changes between 23.2.0 → 23.4.0.

I am also running a fully updated CentOS 7 VM, with about 1100 devices. It always ran pretty hot but was able to poll all devices without falling behind.

As of the 23.4.0 release, it no longer is able to poll all of my devices. I spent the greater part of the weekend troubleshooting and have not found a solution.

I disabled polling on all devices, then did a reboot. Over the next 6 hours or so I slowly enabled polling on devices in groups of 50. After each group was added I waited until all devices showed up as being polled before continuing. By the time I reached 825 devices (just 75% of my full list) the server CPU was pegged at 100%. Based on this alone 23.4.0 requires about 25% more resources than the previous version.

I also saw this post during troubleshooting and tried everything I could with the MAX OID SNMP setting, with values from 20 to 500. The lower numbers made no difference; the higher numbers actually caused my server to slowly fall behind on polling and eventually get hung at 100% CPU utilization.

It would be good to get an acknowledgement of “We are aware and looking at it” or at least “This is the new normal” because a 25% performance impact is a pretty major issue IMO.

BTW: I have 2 VMs with identical setup, polling the same set of devices for the sake of redundancy and they both encountered this issue at the same time.

murrant · 10 April 2023 14:55

I’m not able to reproduce this on any of my installs.

So we need anyone having this issue to do the troubleshooting until we can find the issue.

murrant · 10 April 2023 16:46

Thanks to @joeschmo I was able to find the bug.

Fixed here, but not released to stable yet: Fix SnmpQuery and max_oid by murrant (Pull Request #14955, librenms/librenms on GitHub).
You can apply that now with ./scripts/github-apply 14955, but you will have to remove it later to update.

The bug was such that the higher your max_oid was, the worse it was. Basically anything less than max_oid was queried individually instead of together.
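
To make that failure mode concrete, here is a toy model in Python. It is not the LibreNMS code: the function names and the OID list are made up for illustration, and `buggy_batched` only mimics the behavior described above (a list shorter than max_oid gets sent one OID per request).

```python
def batched(oids, max_oid):
    """Intended behavior: split the OID list into requests of at most max_oid OIDs."""
    if max_oid < 1:
        raise ValueError("max_oid must be at least 1")
    return [oids[i:i + max_oid] for i in range(0, len(oids), max_oid)]


def buggy_batched(oids, max_oid):
    """Hypothetical model of the bug: a list shorter than max_oid is
    queried one OID per request instead of as a single request."""
    if len(oids) < max_oid:
        return [[oid] for oid in oids]
    return batched(oids, max_oid)


# 30 made-up OIDs, compared at a low and a high max_oid setting
oids = [f"ifInOctets.{i}" for i in range(1, 31)]
for max_oid in (10, 50):
    intended = len(batched(oids, max_oid))
    buggy = len(buggy_batched(oids, max_oid))
    print(f"max_oid={max_oid}: intended requests={intended}, buggy requests={buggy}")
```

With 30 OIDs, the intended batching needs 3 requests at max_oid 10 and 1 request at max_oid 50. In this model the buggy path also needs 3 requests at max_oid 10, but 30 requests at max_oid 50, which lines up with a higher max_oid making things worse.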

joeschmo · 11 April 2023 13:55

Confirmed that PR14955 corrected the high CPU utilization issue I was seeing in 23.4.0. My system is back to 23.2.0 performance after applying the patch. Thank you @murrant!

SantiagoSilvaZ · 12 April 2023 22:37

Hi,

My LibreNMS has been at 100% CPU since the version update.

@murrant, I use the Monthly version. Will a 23.4.1 version be released with this PR, or do I have to wait for the 23.5.0 version?

system · 11 July 2023 22:37

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.