High CPU usage

Trying to fix an issue where a few (4) HPE switches are reporting high CPU usage (>80%) in LibreNMS. When I verify with show system info on the switch, the CPU utilization is <10%.
And when I use snmpget from the same LibreNMS server with the same OID, I get <10%.

From debug:
Attempting to initialize OS: procurve

OS initilized as Generic

SQL[SELECT * FROM processors WHERE device_id=? [90] 0.74ms]

SNMP['/usr/bin/snmpget' '-v2c' '-c' 'COMMUNITY' '-OUQn' '-M' '/opt/librenms/mibs:/opt/librenms/mibs/hp' '-t' '15' '-r' '3' 'udp:HOSTNAME:161' '.1.3.6.1.4.1.11.2.14.11.5.1.9.6.1.0']

.... = 95

array (
'.1.3.6.1.4.1.11.2.14.11.5.1.9.6.1.0' => '95',
)

95%

From server:
root@netm:/opt/librenms# snmpget -v 2c -c public -OUQn -t 15 -r 3 HOSTNAME .1.3.6.1.4.1.11.2.14.11.5.1.9.6.1.0

.1.3.6.1.4.1.11.2.14.11.5.1.9.6.1.0 = 0
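For comparison, this is the exact command the poller ran per the debug above (COMMUNITY and HOSTNAME are the redacted placeholders), in case the MIB path or the udp: transport prefix makes a difference:

# Same flags the LibreNMS poller used, run by hand from the server
/usr/bin/snmpget -v2c -c COMMUNITY -OUQn -M /opt/librenms/mibs:/opt/librenms/mibs/hp -t 15 -r 3 udp:HOSTNAME:161 .1.3.6.1.4.1.11.2.14.11.5.1.9.6.1.0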

./validate.php

====================================

Component | Version
LibreNMS  | 1.47-23-g05458c006
DB Schema | 279
PHP       | 7.0.33-0+deb9u1
MySQL     | 10.1.26-MariaDB-0+deb9u1
RRDTool   | 1.6.0
SNMP      | NET-SNMP 5.7.3

====================================

[OK] Composer Version: 1.8.0
[OK] Dependencies up-to-date.
[OK] Database connection successful
[OK] Database schema correct
[WARN] Your install is over 24 hours out of date, last update: Wed, 09 Jan 2019 12:29:22 +0000
[FIX]:
Make sure your daily.sh cron is running and run ./daily.sh by hand to see if there are any errors.
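(For reference, on a standard install under /opt/librenms, the check validate.php is suggesting would look roughly like this:)

# Confirm the LibreNMS cron jobs are in place (daily.sh should be listed here)
cat /etc/cron.d/librenms

# Run the daily update/cleanup script by hand and watch for errors
cd /opt/librenms && ./daily.sh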

That's not supposed to happen, is it?

Hi,

A reason might be the polling itself. Depending on how the switch calculates CPU usage (instantaneous or averaged), the high number of SNMP requests can in fact create a peak at each and every poll. So the LibreNMS graph will show a flat 80%, whereas the real graph would be an 80% peak at each poll and a 10% average the rest of the time.
Unfortunately, there is no easy way to avoid this bias. And the peak load is a real peak of CPU usage, so it is still interesting to know it happened…
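If you want to check whether the poll itself is causing the spike, a rough test from the LibreNMS server (COMMUNITY and HOSTNAME are placeholders for your community string and switch) is to hit the CPU OID in a quick burst, the way a poller run does, then once more after the switch has been quiet for a minute:

# Burst of queries to mimic a poller run hammering the agent
for i in $(seq 1 20); do
  snmpget -v2c -c COMMUNITY -OUQn udp:HOSTNAME:161 .1.3.6.1.4.1.11.2.14.11.5.1.9.6.1.0
done

# Single query after a minute of quiet, for the "idle" value
sleep 60
snmpget -v2c -c COMMUNITY -OUQn udp:HOSTNAME:161 .1.3.6.1.4.1.11.2.14.11.5.1.9.6.1.0

If the burst climbs toward 80–95% while the single query stays around 10%, the graph is mostly showing the cost of being polled.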

PipoCanaja

How would I go about fixing that?

I checked other switches of the same type that don't have this issue; the same "OS initilized as Generic" line was in their output.

I have around 30 switches of the same type with CPU at ~10%; only 4 have this "high CPU" issue.

Are they the same model? Maybe they have a less powerful CPU and get more stress from the SNMP polls? Maybe they have more ports and get more stress from the SNMP polls… etc.

I have this issue on some devices, even on fairly expensive Cisco chassis… It really depends on how the SNMP replies are prioritized in the OS of the device, and how the CPU usage is calculated.

Same model, same port count, same OS version, etc.
From reviewing the historic graphs for CPU usage, it seems that it happened after upgrading from 1.3x to 1.4x a couple of months back.

Interesting. Same discovery and poller modules loaded, etc.?

Not sure which version was on before, but I have only one server, which was upgraded a couple of months back. So all the discovery and poller modules will be the same.

I mean on the devices; you can enable/disable modules per device. Do they all have the same modules activated?
Sometimes the device has its SNMP code crashed in some way. If you have an opportunity to reload one of the 4 culprits, you can also try that…
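Before scheduling a reload, a quick sanity check of the agent from the LibreNMS server can't hurt (COMMUNITY and HOSTNAME are placeholders; sysUpTime/sysDescr are standard MIB-II objects, so a sane, fast reply here suggests the agent itself is still alive):

snmpget -v2c -c COMMUNITY HOSTNAME sysUpTime.0 sysDescr.0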

Those are the ideas I have so far.

Any idea what this is set to by default? And can I manually run the command to see each of these results?

In /opt/librenms (or wherever you installed LibreNMS):
./discovery.php -h xxx
(with xxx being the device id)
You can add “-v” for verbosity and “-d” for debug.

Default values are visible from the GUI (in the “module” list of the device)
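For example, to re-run only the processors discovery with debug output against one of the affected switches (replace <device_id> with yours):

./discovery.php -h <device_id> -m processors -d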

I'm experiencing the same issue after an upgrade to one of our ProCurve switch stacks. Did you manage to resolve this issue, Ringo?

Nope… I was going to upgrade the 3800 firmware on the stack with this issue, but I don't have a change window to do so yet. Just wondering which switch / firmware version you have this issue on.

Same here; it's only a 3800 stack that seems to be affected, on both 15.18.0013 and 16.02.0022m.
We came from 15.12.0010, which didn't seem to have the issue, but I haven't gone back as it's one of our core switches.
It's weird, because if you query the OID 1.3.6.1.4.1.11.2.14.11.5.1.9.6.1.0 directly it shows the correct value, and running a debug discovery as above seems to suggest that it's getting its data from there.

YAML Discovery Data: Array
(
[data] => Array
(
[0] => Array
(
[oid] => hpSwitchCpuStat
[num_oid] => .1.3.6.1.4.1.11.2.14.11.5.1.9.6.1.{{ $index }}
[type] => procurve-fixed
)

    )

)
Data hpSwitchCpuStat: Array
(
[0] => Array
(
[hpSwitchCpuStat] => 9
)

)
Discovered LibreNMS\Device\Processor Array
(
[processor_id] =>
[entPhysicalIndex] => 0
[hrDeviceIndex] => 0
[device_id] => 163
[processor_oid] => .1.3.6.1.4.1.11.2.14.11.5.1.9.6.1.0
[processor_index] => 0
[processor_type] => procurve-fixed
[processor_usage] => 9
[processor_descr] => Processor
[processor_precision] => 1
[processor_perc_warn] => 75
)
SQL[SELECT * FROM processors WHERE device_id=? AND processor_index=? AND processor_type=? [163,"0","procurve-fixed"] 0.47ms]
.SQL[SELECT * FROM processors WHERE device_id=? AND processor_index=? AND processor_type=? [163,"0","procurve-fixed"] 0.33ms]
.SQL[SELECT * FROM processors WHERE device_id=? AND processor_id NOT IN (?,?) [163,73,73] 0.29ms]
SQL[DELETE T FROM processors T LEFT JOIN devices ON devices.device_id = T.device_id WHERE devices.device_id IS NULL 0.53ms]

Runtime for discovery module 'processors': 15.1600 seconds with 486480 bytes
SNMP: [5/14.95s] MySQL: [4/0.00s] RRD: [0/0.00s]

Unload disco module processors

Just to add, I also set up PRTG as a test, auto-discovered the switch, and it also displayed the correct values.

I have 3800 stacks that don't have the issue and a few stacks that do, so I don't think it's a switch firmware issue. I started seeing this after I upgraded LibreNMS. All my 3800 stacks are on 16.02.0020.