High CPU usage


#1

Trying to fix an issue where some (4) HPE switches are reporting high CPU usage (>80%) in LibreNMS, but when I verify with show system info on the switch, the CPU utilization is <10%.
And when I use snmpget from the same LibreNMS server with the same OID, I also get <10.

From debug:
Attempting to initialize OS: procurve

OS initilized as Generic

SQL[SELECT * FROM processors WHERE device_id=? [90] 0.74ms]

SNMP['/usr/bin/snmpget' '-v2c' '-c' 'COMMUNITY' '-OUQn' '-M' '/opt/librenms/mibs:/opt/librenms/mibs/hp' '-t' '15' '-r' '3' 'udp:HOSTNAME:161' '.1.3.6.1.4.1.11.2.14.11.5.1.9.6.1.0']

.... = 95

array (
'.1.3.6.1.4.1.11.2.14.11.5.1.9.6.1.0' => '95',
)

95%

From server:
[email protected]:/opt/librenms# snmpget -v 2c -c public -OUQn -t 15 -r 3 HOSTNAME .1.3.6.1.4.1.11.2.14.11.5.1.9.6.1.0

.1.3.6.1.4.1.11.2.14.11.5.1.9.6.1.0 = 0

./validate.php

====================================

Component Version
LibreNMS 1.47-23-g05458c006
DB Schema 279
PHP 7.0.33-0+deb9u1
MySQL 10.1.26-MariaDB-0+deb9u1
RRDTool 1.6.0
SNMP NET-SNMP 5.7.3

====================================

[OK] Composer Version: 1.8.0
[OK] Dependencies up-to-date.
[OK] Database connection successful
[OK] Database schema correct
[WARN] Your install is over 24 hours out of date, last update: Wed, 09 Jan 2019 12:29:22 +0000
[FIX]:
Make sure your daily.sh cron is running and run ./daily.sh by hand to see if there are any errors.
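
For reference, acting on that warning would look something like this (assuming the default cron file location from the install docs; adjust paths if yours differ):

# confirm the LibreNMS cron file exists and the daily.sh entry is not commented out
cat /etc/cron.d/librenms
# run the update by hand and watch for errors
cd /opt/librenms && ./daily.sh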


#2

That's not supposed to happen?


#3

Hi,

A reason might be the polling itself. Depending on how the switch calculates the CPU usage (instantaneous or averaged), the burst of SNMP requests can itself create a peak at each and every poll. So the LibreNMS graph will show a flat 80% whereas the real picture is an 80% peak at each poll and a 10% average the rest of the time.
Unfortunately, there is no easy way to avoid this bias. And the peak load is a real peak of CPU usage, so it is still interesting to know it happened …
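
One way to check this on one of the affected switches (HOSTNAME and COMMUNITY are placeholders here) is to sample the OID in a loop while a poll runs, something like:

while true; do
  # timestamp plus the instantaneous CPU value the switch reports
  echo -n "$(date +%T) "
  snmpget -v2c -c COMMUNITY -OUQn HOSTNAME .1.3.6.1.4.1.11.2.14.11.5.1.9.6.1.0
  sleep 2
done

If the value only jumps to 80-95 around each poller run, then the graph is really showing the load created by the polling itself.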

PipoCanaja


#4

How would I go about fixing that?


#5

I checked other switches of the same type that don't have this issue, and the same "OS initilized as Generic" line was in their output.


#6

I have around 30 switches of the same type with CPU at ~10%, only 4 have this “high CPU” issue.


#7

Are they the same model? Maybe they have a less powerful CPU and get more stress from the SNMP polls? Maybe they have more ports and get more stress from the SNMP polls … etc.

I have this issue on some devices, even on fairly expensive Cisco chassis … It really depends on how SNMP replies are prioritized in the OS of the device, and how the CPU usage is calculated.


#8

Same model, same port count, same OS version, etc.
From reviewing the historic CPU usage graphs, it seems it started happening after upgrading from 1.3x to 1.4x a couple of months back.


#9

Interesting. Same discovery and poller modules loaded etc etc?


#10

Not sure which version it was on before, but I only have one LibreNMS server (which was upgraded a couple of months back), so the discovery and poller modules will be the same for all of them.


#11

I mean on the devices: you can enable/disable modules per device. Do they all have the same modules activated?
Sometimes a device's SNMP code has crashed in some way. If you have the opportunity to reload one of the 4 culprits, you can also try that …

Those are the ideas I have so far.


#12

Any idea what this is set to by default? And can I manually run the command to see the result for each one?


#13

In /opt/librenms (or wherever you installed LibreNMS):
./discovery.php -h xxx
(with xxx being the device id)
You can add “-v” for verbosity and “-d” for debug.

Default values are visible from the GUI (in the “module” list of the device)
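
If you only want to look at the CPU part, you can also limit the run to a single module, something like:

cd /opt/librenms
./discovery.php -h xxx -d -m processors

("-m" restricts the run to the listed discovery module, which keeps the debug output readable.)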


#14

I'm experiencing the same issue with an upgrade to one of our ProCurve switch stacks. Did you manage to resolve this issue, Ringo?


#15

Nope … I was going to upgrade the 3800 firmware on the stack with this issue, but I don't have a change window to do so yet. Just wondering which firmware version / switch model you're seeing this issue on.


#16

Same here; it's only a 3800 stack that seems to be affected, on both 15.18.0013 and 16.02.0022m.
We came from 15.12.0010, which didn't seem to have the issue, but I haven't gone back as it's one of our core switches.
It's weird, as querying the OID 1.3.6.1.4.1.11.2.14.11.5.1.9.6.1.0 directly shows the correct value, and running a debug discovery as above seems to suggest that it's getting its data from there.

YAML Discovery Data: Array
(
    [data] => Array
        (
            [0] => Array
                (
                    [oid] => hpSwitchCpuStat
                    [num_oid] => .1.3.6.1.4.1.11.2.14.11.5.1.9.6.1.{{ $index }}
                    [type] => procurve-fixed
                )

        )

)
Data hpSwitchCpuStat: Array
(
    [0] => Array
        (
            [hpSwitchCpuStat] => 9
        )

)
Discovered LibreNMS\Device\Processor Array
(
    [processor_id] =>
    [entPhysicalIndex] => 0
    [hrDeviceIndex] => 0
    [device_id] => 163
    [processor_oid] => .1.3.6.1.4.1.11.2.14.11.5.1.9.6.1.0
    [processor_index] => 0
    [processor_type] => procurve-fixed
    [processor_usage] => 9
    [processor_descr] => Processor
    [processor_precision] => 1
    [processor_perc_warn] => 75
)
SQL[SELECT * FROM processors WHERE device_id=? AND processor_index=? AND processor_type=? [163,"0","procurve-fixed"] 0.47ms]
.SQL[SELECT * FROM processors WHERE device_id=? AND processor_index=? AND processor_type=? [163,"0","procurve-fixed"] 0.33ms]
.SQL[SELECT * FROM processors WHERE device_id=? AND processor_id NOT IN (?,?) [163,73,73] 0.29ms]
SQL[DELETE T FROM processors T LEFT JOIN devices ON devices.device_id = T.device_id WHERE devices.device_id IS NULL [] 0.53ms]

Runtime for discovery module 'processors': 15.1600 seconds with 486480 bytes
SNMP: [5/14.95s] MySQL: [4/0.00s] RRD: [0/0.00s]

Unload disco module processors
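
For comparison, you could also run just the processors poller in debug against the same stack (163 being the device id from the discovery output above); that should show the exact snmpget the poller issues, and the value it stores at poll time can then be checked against a manual snmpget run a few seconds later:

cd /opt/librenms
./poller.php -h 163 -d -m processors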


#17

Just to add, I also set up PRTG as a test, auto-discovered the switch, and it also displayed the correct values.


#18

I have 3800 stacks that don't have the issue and a few stacks that do, so I don't think it's a switch firmware issue. I started seeing this after I upgraded LibreNMS. All my 3800 stacks are on 16.02.0020.