V1-63 High CPU on arista switches with BGP peers module enabled

kzaykov · 5 May 2020 10:06

Hi All.

I follow monthly updates, and the last one was last week.

Today I have noticed that bunch of Arista switches are polled the way too slow and snmpd starves the CPU on them.

#bash top

top - 09:54:09 up 313 days,  6:35,  2 users,  load average: 0.81, 0.99, 1.33
Tasks: 331 total,   2 running, 329 sleeping,   0 stopped,   0 zombie
%Cpu(s): 29.7 us,  1.2 sy,  0.0 ni, 67.5 id,  0.0 wa,  0.6 hi,  0.9 si,  0.0 st
KiB Mem:  32561636 total,  8648684 used, 23912952 free,   404480 buffers
KiB Swap:        0 total,        0 used,        0 free,  3427984 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND                                                                                                                                                                                                           
 4101 root      20   0  979m 972m 5888 R  99.2  3.1   6474:19 snmpd

I see timeouts and slow response for bgp-peer module:

>> Runtime for poller module 'bgp-peers': 388.0714 seconds with 269376 bytes
>> SNMP: [104/387.35s] MySQL: [291/0.52s] RRD: [406/0.10s]
#### Unload poller module bgp-peers ####

kzaykov · 5 May 2020 10:07

it looks like that

SNMP['/usr/bin/snmpget' '-v2c' '-c' 'COMMUNITY' '-OQUs' '-m' 'ARISTA-BGP4V2-MIB' '-M' '/opt/librenms/mibs:/opt/librenms/mibs/arista' 'udp:HOSTNAME:161' 'aristaBgp4V2PeerState.1.2.16.32.1.13.245.184.0.187.0.0.0.0.3.41.52.0.4' 'aristaBgp4V2PeerAdminStatus.1.2.16.32.1.13.245.184.0.187.0.0.0.0.3.41.52.0.4' 'aristaBgp4V2PeerInUpdates.1.2.16.32.1.13.245.184.0.187.0.0.0.0.3.41.52.0.4' 'aristaBgp4V2PeerOutUpdates.1.2.16.32.1.13.245.184.0.187.0.0.0.0.3.41.52.0.4' 'aristaBgp4V2PeerInTotalMessages.1.2.16.32.1.13.245.184.0.187.0.0.0.0.3.41.52.0.4' 'aristaBgp4V2PeerOutTotalMessages.1.2.16.32.1.13.245.184.0.187.0.0.0.0.3.41.52.0.4' 'aristaBgp4V2PeerFsmEstablishedTime.1.2.16.32.1.13.245.184.0.187.0.0.0.0.3.41.52.0.4' 'aristaBgp4V2PeerInUpdatesElapsedTime.1.2.16.32.1.13.245.184.0.187.0.0.0.0.3.41.52.0.4' 'aristaBgp4V2PeerLocalAddr.1.2.16.32.1.13.245.184.0.187.0.0.0.0.3.41.52.0.4' 'aristaBgp4V2PeerLastErrorCodeReceived.1.2.16.32.1.13.245.184.0.187.0.0.0.0.3.41.52.0.4' 'aristaBgp4V2PeerLastErrorSubCodeReceived.1.2.16.32.1.13.245.184.0.187.0.0.0.0.3.41.52.0.4' 'aristaBgp4V2PeerLastErrorReceivedText.1.2.16.32.1.13.245.184.0.187.0.0.0.0.3.41.52.0.4']
Timeout: No Response from udp:device:161.```

kzaykov · 5 May 2020 11:04

@murrant Hi Tony,
please take a look. seems like a bug introduced in 1.63 release.

It perfectly correlates with running the update:

-rw-rw-r--.  1 librenms librenms    371503 Apr 29 12:04 daily.log-20200430

murrant · 5 May 2020 13:48

https://github.com/librenms/librenms/pull/11424 could it be this?

murrant · 5 May 2020 13:50

Also, that very much looks like a leak of some type on the device. Have you tried updating the firmware/contacting the vendor?

kzaykov · 5 May 2020 15:22

We have indeed, no news so far. But different platforms, SW affected.
7020SR
7280SR
7280QR
And EOS:
4.21.6F
4.21.5F
4.21.8M
4.21.3F
4.22.0F

What we have also tried is to downgrade one poller to 1.62 and it doesn’t help ((

kedare · 5 May 2020 15:31

Hmm this should not cause any load issue (the change in polling code is quite small and nothing intensive)

On my side I didn’t had any negative impact on our Arista platforms

7280SRA (4.19.5M)
7280SR (4.20.8M )
7508 (4.20.8M)
7508N (4.20.8M)

Do you have the same spikes when you disable bgp-peers module ?
If you downgraded the poller from before the BGP error code management then it’s probably something else.
Have you tried restarting the snmpd process ?

kzaykov · 5 May 2020 15:49

Yep, seems downgrade and restart snmpd on switches did the trick.
So should be a bug in EOS 4.21.x - 4.22.0

Update…
nope. it returns after a while

murrant · 5 May 2020 18:05

Would be great if you could narrow down a little more what is causing the issue.

Clearly it is a firmware bug, but maybe LibreNMS could work around it.

Damon_Reed · 5 May 2020 19:38

We just this week upgraded a large number of our Arista devices from 4.21.8M to 4.22.5M (version recommended to us by Arista support). We track librenms master manually and just updated last week, including the BGP polling updates in PR 11424.

I’m not seeing a change in the polling times on our 7280SRs. Everything is pretty steady before and after the librenms update and the Arista update.

My read of the attached graph gave me the impression the ports module was the one ramping up in execute time, but that could be an effect of the CPU drag vs the size of the SNMP port table.

kzaykov · 6 May 2020 07:38

seems like a memory leak:

affected device

#sh proc top once | i nmp
 3868 root      20   0 1429m 1.4g 5624 S   0.0  4.5   6871:28 snmpd
27637 root      20   0  787m 396m 201m S   0.0  1.2   1:18.88 Snmp

not affected device

#sh proc top once | i nmp
 2954 root      20   0  731m 368m 196m S   0.0  1.2 215:54.55 Snmp
 3851 root      20   0 16384 9500 6048 S   0.0  0.0  41:55.50 snmpd

disabling bgp peers and restarting both processes work. at least devices workarounded yesterday look stable, no CPU spikes and memory consumed by snmp doesn’t raise.

kzaykov · 8 June 2020 15:52

Arista has confirmed a bug. It is reproduced if SNMP MSS configured for a relatively small value (< 1500b in our case) and big amount of SNMP data is being transmitted (e.g. BGP module in place).

The fix for BUG484976 will be made available in the next upcoming maintenance releases in the 4.21/4.22 code train.