Poller performance tuning assistance

Hi

We’re having issues with our poller cron job; we’re repeatedly getting this message:

The poller (0a4c1d38ef12) has not completed within the last 5 minutes, check the cron job.

====================================
Component | Version
--------- | -------
LibreNMS  | 1.53.1-33-g3ead46254
DB Schema | 2019_05_30_225937_device_groups_rewrite (135)
PHP       | 7.3.5-1+ubuntu18.04.1+deb.sury.org+1
MySQL     | 10.3.15-MariaDB-1:10.3.15+maria~bionic
RRDTool   | 1.7.0
SNMP      | NET-SNMP 5.7.3
====================================

          | Total | Up    | Down  | Ignored | Disabled
--------- | ----- | ----- | ----- | ------- | --------
Devices   | 763   | 757   | 6     | 0       | 0
Ports     | 28533 | 14626 | 10363 | 0       | 2700
Services  | 6     | 5     | 1     | 0       | 0

I’m seeing a very high load average on the host, and CPU usage for each process is also high. My cron job configuration is below:

*/5 * * * *     librenms        . /etc/librenms_environment; /opt/librenms/cronic /opt/librenms/poller-wrapper.py 8 >> /dev/null 2>&1

Host VM:

CPU: 12 vCPU(s)
Memory: 16G

Poll times are ranging from 10 to 268.86 seconds. I’ve tried disabling modules and playing with the threads but can’t seem to find the magic number. Any recommendations would be great; any CPU I throw at the machine seems to just get chewed up.

Thanks,

Take a look at the poller logs and see if it’s only some devices that take a long time. Then, for each of those devices, check which module takes the longest and try disabling it. For example, I have seen port polling on a Brocade switch take 5 minutes, but with the ports module disabled the polling took only 10 seconds.
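If it helps, here is a rough way to find the slow devices and then the slow modules from the CLI. This is only a sketch: the database name is the LibreNMS default and the hostname is a placeholder, so adjust both to your setup.

# Ten slowest devices by last poll time (assumes the default "librenms" database)
mysql -u librenms -p librenms -e "SELECT hostname, last_polled_timetaken FROM devices ORDER BY last_polled_timetaken DESC LIMIT 10;"

# Then re-run a single module against one of them with debug output to see where the time goes
php /opt/librenms/poller.php -h switch01.example.com -m ports -d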

Is the poller in a docker container?

Hi @Elias ,

I have already checked the modules; it appears ports and processors are taking the longest, and we’d prefer to keep this data. Is there anything I can do there to tune?

@Chas Yes, it is running in a container.

Thanks,

Is this a new problem?

Is it the same devices which are not getting polled in time, or all devices?

Did you try increasing the threads, e.g. poller-wrapper.py 16? (Performance - LibreNMS Docs)
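For reference, that is just the last argument in your existing cron entry, e.g.:

*/5 * * * *     librenms        . /etc/librenms_environment; /opt/librenms/cronic /opt/librenms/poller-wrapper.py 16 >> /dev/null 2>&1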

Since the poller is working, but tuning it isn’t improving things, I would be looking elsewhere.

Try running atop on the box and check whether the CPU is waiting on disk.
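If atop isn’t on the box, iowait is visible with the usual tools as well (iostat comes from the sysstat package):

# watch %iowait on the CPU line and %util per disk
iostat -x 2 5

# or just watch the "wa" column
vmstat 2 5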

Run an MTR trace to some heavily polled devices to see if you have issues in the network.
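Something like this gives a one-shot report you can paste back here (the target IP is a placeholder):

# 100 cycles, report mode, wide hostnames
mtr -rwc 100 192.0.2.10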

Do you have any resource limits specified in your docker-compose files? Also check the load stats of the different containers.
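You can also confirm what limits the running container actually has (the container name librenms is an assumption here):

# 0 for NanoCpus / Memory means no limit is set
docker inspect -f '{{.HostConfig.NanoCpus}} {{.HostConfig.Memory}}' librenms

# one-shot load stats for all containers
docker stats --no-stream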

Try running an snmpwalk from the LibreNMS VM, then try it from a different box, ideally outside that environment, and measure it with the time command. See if it’s quicker :smiley:
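For example (community string, SNMP version and target are placeholders; match them to the device):

# walk the interfaces table and time it, from the LibreNMS VM and again from another box
time snmpwalk -v2c -c public 192.0.2.10 IF-MIB::ifTable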

Also check this, though I don’t know if it’s related: The poller has not completed within the last 5 minutes, check the cron job - #13 by Brian_Gibson

Hi Chas,

Thanks for replying,

I bumped it to 16 and that seemed to just hammer the CPU even more, so I rolled back to 8.

Checking top gives me the impression it’s not I/O wait:

Tasks: 874 total,  33 running, 841 sleeping,   0 stopped,   0 zombie
%Cpu(s): 89.1 us, 10.3 sy,  0.0 ni,  0.1 id,  0.0 wa,  0.0 hi,  0.6 si,  0.0 st
KiB Mem : 16415952 total,  5795812 free,  7740160 used,  2879980 buff/cache
KiB Swap:  4032488 total,  3202976 free,   829512 used.  8258468 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S   %CPU  %MEM     TIME+ COMMAND
 1316 999       20   0  569560 252304  24164 S 66.776 1.537   0:02.68 php
19276 999       20   0  569560 252536  23896 S 53.618 1.538   0:02.47 php
23253 999       20   0  569560 252464  23828 R 48.355 1.538   0:11.62 php
30038 999       20   0  569560 250616  23696 R 47.368 1.527   0:06.96 php
28492 999       20   0  569560 252516  23880 R 44.737 1.538   0:10.66 php
19227 999       20   0  569560 252732  24108 R 44.079 1.540   0:09.55 php
29158 999       20   0  569560 252776  24148 R 43.092 1.540   0:06.66 php
29475 999       20   0  569560 251972  23876 S 43.092 1.535   0:06.80 php
28605 999       20   0  569560 251488  23896 R 42.763 1.532   0:03.33 php
20469 999       20   0  569560 252436  24044 R 42.434 1.538   0:04.97 php
29217 999       20   0  569560 252784  24152 S 41.776 1.540   0:07.75 php
25963 999       20   0  569560 249836  23832 R 40.789 1.522   0:05.94 php
27202 999       20   0  569560 252736  24100 R 40.461 1.540   0:11.95 php
28724 999       20   0  569560 252392  24072 S 40.132 1.537   0:07.46 php
29911 999       20   0  569560 251112  23896 R 37.171 1.530   0:04.04 php
28076 999       20   0  569560 252488  23848 R 36.513 1.538   0:11.16 php
32511 999       20   0  569560 251704  24024 S 36.513 1.533   0:03.10 php
27830 999       20   0  569560 252652  24012 R 36.184 1.539   0:04.57 php
30209 999       20   0  569560 251800  24120 R 32.237 1.534   0:05.26 php
31468 999       20   0  512220 189492  23888 R 31.908 1.154   0:01.40 php
31724 999       20   0  569560 250792  23868 R 31.579 1.528   0:04.27 php

Doesn’t appear to be linked to that post, as the poller is completing, which suggests cron is running fine.

No resource limits imposed in docker:

CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
0a4c1d38ef12        librenms            1150.76%            7.615GiB / 15.66GiB   48.64%              2.49GB / 480MB      7.61GB / 0B         718

I’d assume that if the poller weren’t running at all, the cron job itself would be the issue.

Thanks,

With 763 devices, you might need to add additional pollers; a single poller might not be able to keep up with the load.

Hi @Wolfraider ,

Apologies, I’m a bit of a novice with LibreNMS. How do I go about doing that?

I’ve managed to reduce the load by dropping the threads to 2, which currently seems stable, though I’ll check again in the morning as I assume this will impact my polling times. So currently 12 CPUs are showing 24 poller processes (I presume that is correct).

Thanks,

@lukayeh check the docs https://docs.librenms.org/Extensions/Distributed-Poller/
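Very roughly, each additional poller node points at the same central database and RRD storage (via rrdcached) and runs the same wrapper cron, and the wrapper then splits the devices across the pollers. A sketch only; see the docs above for the memcached/rrdcached settings that go with it:

# cron entry on each additional poller node, same as on the existing one
*/5 * * * *     librenms        . /etc/librenms_environment; /opt/librenms/cronic /opt/librenms/poller-wrapper.py 8 >> /dev/null 2>&1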