Polling took longer than 5 minutes! 250 remote locations

I’m running LibreNMS in AWS to monitor about 250 sites nationwide, with about 5 switches at each site. I keep getting this message most of the time:

Polling took longer than 5 minutes! This will cause gaps in graphs.

I have performed all the performance optimization steps mentioned on the LibreNMS site. Can someone give me more information on how to optimize my instance for a better outcome? Can I schedule the polling differently, since I have around 900 devices on my network?
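
For reference, one of the knobs the Performance docs point at is the poller-wrapper thread count in the cron entry. Here is a sketch, assuming the stock /etc/cron.d/librenms line from this release (the trailing number is the thread count; the exact line on your system may differ):

*/5 * * * * librenms /opt/librenms/cronic /opt/librenms/poller-wrapper.py 16

With 8 CPUs and ~900 devices, raising that number and watching CPU and MySQL load is usually the first step before touching the 5-minute schedule itself, since changing the polling interval also means changing the RRD step.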

I have also changed those file permissions repeatedly, but the check comes back every few days asking me to change the ownership again.
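
One likely reason the ownership keeps drifting back is that the cached files under /opt/librenms/storage are written by the web server / PHP-FPM user rather than by librenms. A quick way to check, assuming a typical Apache or php-fpm setup (process names may differ by distro):

ls -l /opt/librenms/storage/framework/views/ | head
ps -ef | egrep 'httpd|apache2|php-fpm' | grep -v grep

If those files show a different owner, the setfacl/chmod commands from the validate output below should keep the group permissions sticky for newly created files.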

  • Steps to reproduce the issue.
    It happens from time to time; there are no specifics, and I am not able to reproduce it manually.

  • The output of ./validate.php
    [root@nms ec2-user]# /opt/librenms/validate.php
    ====================================
    Component | Version
    --------- | -------
    LibreNMS | 1.48.1-59-g4504b20
    DB Schema | 2019_01_16_195644_add_vrf_id_and_bgpLocalAs (131)
    PHP | 7.2.14
    MySQL | 5.5.60-MariaDB
    RRDTool | 1.4.8
    SNMP | NET-SNMP 5.7.2
    ====================================

[OK] Composer Version: 1.8.4
[OK] Dependencies up-to-date.
[OK] Database connection successful
[OK] Database schema correct
[WARN] Some devices have not been polled in the last 5 minutes. You may have performance issues.
[FIX]:
Check your poll log and see: http://docs.librenms.org/Support/Performance/
Devices:
90100swtennis1.compux.com
92662sw5.compux.com
92175sw4.compux.com
90192sw4.compux.com
91054sw8.compux.com
91099rtr1.compux.com
93083sw2.compux.com
90522sw3.compux.com
10.127.4.12
90632sw2.compux.com
91556swgolfshop.compux.com
90100rtr1.compux.com
and 37 more…
[FAIL] Some folders have incorrect file permissions, this may cause issues.
[FIX]:
sudo chown -R librenms:librenms /opt/librenms
sudo setfacl -d -m g::rwx /opt/librenms/rrd /opt/librenms/logs /opt/librenms/bootstrap/cache/ /opt/librenms/storage/
sudo chmod -R ug=rwX /opt/librenms/rrd /opt/librenms/logs /opt/librenms/bootstrap/cache/ /opt/librenms/storage/
Files:
/opt/librenms/storage/framework/views/cbecc806413997036d8bb74bf909c50d

Thank you,
Hosam

You could try distributed pollers.

https://docs.librenms.org/Extensions/Distributed-Poller/

Thanks for the comment, John.

I will not be able to spin up more servers/pollers in AWS for this purpose; that is why I have this server with 23 GB of RAM and 8 CPUs, which I thought was beefy enough to handle the work.

Distributed Poller is not the answer.

Most likely your enemy is network latency. Check how long things are taking to poll; perhaps you need to disable some modules on some devices.
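
To see where the time goes, you can run the poller against one of the slow devices a single module at a time; each run prints its own "1 devices polled in N secs" summary. A rough sketch, using a few of the usual heavyweight modules as examples (replace <hostname> with one of the devices from the validate warning, and adjust the module list to what your devices actually support):

/opt/librenms/poller.php -h <hostname> -m ports
/opt/librenms/poller.php -h <hostname> -m entity-physical
/opt/librenms/poller.php -h <hostname> -m processors

Whichever module dominates can then be disabled per device in the web UI (Device Settings -> Modules) or globally via the poller_modules options in config.php.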

Here is an example. I have already disabled all the modules that I do not want to use.

/opt/librenms/poller.php 90238swacct1.compuex.com 2019-02-15 14:32:52 - 1 devices polled in 421.3 secs
SNMP [53/416.35s]: Get[21/27.17s] Getnext[0/0.00s] Walk[32/389.18s]
MySQL [181/0.17s]: Cell[213/0.06s] Row[-209/-0.06s] Rows[27/0.02s] Column[1/0.00s] Update[146/0.14s] Insert[3/0.00s] Delete[0/0.00s]
RRD [0/0.00s]: Update[0/0.00s] Create [0/0.00s] Other[0/0.00s]

and here is one from another location:

/opt/librenms/poller.php 91162sw3.compuex.com 2019-02-15 14:32:16 - 1 devices polled in 14.78 secs
SNMP [63/13.09s]: Get[22/2.34s] Getnext[0/0.00s] Walk[41/10.76s]
MySQL [72/0.07s]: Cell[213/0.09s] Row[-209/-0.09s] Rows[26/0.03s] Column[1/0.00s] Update[39/0.03s] Insert[2/0.00s] Delete[0/0.00s]
RRD [0/0.00s]: Update[0/0.00s] Create [0/0.00s] Other[0/0.00s]

Thanks,
Hosam

SNMP [53/416.35s]

All your time is spent transferring data; the link between the poller and the device may be too slow or too far away (high latency).
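
One way to confirm that is to time a raw SNMP walk from the poller host against the slow device, outside of LibreNMS. A minimal sketch, assuming SNMP v2c and walking ifTable by its numeric OID (replace <community> with your read community):

time snmpbulkwalk -v2c -c <community> 90238swacct1.compuex.com 1.3.6.1.2.1.2.2 > /dev/null

If that alone takes minutes, the bottleneck is the WAN round trip rather than the poller host, and raising the SNMP max repeaters for those devices or trimming the polled ports/modules is likely to help more than adding CPU or RAM.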