Polling took longer than 5 minutes! This will cause gaps in graphs

Leandro_Roggerone · 30 June 2021 19:16

Hi guys,
My LibreNMS platform was working great until a few weeks ago.
Then I noticed gaps in all existing graphs.
I also noticed the message “Polling took longer than 5 minutes! This will cause gaps in graphs.” listed very often in the event messages.
Two devices are mainly responsible: they return a lot of data and are constantly mentioned in those messages.

So …
Is there a proper way to debug this?
First, I would look for a hardware bottleneck.
CPU, memory and storage seem OK when looking at the localhost KPIs.
Since it is a VM running on a Proxmox box, I also checked the VM parameters from the Proxmox panel and everything seems to be OK.
Any other ideas? Please share.

Second, I would try looking into the platform.
a) I already tried doubling all poller workers under “global settings->poller->distributed pollers”.
But nothing changed.
Is it OK to do this?
Is it possible to assign a certain number of dedicated pollers to a specific device?

b) Remove unused data.
I will analyze the problematic devices, try to avoid collecting unused data, and disable unused poller modules.

c) Change polling time:
This is the last option I would like to try.
For me, it is OK to use 5 minutes as the default polling interval.
I can increase this to 10 minutes, and perhaps the gaps will disappear, but … I prefer to keep using 5 minutes.

OK … any ideas for debugging this would be welcome.
This is my validate.php output.

bash-4.2$ ./validate.php
====================================
Component | Version
--------- | -------
LibreNMS  | 21.6.0-16-g131f5c7
DB Schema | 2021_06_07_123600_create_sessions_table (211)
PHP       | 7.3.27
Python    | 3.6.8
MySQL     | 10.5.9-MariaDB
RRDTool   | 1.4.8
SNMP      | NET-SNMP 5.7.2
====================================

[OK]    Composer Version: 2.1.3
[OK]    Dependencies up-to-date.
[OK]    Database connection successful
[OK]    Database schema correct

Jellyfrog · 1 July 2021 14:54

Adjust the number of poller threads; see the Performance page in the LibreNMS docs.
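As a rough illustration of what more threads can and cannot fix, here is a minimal Python sketch. It assumes each device is polled by exactly one worker at a time and that per-device poll times are known; the timings below are hypothetical, and this is not how LibreNMS schedules work internally.

```python
import heapq

POLL_INTERVAL_S = 300  # the 5-minute cycle discussed in this thread


def estimate_cycle_seconds(device_seconds, threads):
    """Greedy estimate: each device goes to whichever worker frees up first."""
    if threads < 1:
        raise ValueError("threads must be >= 1")
    workers = [0.0] * threads  # time at which each worker becomes free
    heapq.heapify(workers)
    # Longest devices first, so a slow device is not left until the end.
    for seconds in sorted(device_seconds, reverse=True):
        free_at = heapq.heappop(workers)
        heapq.heappush(workers, free_at + seconds)
    return max(workers)


# Hypothetical timings: 200 ordinary devices at 12 s each and two heavy ones.
timings = [12.0] * 200 + [280.0, 340.0]

for threads in (8, 16, 32):
    total = estimate_cycle_seconds(timings, threads)
    verdict = "fits" if total <= POLL_INTERVAL_S else "too slow"
    print(f"{threads:>2} threads: {total:6.1f} s ({verdict})")
```

Under these assumptions, adding threads shortens the cycle only until the slowest single device becomes the limit: a device that needs more than 300 s on its own will overrun no matter how many threads are configured.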

John_Shrader · 1 July 2021 18:40

We had a similar issue a few months ago. We use DNS names to identify all devices, and our DNS server started having problems keeping up. Installing pdns-recursor and pointing LibreNMS to it fixed it for our install.

Leandro_Roggerone · 1 July 2021 18:49

Good to know!
Thanks.
In my case I'm not using hostnames, just raw IPs.

rhinoau · 2 July 2021 07:48

In addition to this, in my case (and depending on device types) the SNMP max repeaters setting made a huge difference. If some devices have long polling times, you can see them all under All Devices: /devices/format=graph_poller_perf/from=-24hour/to=now/

See my SNMP max repeaters results on this post and the one after it: Debugging graph spikes from high latency links - #6 by rhinoau

Fiddling modules on and off never made much difference to my poller timings.
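To show why max repeaters can matter so much on slow or high-latency devices, here is a back-of-the-envelope Python sketch. It assumes each bulk request returns up to the max repeaters count of values and costs one round trip; the row count and latency are hypothetical, and device processing time and packet size limits are ignored.

```python
import math


def estimate_walk_seconds(rows, max_repeaters, rtt_seconds):
    """Rough time to walk `rows` values when each request returns up to
    `max_repeaters` of them and costs one round trip."""
    if max_repeaters < 1:
        raise ValueError("max_repeaters must be >= 1")
    requests = math.ceil(rows / max_repeaters)
    return requests * rtt_seconds


# Hypothetical device: 20,000 values to fetch over a 40 ms round trip.
for max_repeaters in (10, 30, 50):
    seconds = estimate_walk_seconds(20_000, max_repeaters, 0.040)
    print(f"max repeaters {max_repeaters:>2}: about {seconds:.0f} s on the wire")
```

The real optimum depends on the device, so treat this only as a reason to test a few values and compare the poller performance graphs.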

Leandro_Roggerone · 2 July 2021 16:50

Nice data.
I will also test and share my results.
Thanks.

Adam_Cadd · 5 July 2021 01:30

I had a similar issue.
In my case I had too many devices for the host to handle during polling, and it appeared to start polling again before the current polling run had finished.
To alleviate this, I set up distributed polling on a bunch of Raspberry Pis (I had a surplus) and then assigned devices (by site) to them. This can also be achieved with Docker containers.

kalamchi75 · 8 July 2021 10:12

Hi Leonardo,

First you might want to check those suspected devices rather than LNMS for load and other performance issues that might cause them to respond to the poller slower than before.
I would also go with Jellyfrog’s suggestion above and adjust the poller threads. You might want to do that a few times until you reach a good balance.
I would also go with Adam’s suggestion on creating distributed pollers closer to the problematic machines (if they are in a different location that requires longer path/latency). Then assign your problematic machines to be polled by those distributed pollers.

As you can see above, our master poller is polling 240+ devices in our facility in under 100 sec, while our distributed pollers, physically located in a remote office, are polling 40+ devices in that remote office in under 200 sec. So the 5-minute poll cycle is met in both cases.
Before setting up the distributed pollers, I had real issues polling the remote devices. This caused a lot of gaps because the poller cycles could not finish and might overlap, causing further load and memory issues in LNMS.
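The arithmetic behind "the cycle is met" can be sketched in Python as follows. The durations are the upper bounds quoted above; the poller names are made up for illustration.

```python
POLL_INTERVAL_S = 300  # 5-minute cycle


def headroom(poller_seconds, interval=POLL_INTERVAL_S):
    """Return the fraction of the interval left unused (negative if overrun)."""
    return (interval - poller_seconds) / interval


# Upper bounds quoted in the post above; the poller names are made up.
pollers = {"master": 100, "remote-office": 200}

for name, seconds in pollers.items():
    print(f"{name}: {seconds} s used, {headroom(seconds):.0%} headroom")
```

A negative value would mean the poller overruns the interval, which is the situation that produces the "Polling took longer than 5 minutes" event.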

Best Regards

Leandro_Roggerone · 8 July 2021 12:12

Hi @kalamchi75, glad you wrote to me.
I took all your suggestions and my graphs improved a lot.
I had never paid attention to the poller / performance section before; now I understand its importance.
About grouping problematic machines … this is my next goal.
In fact I posted a forum question regarding poller groups.
I just need to create 2 or 3 pollers on the same server and assign devices according to specific criteria.
Then I would like to set the polling interval on those pollers.
Can you provide some examples / configs for that?

kalamchi75 · 9 July 2021 08:27

Hi Leandro,

I am not sure how you would create pollers on the same server itself.
In my case, I have my remote pollers physically located at the remote office. They poll the remote devices there and send the data to the master.

Below is the distributed poller configuration on the master server:

# Enable distributed pollers
$config['distributed_poller']                = true;
$config['rrdcached']                         = "localhost:42217";
$config['distributed_poller_memcached_host'] = 'localhost';
$config['distributed_poller_memcached_port'] = '11211';

And below are the bits of configuration required on the remote poller:
1- You need to point the database connection to your master poller DB. Make sure you grant privileges to the DB user if the connection comes from a different IP address (in my case the remote poller is on a different server with a different IP address):

$config['db_host'] = 'librenms-master.xxx.xxx';
$config['db_port'] = '3306';
$config['db_user'] = 'librenms';
$config['db_pass'] = 'xxxxxxxxxx';
$config['db_name'] = 'librenms';

Now, the distributed poller config lines:

$config['distributed_poller_name']           = 'librenms-poller01';
$config['distributed_poller_group']          = '2';
$config['distributed_poller_memcached_host'] = "librenms-master.xxx.xxx";
$config['distributed_poller_memcached_port'] = 11211;
$config['distributed_poller']                = true;
$config['rrdcached']                         = "librenms-master.xxx.xxx:42217";
$config['update']                            = 0;

You will also need to install rrdcached and memcached on the master poller, as they are required if you want to use distributed pollers. As you can see above, you will need to point your distributed poller to the memcached and rrdcached host/ports.

Once this is done, the distributed poller should pop up automatically in the GUI.
Now, notice the line:

$config['distributed_poller_group']          = '2';

This is the group number (ID) that your distributed poller(s) are assigned to. It can be any number. You will use this number when creating your poller groups in the GUI (ID 2 in this example).

So if you have more than one distributed poller, you can group them by configuring the same number, or you can split them into multiple groups by assigning different numbers.

The last step is to specify, in each device’s config in the GUI, which poller group should poll that device.

I hope this was helpful

Best Regards

PipoCanaja · 10 July 2021 15:20

@Leandro_Roggerone
The polling interval cannot be changed per poller within a LibreNMS deployment, even with multiple pollers. This is not possible without adding/changing a lot of code.
Of course, patches are welcome.
Bye

Leandro_Roggerone · 12 July 2021 11:22

Thank you @kalamchi75.
I appreciate your detailed answer.
However, this is not what I want to accomplish right now.
For the moment I have a single-server architecture.
I hope it helps someone trying to create a distributed setup.
Leandro.

Leandro_Roggerone · 12 July 2021 11:28

@PipoCanaja:
Understood, having different polling intervals is not currently possible.
So, is it a good idea to isolate heavily loaded devices into separate pollers?
I'm talking about creating multiple pollers on the same machine.
That way, if this poller fails or a device within it takes too long to respond, it would not break or create gaps in other devices' graphs.
Does it make sense?
Regards.
Leandro.

Leandro_Roggerone · 12 July 2021 11:53

The polling time depends on the number and nature of the devices you are polling.
In my case I have 2 very problematic devices:
A Huawei main core, with a lot of interfaces, routes and BGP peers.
And a Huawei OLT, with a lot of interfaces (it creates one per GPON interface), processors, health, and so on.
I would like to handle them differently from the rest of the devices. That's why I want to create poller groups, but always on the same machine.
Leandro.

kalamchi75 · 12 July 2021 12:32

You are most welcome, Leandro.
Unfortunately I haven’t set up multiple pollers on the same server, so I am not sure how this is achieved.
Are you using a VM or a bare-metal server for your LNMS?
If you are using a VM, then I would recommend using a second VM as a distributed poller. It might make your deployment easier since you don’t have to fiddle with a multi-poller, single-server setup.
Just a thought!

Best Regards

Leandro_Roggerone · 12 July 2021 15:08

@kalamchi75, my LNMS is running on a VM (under Proxmox).
So yes, I can try creating a second poller on a new VM. That is a great idea.
First I will try a little more to create one or two more pollers on the same machine.
If I cannot get it working, I will follow your advice.
Leandro.

kalamchi75 · 12 July 2021 15:09

Good luck.
If you manage to make two separate pollers work on the same server, kindly share the procedure.

Best regards

Leandro_Roggerone · 12 July 2021 15:11

Of course, I will.
There is an open post about it here.
I will also try to make it work in a testing environment.
Regards.

PipoCanaja · 12 July 2021 17:31

Hi @Leandro_Roggerone
Yes, having multiple pollers will ensure that if one poller is failing, it won’t impact devices that are handled by the other.
For that purpose, the pollers must run in different VMs. If they are on the same VM and one of the pollers takes more than $polling-interval to do its job, it will start killing the machine sooner or later, which will impact the other poller running on the same VM.
So I would not even spend time trying to run the pollers on the same VM.
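The overlap effect can be illustrated with a small Python sketch. It assumes a new poller run is launched every interval whether or not earlier runs have finished, and that each run takes a fixed time; both are simplifications, since a real machine under contention would likely slow down further.

```python
def concurrent_runs(run_seconds, interval_seconds, cycles):
    """Count poller runs still active at the start of each cycle, assuming a
    new run is launched every interval whether or not older ones finished."""
    counts = []
    for cycle in range(cycles):
        now = cycle * interval_seconds
        # Run k started at k * interval and ends run_seconds later.
        active = sum(
            1
            for k in range(cycle + 1)
            if k * interval_seconds + run_seconds > now
        )
        counts.append(active)
    return counts


print(concurrent_runs(250, 300, 5))  # each run finishes before the next starts
print(concurrent_runs(700, 300, 5))  # several runs overlap on the same machine
```

With a 700 s run and a 300 s interval, several runs end up competing for the same CPU and memory, which is why a slow poller on a shared VM can drag the other one down.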

kalamchi75 · 13 July 2021 07:17

@Leandro_Roggerone I just noticed that I’ve been calling you Leonardo instead of Leandro the whole time.
Sorry man.

I should learn to use the @ more often hhh.

All the best