Polling Warning - Devices polling time

jsdurling · 14 September 2017 19:04

I’m wondering if anyone can explain how the “polling duration” and the “last polled” time affect each other (if they do at all).

I have roughly 600 devices, and the polling duration for each device runs from 10s to 200s. When I run validate, I get a “Warning” that “Some devices have not been polled in the last 5 minutes”, along with a count of roughly 50-60 devices that haven’t been polled in that 5-minute time frame.

I’m not sure how to go about troubleshooting this. I’ve disabled all unneeded modules, and I’ve enabled per-port polling on the devices that tend to take a long time.

Certainly some of my devices being polled have some latency, just due to locations (remote from the librenms server), but overall latency generally doesn’t exceed 50-100ms, and no packet loss… generally.

My librenms server is not overly taxed, 8-10% of RAM in use, 20-30% CPU usage in general.

I’m just wondering if anyone has further suggestions on trying to troubleshoot this. I’ve run through all the docs and tweaks from the performance section in the documentation.

[root@nms1 librenms]# ./validate.php

Component	Version
LibreNMS	1.31.03-43-gf158a56
DB Schema	206
PHP	7.0.22
MySQL	5.5.52-MariaDB
RRDTool	1.4.8
SNMP	NET-SNMP 5.7.2

====================================

[OK] Database connection successful
[OK] Database schema correct
[WARN] Some devices have not been polled in the last 5 minutes.
You may have performance issues. Check your poll log and see: http://docs.librenms.org/Support/Performance/
sw0.blah.dev
and 54 more…

murrant · 14 September 2017 19:06

That validation simply checks last_polled time and sees if it is under 5 minutes.
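
To make the check concrete, here is a minimal Python sketch of the same idea: compare each device's last_polled timestamp with the current time and flag anything older than 5 minutes. This is not the actual validate.php code; the function name, the dictionary layout, and the hostnames other than sw0.blah.dev are assumptions made for illustration.

```python
from datetime import datetime, timedelta

POLL_WINDOW = timedelta(minutes=5)


def stale_devices(last_polled, now, window=POLL_WINDOW):
    """Return, sorted, the hostnames whose last poll is older than `window`.

    `last_polled` maps hostname -> datetime of the last completed poll,
    or None if the device has never been polled (hypothetical layout).
    """
    return sorted(
        host
        for host, polled_at in last_polled.items()
        if polled_at is None or now - polled_at > window
    )


# Illustrative data only: one device 7.5 minutes old, one fresh, one never polled.
now = datetime(2017, 9, 14, 19, 6, 0)
last_polled = {
    "sw0.blah.dev": datetime(2017, 9, 14, 18, 58, 30),
    "rtr1.example": datetime(2017, 9, 14, 19, 3, 10),
    "never.example": None,
}
print(stale_devices(last_polled, now))  # ['never.example', 'sw0.blah.dev']
```

Note that a check like this only looks at the stored timestamp, so a poller that writes data but dies before updating last_polled would still be flagged.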

Do you have any gaps in your graphs?

jsdurling · 14 September 2017 19:28

Thanks for replying!

I do have some gaps in my graphs, not all of them, but some. It seems to vary as to what devices end up having the gaps, but there are some that will have small gaps.

-Jeff

murrant · 14 September 2017 19:34

Yeah, you are having troubles with performance then.

Did you read through http://docs.librenms.org/Support/Performance/?

You might need to check the poller-wrapper settings for sure. https://docs.librenms.org/Support/Performance/#optimise-poller-wrapper

Eases · 15 September 2017 07:19

One way I monitored/debugged the poller.php process was with the interactive command-line tool ‘htop’. Once it was running, I used the Filter option (F4) and filtered on “poller.php”. This way you can get some idea of what is happening with the crontab and the poller every 5 minutes… (I had an extremely slow network switch.)

howardjones · 15 September 2017 11:27

I’m getting this warning too (librenms noob here), but the devices in question appear to have up-to-date graphs… so: do I actually have a performance problem? or a spurious warning box problem?

jsdurling · 15 September 2017 11:36

@howardjones I believe @murrant is saying that the warning isn’t always indicative of an issue, unless you are seeing gaps in graphs.

-Jeff

Eases · 15 September 2017 11:43

Hmmm… I guess it is important for polling to finish within the 300 seconds (5 minutes), because the next polling run starts all over again and does not skip the polls that are still in progress.
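
A rough way to reason about that 300-second budget is sketched below in Python: add up the per-device polling durations and divide by the interval to get a lower bound on how many pollers must run in parallel. The function name and the device mix are hypothetical (the thread only gives "roughly 600 devices" at 10s to 200s each), and the result is only a lower bound, since real scheduling never packs the work perfectly.

```python
import math

POLL_INTERVAL = 300  # seconds between poller runs (5 minutes)


def min_parallel_pollers(durations, interval=POLL_INTERVAL):
    """Lower bound on parallel pollers needed to finish all devices in one interval."""
    if not durations:
        return 0
    longest = max(durations)
    if longest > interval:
        # No amount of parallelism helps a single device that overruns the interval.
        raise ValueError(
            f"one device takes {longest}s, longer than the {interval}s interval"
        )
    return math.ceil(sum(durations) / interval)


# Hypothetical mix: 500 devices at 10s each and 100 devices at 200s each.
durations = [10] * 500 + [200] * 100
print(min_parallel_pollers(durations))  # 84
```

If the number of threads actually configured for the poller wrapper is well below an estimate like this, some devices cannot be reached before the next run starts.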

howardjones · 15 September 2017 11:53

The database entry for the devices (in the list of unpolled devices) says they haven’t been polled in 90 minutes, but the graphs for them clearly show data for that time. Is there some additional logging I can enable to figure out why the database isn’t updated? I guess something is dying before the end of the poll, but far enough in to actually update rrd files?

So, from poller.php -h 12 -d I can see that actually the poller coredumps!

Component: 7
Index:      42262
Peer:       :123
Stratum:    1.3.6.1.4.1.9.9.168.1.2.1.1.9.42262  = 4
Offset:     1.3.6.1.4.1.9.9.168.1.2.1.1.23.42262 = 233
Delay:      1.3.6.1.4.1.9.9.168.1.2.1.1.24.42262 = 92
Dispersion: 1.3.6.1.4.1.9.9.168.1.2.1.1.25.42262 = 1048576

SQL[SELECT `C`.`id`,`C`.`device_id`,`C`.`type`,`C`.`label`,`C`.`status`,`C`.`disabled`,`C`.`ignore`,`C`.`error`,`CP`.`attribute`,`CP`.`value` FROM `component` as `C` LEFT JOIN `component_prefs` as `CP` on `C`.`id`=`CP`.`component` WHERE  ( `device_id` = '12' )] 
Segmentation fault (core dumped)

I also have a similar SQL error in the normal librenms.log:

2017-09-15 12:45:13 MySQL Error: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '' at line 1 (SELECT `C`.`id`,`C`.`device_id`,`C`.`type`,`C`.`label`,`C`.`status`,`C`.`disabled`,`C`.`ignore`,`C`.`error`,`CP`.`attribute`,`CP`.`value` FROM `component` as `C` LEFT JOIN `component_prefs` as `CP` on `C`.`id`=`CP`.`component` WHERE )

Which seems to be half a query?

The last thing in the debug output before the crash was related to the ntp poller, so I just disabled that module in the device’s modules page, and now the poller completes, but how would I faultfind that further?

Eases · 15 September 2017 12:00

And what is the output of:
ps -aux | grep "poller.php"

Or more specific if the problem device has ID 12:
ps -aux | grep "poller.php -h 12"

howardjones · 15 September 2017 12:42

Right now, nothing, but a minute ago, perhaps 25 processes, all in pairs created in the last minute that look like:

librenms 20471  0.0  0.0 106116  1152 ?        S    13:05   0:00 /bin/sh -c /usr/bin/env php /opt/librenms/poller.php -h 21 >> /dev/null 2>&1
librenms 20472  2.2  0.2 291392 16940 ?        S    13:05   0:00 php /opt/librenms/poller.php -h 21
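
The pairs in that listing appear to be a shell wrapper plus the PHP process it launched. To spot a device with several pollers running at once, one option is to count only the PHP processes per -h value. The Python sketch below does that on captured ps text; the function name is hypothetical, and it assumes the command lines look like the two shown above.

```python
import re
from collections import Counter

# Matches the device argument passed to poller.php, e.g. "poller.php -h 21".
POLLER_RE = re.compile(r"poller\.php -h (\S+)")


def pollers_per_device(ps_output):
    """Count poller.php processes per device in `ps` output text."""
    counts = Counter()
    for line in ps_output.splitlines():
        # Skip the "/bin/sh -c ..." wrapper and any grep used to filter the list,
        # so each running poller is counted once.
        if "/bin/sh -c" in line or " grep " in line:
            continue
        match = POLLER_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = """\
librenms 20471  0.0  0.0 106116  1152 ?        S    13:05   0:00 /bin/sh -c /usr/bin/env php /opt/librenms/poller.php -h 21 >> /dev/null 2>&1
librenms 20472  2.2  0.2 291392 16940 ?        S    13:05   0:00 php /opt/librenms/poller.php -h 21
"""
counts = pollers_per_device(sample)
overlapping = {device: n for device, n in counts.items() if n > 1}
print(dict(counts), overlapping)
```

A device that shows up in `overlapping` has a poll that is still running when the next one starts.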

Eases · 15 September 2017 13:55

Hmm… Yeah, that’s as it should be.
(I did have more than 15 instances of poller.php running concurrently on the same device.)

Sorry, I can’t help you further, but there are many here who can.

laf · 15 September 2017 20:20

@howardjones Can you pastebin the output of:

./poller.php -h HOSTNAME -d -r -f

howardjones · 18 September 2017 09:50

Hmmm, not easily. It’s chock full of identifying information (ifAliases, IPs in various formats etc).

The last page is:
https://pastebin.com/sDZJ4WMU
followed by Segmentation Fault

laf · 18 September 2017 17:09

And then a seg fault straight after? Are you sure you don’t have hardware issues?

howardjones · 19 September 2017 08:55

I don’t have hardware

It’s the same VM that I run Cacti on, and have done for several years. I was just trying out a librenms installation side-by-side before possibly moving over. It’s also consistent which devices this is happening with - a few rather old VXRs. The routers are going soon, so it’s not actually a big deal for me, apart from that orange box that keeps popping up!

murrant · 19 September 2017 13:12

Any chance you could send us the data here: https://docs.librenms.org/Support/FAQ/#faq20 for one of the VXRs?

We can then try to reproduce the problem.

howardjones · 19 September 2017 13:37

Yep, sure. Just so I understand, this doesn’t automatically end up in the snmpsim repo, does it? (which is awesome, by the way, just not with production router dumps!)

murrant · 22 September 2017 17:46

I don’t put anything there that isn’t fully sanitized or approved. That being said, you should never send any data you don’t feel comfortable sending. Replacing public IPs and names with bogus data in the capture is a good idea if you are concerned about it.
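
As one possible way to do the IP part of that, here is a small Python sketch that swaps public IPv4 addresses in a text capture for addresses from the RFC 5737 documentation ranges, using the same replacement each time an address recurs. It is an illustration rather than a LibreNMS tool: the function name is hypothetical, it does not touch hostnames or ifAliases, and it will miss addresses embedded in OID indexes or hex-encoded values, so the result still needs a manual review.

```python
import ipaddress
import itertools
import re

# Exactly four dotted octets that are not part of a longer dotted number (e.g. an OID).
IPV4_RE = re.compile(r"(?<!\d)(?<!\d\.)\d{1,3}(?:\.\d{1,3}){3}(?!\d|\.\d)")

# RFC 5737 documentation ranges, reserved for examples.
DOC_NETS = ("192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24")


def make_sanitizer():
    """Return a function that replaces public IPv4 addresses in a string."""
    pool = itertools.chain.from_iterable(
        ipaddress.ip_network(net).hosts() for net in DOC_NETS
    )
    mapping = {}  # original address -> replacement, so repeats stay consistent

    def replace(match):
        text = match.group(0)
        try:
            addr = ipaddress.ip_address(text)
        except ValueError:
            return text  # not a valid address (e.g. 999.1.1.1): leave untouched
        if not addr.is_global:
            return text  # private, loopback, and reserved ranges are kept as-is
        if text not in mapping:
            mapping[text] = str(next(pool))  # fails once the pool is exhausted
        return mapping[text]

    return lambda s: IPV4_RE.sub(replace, s)


sanitize = make_sanitizer()
print(sanitize("peer 8.8.8.8, local 10.0.0.1, again 8.8.8.8."))
print(sanitize("1.3.6.1.4.1.9.9.168.1.2.1.1.9.42262 = 4"))
```

The lookarounds in the pattern are there so that OIDs such as the NTP ones in the debug output are left alone instead of being mistaken for addresses.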