Erratic Polling Times (Recently)

Hi,

I know some other items can impact this, but in case others have seen this: an interesting change in polling times recently (plot below). And the client is running the same firmware that has been in it for a VERY long time (i.e. a year or two, no updates at all).

[plot: recent polling times]

Thoughts?

Thanks!

@arrmo just one device, or all of them?

Sorry for the delay! Been traveling … :-(.

Happened on several devices, but not all. Of course LibreNMS is being updated (daily), and the underlying Linux machine every so often. So it could be either - any thoughts on how to debug?

Thanks!

BTW, I was curious, so checked the longer term history - not sure what changed, but something did … :smile:

[plots: longer-term polling time history]

I admit, just not sure where to look to help debug this one.

You need to dig in more - is it a specific kind of device giving you issues?

My production polling times are relatively stable.

Yes, it seems to be a specific OS. The trouble is that it’s not very consistent, so it’s hard to capture, of course.

I can run “manual” queries in a loop. Is there a specific command to try out to get more detailed debug information from the poller?

Thanks!

If you want to get crazy you can record debug info for all polls to log files. These will include times for modules and queries. Just add -d to the wrapper in your cron.
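For reference, a stock cron entry with debug enabled would look roughly like this (path and worker count are from a default install, so adjust to yours):

  */5 * * * * librenms /opt/librenms/poller-wrapper.py 16 -d >> /dev/null 2>&1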

OK, seems I did it right (added -d) … :smile:

I was able to capture a case where the response was very slow (~1 min). It seems to be related to the ports module (i.e. I limited the poll to -m ports). Is there a good place to store the log (i.e. a LibreNMS-preferred pastebin)?
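For reference, the run that reproduced it was along these lines (wap-patio is just one of my devices):

  sudo -u librenms /opt/librenms/poller.php -h wap-patio -r -d -m ports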

Thanks!

No preferred one, but there is https://p.libren.ms

So it definitely sounds like the device is acting wonky.

Yes, very odd. I worked through it (I think, fingers crossed yet … ;-)). I did a few things,

  1. Pulled LibreNMS code from before the shift in performance. Made no difference.
  2. Manually executed the commands that were being sent => found that snmpbulkwalk was causing the issue (rough comparison sketched after this list), so,
  3. Temporarily disabled bulk walk for this OS - seems to work (so far)! Odd that something triggered this (Linux change?) … as the OS on the box hasn’t changed in a VERY long time. But let’s see. If this sticks, I’ll submit a PR for it.
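Roughly, the comparison from step 2 looked like this (community string and OID are just examples - .1.3.6.1.2.1.2 is the interfaces tree, which is what the ports module walks):

  # slow / erratic on this box
  /usr/bin/time -f "%e" snmpbulkwalk -v 2c -c public wap-patio .1.3.6.1.2.1.2
  # consistently fast
  /usr/bin/time -f "%e" snmpwalk -v 2c -c public wap-patio .1.3.6.1.2.1.2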

BTW, it would be handy in the debug output to show (in text) the actual (full) snmp command that is executed - just to make sure when manually trying things the command is exactly right. Just a thought.

And - the latest plot,

OK, definitely better, but not 100% fixed - plots below from 2 Tomato boxes. Is there a way to have the debug output show timing for each snmp command, not just after “groups” of them? Just to get to the bottom of what is causing this. Thanks!


Nope, you’ll have to do it the manual way.

Will do, thanks. Working through it. Definitely much better, but not done yet.

Latest update - here is the status after moving to nobulk … clearly a big help. Will submit that, but I want to get to the bottom of the remainder and submit the full fix all at once.

OK, nobulk seems to help, but the remainder is a bit of a mystery to me yet … :frowning:. I tested a couple of commands in a loop (the loop itself is sketched after the list), results below,

  1. /usr/bin/time -f "%e" sudo -u librenms /opt/librenms/poller.php -h wap-patio -r -v
    => Yep, this one breaks … runs for a while, a few seconds each time, but then will jump up to ~15-20 sec (and "fail", given I have a threshold at 15 sec). So then,
  2. /usr/bin/time -f "%e" snmpwalk -v 2c -c public wap-patio
    => Can run this one for hours, always ~1.5-2 sec. So even though the poller.php log is pointing at SNMP, it seems to be at least not snmpwalk (of the entire snmp output).
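For completeness, the loop was just something like this (the sleep interval is arbitrary):

  while true; do
    /usr/bin/time -f "%e" snmpwalk -v 2c -c public wap-patio > /dev/null
    sleep 5
  done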

Thoughts?

Thanks!

Try recording all the snmp commands that happen in one poll and run them.
Also, you forgot the -i option on sudo.

Good catch on the -i, thanks! Will fix that.

And agreed, I can easily add in the snmp commands - that’s actually why I was asking if they are written to the logfile somewhere (so I can copy them exactly). You say to record them … is there an easy way to do that?

Thanks again.

The debug output will show you every command with the exact options. If you also specify -v, it won’t hide the snmp credentials.
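i.e. run something like this and fish the commands out of the output (the grep pattern is a guess at the debug format, adjust to taste):

  sudo -i -u librenms /opt/librenms/poller.php -h wap-patio -r -d -v 2>&1 | grep -iE 'snmp(bulk)?(walk|get)'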

Yep, see that. Was hoping to not have to rebuild all the commands (from the arguments), but that can be done.

Thanks!

Well, that was “fun” … LOL. Found it! It was a combination of needing to add nobulk (submitting a PR for that right now), and after that - it seems that when timeout or retries is set, polls of items that aren’t in the OID table randomly trigger the slowdown => remove those settings, and timing is all clean.
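To illustrate, the failing case reproduced with something along these lines (the -t/-r values and the OID are just examples - the key was asking for an OID the device doesn’t actually have):

  # with timeout/retries set: intermittently stalls
  /usr/bin/time -f "%e" snmpget -v 2c -c public -t 5 -r 3 wap-patio .1.3.6.1.2.1.2.2.1.10.999
  # without them: returns quickly
  /usr/bin/time -f "%e" snmpget -v 2c -c public wap-patio .1.3.6.1.2.1.2.2.1.10.999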

Thanks for the help!