Erratic Polling Times (Recently)

Hi,

I know some other items can impact this, but in case others have seen this: an interesting change in polling times recently (plot below). And the client is running the same firmware that has been in it for a VERY long time (i.e. a year or two, no updates at all).

[plot: recent polling times]

Thoughts?

Thanks!

@arrmo just one device, or all of them?

Sorry for the delay! Been traveling … :-(.

Happened on several devices, but not all. Of course LibreNMS is being updated (daily), and the underlying Linux machine every so often. So it could be either - any thoughts on how to debug?

Thanks!

BTW, I was curious, so checked the longer term history - not sure what changed, but something did … :smile:

[plots: longer-term polling time history]

I admit, just not sure where to look to help debug this one.

You need to dig in more - is it a specific kind of device giving you issues?

My production polling times are relatively stable.

Yes, it seems to be a specific OS. The trouble is that it’s not very consistent, so it’s hard to capture, of course.

I can run “manual” queries in a loop. Is there a specific command to try out to get more detailed debug information from the poller?

Thanks!

If you want to get crazy you can record debug info for all polls to log files. These will include times for modules and queries. Just add -d to the wrapper in your cron.
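For reference, a stock cron entry with debug enabled would look roughly like this (path and worker count are from a default install, so adjust to yours):

  */5 * * * * librenms /opt/librenms/poller-wrapper.py 16 -d >> /dev/null 2>&1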

OK, seems I did it right (added -d) … :smile:

I was able to capture a case where the response was very slow (~1 min). It seems to be related to the ports module (i.e. I limited the poll to -m ports). Is there a good place to store the log (i.e. a LibreNMS-preferred pastebin)?
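For reference, the run that reproduced it was along these lines (wap-patio is just one of my devices):

  sudo -u librenms /opt/librenms/poller.php -h wap-patio -r -d -m ports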

Thanks!

No preferred one, but there is https://p.libren.ms

So it definitely sounds like the device is acting wonky.

Yes, very odd. I worked through it (I think, fingers crossed yet … ;-)). I did a few things,

  1. Pulled LibreNMS code from before the shift in performance. Made no difference.
  2. Manually executed the commands that were being sent => found that snmpbulkwalk was causing the issue (rough comparison sketched after this list), so,
  3. Temporarily disabled bulk walk for this OS - seems to work (so far)! Odd that something triggered this (Linux change?) … as the OS on the box hasn’t changed in a VERY long time. But let’s see. If this sticks, I’ll submit a PR for it.
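Roughly, the comparison from step 2 looked like this (community string and OID are just examples - .1.3.6.1.2.1.2 is the interfaces tree, which is what the ports module walks):

  # slow / erratic on this box
  /usr/bin/time -f "%e" snmpbulkwalk -v 2c -c public wap-patio .1.3.6.1.2.1.2
  # consistently fast
  /usr/bin/time -f "%e" snmpwalk -v 2c -c public wap-patio .1.3.6.1.2.1.2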

BTW, it would be handy in the debug output to show (in text) the actual (full) snmp command that is executed - just to make sure when manually trying things the command is exactly right. Just a thought.

And - the latest plot,

OK, definitely better, but not 100% fixed - plots below from 2 Tomato boxes. Is there a way to have the debug output show timing for each snmp command, not just after “groups” of them? Just to get to the bottom of what is causing this. Thanks!


Nope, you’ll have to do it the manual way.

Will do, thanks. Working through it. Definitely much better, but not done yet.

Latest update - here is the status after moving to nobulk … clearly a big help. Will submit that, but I want to get to the bottom of the remainder and submit the full fix all at once.

OK, nobulk seems to help, but the remainder is a bit of a mystery to me yet … :frowning:. I tested a couple of commands in a loop (the loop itself is sketched after the list), results below,

  1. /usr/bin/time -f "%e" sudo -u librenms /opt/librenms/poller.php -h wap-patio -r -v
    => Yep, this one breaks … runs for a while, a few seconds each time, but then will jump up to ~15-20 sec (and "fail", given I have a threshold at 15 sec). So then,
  2. /usr/bin/time -f "%e" snmpwalk -v 2c -c public wap-patio
    => Can run this one for hours, always ~1.5-2 sec. So even though the poller.php log is pointing at SNMP, it seems to be at least not snmpwalk (of the entire snmp output).
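For completeness, the loop was just something like this (the sleep interval is arbitrary):

  while true; do
    /usr/bin/time -f "%e" snmpwalk -v 2c -c public wap-patio > /dev/null
    sleep 5
  done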

Thoughts?

Thanks!

Try recording all the snmp commands that happen in one poll and run them.
Also, you forgot the -i option on sudo.

Good catch on the -i, thanks! Will fix that.

And agreed, I can easily add in the snmp commands - that’s actually why I was asking if they are written to the logfile somewhere (so I can copy them exactly). You say to record them … is there an easy way to do that?

Thanks again.

The debug output will show you every command with the exact options. If you also specify -v, it won’t hide the snmp credentials.
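i.e. run something like this and fish the commands out of the output (the grep pattern is a guess at the debug format, adjust to taste):

  sudo -i -u librenms /opt/librenms/poller.php -h wap-patio -r -d -v 2>&1 | grep -iE 'snmp(bulk)?(walk|get)'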

Yep, see that. Was hoping to not have to rebuild all the commands (from the arguments), but that can be done.

Thanks!

Well, that was “fun” … LOL. Found it! It was a combination of needing to add nobulk (submitting a PR for that right now), and after that - it seems that when timeout or retries is set, polls of items that aren’t in the OID table randomly trigger the slowdown => remove those settings, and timing is all clean.
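To illustrate, the failing case reproduced with something along these lines (the -t/-r values and the OID are just examples - the key was asking for an OID the device doesn’t actually have):

  # with timeout/retries set: intermittently stalls
  /usr/bin/time -f "%e" snmpget -v 2c -c public -t 5 -r 3 wap-patio .1.3.6.1.2.1.2.2.1.10.999
  # without them: returns quickly
  /usr/bin/time -f "%e" snmpget -v 2c -c public wap-patio .1.3.6.1.2.1.2.2.1.10.999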

Thanks for the help!