Poller performance

I thought I'd ask here before making it a feature request…
Using
LibreNMS | 1.37-40-g6b68d5f
DB Schema | 240
PHP | 7.0.25-0ubuntu0.16.04.1
MySQL | 10.0.33-MariaDB-0ubuntu0.16.04.1
RRDTool | 1.5.5
SNMP | NET-SNMP 5.7.3

It all works well, but I have about 800 devices on a WAN. To poll all of them within the 5 minute interval I increased the poller concurrency and added processors to the server. Now I have a 6-core server that sits at 100% CPU for about 4:30 of every 5 minutes. Polling also generates some 20k "sessions" on my external firewall.
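For reference, the concurrency I'm talking about is just the worker count at the end of the standard cron entry, something like this (path, user and the value 48 are examples from my setup):

# /etc/cron.d/librenms - the trailing number is how many poller workers run in parallel
*/5 * * * * librenms /opt/librenms/poller-wrapper.py 48 >> /dev/null 2>&1
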
I migrated from Cacti, where a server with a single CPU copes with the load and causes no trouble on the firewall.
I followed the tuning guides but found nothing significant. Is there a "real" poller binary, like Cacti's spine, to support real-world setups? I saw a feature request with no answers (Implement high performance asynchronous SNMP poller).
Thanks

We poll a lot more than Cacti does by default. Have you gone through the performance doc?

The poller definitely is a weak spot - something that could efficiently handle multiple concurrent in-flight requests per device would be a dream (like the AKIPS poller).

For my setup I use Poller Service now (in standalone, not distributed mode):

https://laf.github.io/docs/Extensions/Poller-Service/

Not sure if this thing is still maintained, but it does work, assuming you get the DB configuration quirk right.

It is still effectively a wrapper for poller.php, but it runs continuously as a service (as opposed to being fired up from crontab as in the standard configuration) and seems to handle worker threads more efficiently.
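Once it's set up I just manage it with systemd; the unit name below is whatever you called your service file (mine is shown as an example), and remember to drop the poller-wrapper.py line from cron so the two don't overlap:

# enable the continuously running poller (hypothetical unit name - adjust to your install)
sudo systemctl enable --now librenms-poller-service.service
# follow it cycling through devices
sudo journalctl -u librenms-poller-service.service -f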

I have around 100 devices polled every minute with a load average of around 3.0 (6-core machine). No firewall here.

As all this polling code is PHP, make sure you run the latest PHP 7.2 with opcache enabled for CLI - there are a lot of optimizations in the latest version of PHP that visibly reduce CPU usage and improve performance.

I understand you poll more and differently than Cacti, and the result is a better product - that's why I switched.
I ran through the optimization doc and I'm still testing the various parameters; it takes trial and error, and you can achieve some improvement, but not the tenfold one I need, particularly regarding the concurrent sessions on the firewall.
I saw your (?) post about a faster poller and asked a question in that thread.

10-opcache.ini configuration that improves poller performance (with PHP 7.2):

zend_extension=opcache
opcache.enable=1
; cache opcodes for CLI processes too (poller.php runs from the command line)
opcache.enable_cli=1
; file-backed cache so compiled code survives between short-lived CLI runs
opcache.file_cache="/tmp/cache/"
;Make sure /tmp/cache/ exists and is writable
opcache.file_cache_only=0
opcache.file_cache_consistency_checks=1
opcache.memory_consumption=256

The problem with poller.php is that, unlike the LibreNMS web interface (which uses a continuously running php-fpm handler), it gets fired up from the command line for every poll job, spawning a new PHP interpreter which is killed at the end of it - so the opcode cache is not preserved between runs unless the file-based cache is enabled, as above.
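A quick way to verify the file cache is actually being used by the CLI runs: after a few polling cycles the cache directory should fill with compiled copies of the poller code (path as configured above):

# compiled scripts land under /tmp/cache/<system-id>/... as .bin files
find /tmp/cache -name '*.bin' | wc -l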

I would say this change alone decreases CPU usage by around 10-20% (plus PHP 7.2 is much more efficient by itself).

Thanks for your suggestions. I upgraded to 7.2 and opcache is fully working (98% hit rate with 4% of memory used), but this seems to change very little, if anything at all. I'm looking at poll times and CPU utilization and both show the previous values. I will wait some time for it to settle and generate some graphs.


After the gap (the upgrade window), the new setup with 7.2 versus the preceding data with 7.0.
PHP itself is not the magic bullet - not even a bullet, actually…

Please don't use the laf.github.io docs. Use the official docs.librenms.org

What exactly have you done from https://docs.librenms.org/#Support/Performance/ to improve performance?

The things that nktl suggested above - basically upgrading PHP to 7.2, with no success in terms of performance; the upgrade itself went fine and LibreNMS works as expected.
As for the doc, I applied all the configurations: rrdcached has been there since day one, I'm improving MySQL in terms of I/O and memory usage day after day, and I changed the SNMP parameters only to find that the defaults basically work best.
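For completeness, these are the SNMP knobs I was tweaking before settling back on the defaults (values are just examples; option names as I have them from the performance doc):

# appended to /opt/librenms/config.php while testing (values are examples)
cat >> /opt/librenms/config.php <<'EOF'
$config['snmp']['max_oid'] = 10;        // OIDs per snmpget request
$config['snmp']['max_repeaters'] = 30;  // max-repeaters for bulk requests
EOF
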
When the poller is running I see high activity (CPU and I/O) from rrdcached, but nothing that can be optimized further, at least not in any documented way.
top on the machine shows that several GB of memory are used for buffers/cache and no swap is in use.

There are still more steps in https://docs.librenms.org/#Support/Performance/ that you need to try.

Also, check your poller log to see which devices are taking the longest to poll.

Kevin, I said I read and applied everything useful from that doc; I just did not include the last 15 days of testing in my response.
For the longest-polling devices I set specific SNMP parameters to lower the poll times, with little success. Some devices are slow in themselves, others are slow only via LibreNMS: issuing the snmpbulk from the command line takes 5-10 seconds, while in the poller log I have seen even 150 seconds - but I think that is because polling seven hundred devices in parallel is not the same as scanning a single host.
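For comparison, this is roughly how I time a single device outside the normal cycle (hostname and module list are just examples):

cd /opt/librenms
# poll one device with debug output; per-module timings are printed as it runs
time ./poller.php -h core-sw-01 -m ports -d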

Okay, well, I'm just trying to help you and see what exactly you have tried.
If it were me, I would see which devices take the longest to poll, then check the poller log for those devices, see which modules take the longest and possibly disable them for those devices. If it's the ports, maybe you could try per-port polling (see the sketch below).
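As a rough sketch, something along these lines in config.php (option names as I recall them from the docs - double-check against the current docs; per-device module overrides can also be set from the device's settings page in the web UI):

# append to /opt/librenms/config.php (examples only)
cat >> /opt/librenms/config.php <<'EOF'
$config['poller_modules']['entity-physical'] = false;  // drop a heavy module globally
$config['polling']['selected_ports'] = true;           // poll only selected/active ports
EOF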

It's not, but it's definitely not a 10-fold difference.

I've never seen snmp max oids and snmp max repeaters make no difference at all - have you done the manual tests like the docs say?
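By manual tests I mean something like timing a bulkwalk of the interface table with different max-repeaters values and picking the fastest (host and community here are placeholders):

time snmpbulkwalk -v2c -c public -Cr10 192.0.2.1 IF-MIB::ifTable > /dev/null
time snmpbulkwalk -v2c -c public -Cr30 192.0.2.1 IF-MIB::ifTable > /dev/null
time snmpbulkwalk -v2c -c public -Cr50 192.0.2.1 IF-MIB::ifTable > /dev/null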

Yes, I wrote that above. I didn't say it makes no difference at all; I resorted to the defaults because they work well in my setup. Also, since I poll WAN devices, sometimes it is the network itself that injects latency, so the tests are not 100% repeatable, and 4.5 versus 5.5 seconds for a single device is effectively the same. I have a +/- 30 second variance even without changing the parameters, with an average of 230 seconds per poll cycle now. Going to 200s would be good, but not a solution. Going to 30s would be nice.
The biggest problem is the number of sessions on the firewall. I thought changing the SNMP bulk parameters could have some effect, but in practice it has not: I have about 20K sessions going out of the LibreNMS server.
This alone is a problem: the firewall runs out of available sessions, the IDS tells us we're going to die and that all the hackers in the world are having a meetup in our LAN, and so on…

Reduce the session timeout on your firewall for SNMP traffic to something like 10s, so it gets recycled almost instantly when there is no traffic. I suppose the default could be 3600s, which would keep these old sessions lingering in the table for an hour (or possibly more).

Applying this suggestion right now, but there are not a lot of problematic devices with out-of-scale times; the majority have times between 1 and 10 seconds.

@nktl I also thought about this, but we have about 500 servers and dozens of apps behind the firewall; I just can't change some parameters and hope everything still works. This setup is the result of years of fine tuning between load balancers, servers, timeouts, …

Assuming we are talking about any half-decent firewall, you can decrease the session timeout for SNMP traffic only; it won't affect anything else.

The default session timeout for SNMP/UDP should be around 30-60s anyway; possibly someone was playing with this option at some point and increased it excessively, resulting in the session bloat you are witnessing now.

Would be good to confirm existing configuration at least.
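If the firewall happens to be Linux/netfilter based (a big assumption - vendor boxes have their own per-service timeout settings), the current UDP timeouts and the SNMP session count can be checked like this:

# UDP conntrack timeouts (defaults are typically 30s / 120s)
sysctl net.netfilter.nf_conntrack_udp_timeout net.netfilter.nf_conntrack_udp_timeout_stream
# how many tracked sessions are headed for SNMP (udp/161)
conntrack -L -p udp --orig-port-dst 161 2>/dev/null | wc -l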

The thing I find strange is that we have several systems monitoring our network, some polling the same devices as LibreNMS and many more besides. We have Cacti, Nagios, Zabbix, CA Spectrum and probably others I don't remember right now, and that's limiting the list to SNMP alone. I disabled each of these systems one by one to see the difference in sessions on the firewall, and the biggest difference was with CA Spectrum, at about 200 sessions. LibreNMS itself weighs in with over 20,000 (the last disable-it test actually cut 25k sessions within a few minutes). So I'm going to ask the firewall guys for advice, but their response is easy to guess, and for a good reason. It takes minutes for the sessions to drop after I disable the LibreNMS poller.