Problems going over 32 poller threads

Ben_Gulledge · 22 May 2019 13:13

All, need help solving a very bizarre problem I am having increasing my threads beyond 32.

I have 4 pollers in 1 group. I am testing scaling, as we will need to poll 20,000 devices. MySQL/rrdcache/memcache are running together on 1 server.

I maxed out the initial 2 pollers we had. I started with the base install, so 16 threads on 2 cores (8 threads/core), later went to 4 pollers, then went from 2 CPUs to 8 CPUs on each. I doubled the thread count from the base 16 to 32 (4 threads/core). No issues. If I try to go above 32, the polling cycle time drops as expected, but half of the devices say they didn’t poll. I don’t see any errors and the UNIX admins don’t see any on their side either.

Happy to share whatever is needed; the config is pretty basic. I have set SNMP max repeaters to 50 and disabled most polling.

I have been stuck here since last week and can’t find anything online that helps. I know the base (I believe outdated) recommendation is 2 threads/core, but my base install was 8/core and it was fine. We have pretty beefy hardware. We checked memcache and it’s fine; not sure where else the problem may lie.

murrant · 23 May 2019 04:35

Going too high overloads your system.
This is a 5 minute interval:

|                                                       |

Let’s say this is the duration for running your polling with 8 threads:

||||||||||||||||||||||||||||||||||||||||||||||             |

Now this is what it takes with 32 threads:

|||||||||||                                                |

But the amount of resources consumed to poll your devices is fixed (primarily memory and mysql connections). So instead of spreading the memory usage out over 4 minutes, you have now used all those resources within a 1 minute timespan. This overloads your server and breaks stuff. You want your total polling time to be as close to 5 minutes as possible while still staying under it comfortably.
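As a rough illustration of that trade-off, here is a minimal Python sketch. It assumes perfect parallelism and a made-up workload of 1,920 seconds of total polling work per cycle, chosen only so that 8 threads land near the 4 minute mark and 32 threads near 1 minute. The function names and the 240 second target are hypothetical, not LibreNMS settings.

```python
import math

INTERVAL_S = 300  # the 5 minute polling interval


def cycle_seconds(total_work_s: float, threads: int) -> float:
    """Wall-clock time of one polling cycle, assuming perfect parallelism."""
    return total_work_s / threads


def min_threads(total_work_s: float, target_s: int = 240) -> int:
    """Fewest threads that finish the cycle within target_s seconds.

    target_s is an assumed comfort margin below INTERVAL_S.
    """
    return math.ceil(total_work_s / target_s)


# Hypothetical workload: 1,920 seconds of polling work per cycle in total.
work = 1920.0
for threads in (8, 32):
    # Each running thread holds its own share of memory and DB connections,
    # so peak concurrent load scales with the thread count.
    print(f"{threads} threads: {cycle_seconds(work, threads):.0f} s per cycle, "
          f"{threads} pollers active at once")

print("threads needed to finish within 240 s:", min_threads(work))
```

The total work stays the same in both cases; only the peak concurrency changes, which is why a shorter cycle can exhaust a shared resource that a longer one never touches.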

Did you check logs/librenms.log? Likely there are errors about too many MySQL connections.
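A quick way to check could look like the sketch below. It assumes the log entries contain MySQL's standard "Too many connections" error text; the exact wording LibreNMS writes to the log is an assumption here, and the sample lines are made up for illustration, not real LibreNMS output.

```python
def count_connection_errors(lines):
    """Count lines that mention MySQL's 'Too many connections' error."""
    needle = "too many connections"
    return sum(1 for line in lines if needle in line.lower())


# Made-up sample lines, only to show the function working.
sample = [
    "poller finished device 42",
    "SQL error: Too many connections",
    "poller finished device 43",
]
print(count_connection_errors(sample))

# Against the real file (path relative to the LibreNMS install directory):
# with open("logs/librenms.log", errors="replace") as fh:
#     print(count_connection_errors(fh))
```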

murrant · 23 May 2019 04:39

Also, perhaps this might interest you:

https://docs.librenms.org/Extensions/Dispatcher-Service/

Ben_Gulledge · 23 May 2019 05:34

No errors anywhere. As for overload, that’s not really a concern because they are VMs on a UCS chassis, so we can add more resources easily. Right now the resources are not being exhausted, which is why I am struggling to fix the problem. We did tweak MySQL just now; it was getting close to max connections but was not maxed yet. I am doing some tests and will post an update.

Ben_Gulledge · 23 May 2019 05:38

Btw, thank you for the input. We are cutting our teeth on this, and I realize we may be pushing the limits of what many people do with it.

I do plan to test the new poller, that is on my list for sure but it’s not a simple move. It’s not clear why we need a Redis DB to coordinate the nodes if we already have memcache doing it. Or is this an older document that doesn’t include the fact that memcache may be in place as per the design?

murrant · 23 May 2019 14:19

I know of installs with 30,000 devices. But yes, it takes a bit of tuning and care.

murrant · 23 May 2019 14:20

Because it doesn’t use memcache anymore, just Redis. It is totally different code. Memcache can’t easily handle what we are doing with the dispatcher service and is ill-suited for the task.

Ben_Gulledge · 2 June 2019 19:07

I am considering this resolved. I believe the error I see appears when a resource in the ecosystem (almost anywhere) gets exhausted. We upped our CPUs and tweaked MySQL, and we were able to go from 32 threads to 128. We still see the error while testing scalability but are now running 16 threads per core.