Cron job resource starvation (multiple poller-wrappers)

Hello list,

The distributed poller configuration for cron.d/librenms states that poller-wrapper.py should be run like this:

*/5  *    * * *   librenms    /opt/librenms/cronic /opt/librenms/poller-wrapper.py 16

(… where “16” is the configurable number of poller threads). However, there seems to be no mechanism to stop cron from starting a new job while an older poller-wrapper is still running. This results in resource starvation in our poller group: if a poller-wrapper process does not finish in time, a second poller-wrapper is started; that makes the machine even slower, so the second is even less likely to finish in time, and so on.

Is there a way to prevent this from happening?
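What I have in mind is something along the lines of a lock around the cron job, for example with flock(1) from util-linux, so that a run is simply skipped while the previous one still holds the lock (the lock file path here is just my own pick, not anything from the LibreNMS docs):

*/5  *    * * *   librenms    /usr/bin/flock -n /run/librenms-poller.lock /opt/librenms/cronic /opt/librenms/poller-wrapper.py 16

With -n, flock exits immediately instead of queueing up another run when the lock is already held.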

Well, if your pollers can’t get all the data in less than 5 minutes, you should move to a higher poller interval.

Check https://docs.librenms.org/Support/1-Minute-Polling/ but change the values to match 10 minutes (for example) and then modify cron to be */10 instead of */5.
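Roughly, that is the same pattern the 1-minute doc describes, just scaled up to 10 minutes; double-check the doc, because existing RRD files also need their step/heartbeat adjusted. In config.php (the values are examples):

$config['rrd']['step']      = 600;   // one data point every 10 minutes
$config['rrd']['heartbeat'] = 1200;  // max seconds between updates before data goes unknown

and in cron.d/librenms:

*/10 *    * * *   librenms    /opt/librenms/cronic /opt/librenms/poller-wrapper.py 16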

Hi TheGreatDoc,

Thank you for your reply. Generally, my pollers are able to fetch everything in time. However, sometimes things go wrong. Let’s say a network link becomes very slow. The result is a sort of cascading poller implosion: the poller run takes too long, a second poller run starts while the first is still running, maybe even a third; after a while the machine is starved of resources; then the other poller machines become busier (more data to poll), and so on.

So a situation where a new run never starts while another is still busy would be preferable here. Do I understand correctly that there is no such option in the poller-wrapper and/or the poller?

Why does overlapping cause an issue? How long are the old ones running? You are probably trying to fix this issue the wrong way. You need to figure out why the poller is taking so long. Even a “slow network” should not extend the time that much.

Because of the way RRD works, not starting a poll on time causes gaps in the graphs and other bad data. It is not feasible to “wait”.
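To illustrate with a generic rrdtool example (not necessarily LibreNMS’s exact data source definitions): with a 300-second step and a 600-second heartbeat, any interval that does not receive an update inside the heartbeat window is stored as UNKNOWN and drawn as a gap.

rrdtool create port-example.rrd --step 300 \
  DS:INOCTETS:DERIVE:600:0:U \
  RRA:AVERAGE:0.5:1:2016

So a late poll does not just shift a data point, it loses it.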

I don’t have enough pollers and the pollers aren’t optimized, which doesn’t help. I’m working on that (it’s a bureaucracy, so it could take a while :wink:).

Still, I notice that the pollers will eat themselves if enough “other” pollers have died.

This is a geographically distributed site that can have links go to backup status. Then even a single snmpwalk may suddenly take 15 minutes or more.

Anyway, I’m on it: I’m doing what I can to get more poller servers, I have tuned the SNMP settings, and I’m tuning the number of poller threads. But while I’m in the process of getting things right the right way, it would help to fix the resource starvation. Thanks for your help.

How many devices do you have? (I have 400 and one poller is not working hard).

You could reduce the SNMP command timeout so it stops if the data transfer takes too long.
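Assuming the usual global SNMP settings in config.php (check your version’s docs for the exact keys; the values here are only examples to tune):

$config['snmp']['timeout'] = 5;   // seconds to wait for each SNMP request
$config['snmp']['retries'] = 2;   // retries before the poller gives up on a device

Keep in mind that a lower timeout also means genuinely slow but healthy devices may start failing polls.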

Again, what resource is being starved? Processes in io_wait don’t take much resources.

About 2000 devices; the pollers are relatively low-end (virtual) machines. The network is new to us (we inherited maintenance mid-January), so I’m still finding things out. We have seen memory and CPU load go to the max; also, in a few cases, RRD on the central server was unreachable, which filled the drive space really quickly. But at the moment, most of our pollers are unreachable because they have multiple python2 poller_wrapper.py processes that, in turn, fire off several poller.php processes that run the SNMP fetching jobs.

As said, I’m in the process of getting things right. Setting the SNMP max_repeaters option, for example, helped a lot with single poller processes that took too long over slow networks.
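For reference, that is the global setting in config.php; the value is simply what worked for us and will likely need tuning per network:

$config['snmp']['max_repeaters'] = 30;  // GETBULK max-repeaters used for SNMP bulk walks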

What hardware is your poller on?

Xeon E5-2620, 8GB RAM and SSD storage