Hello list,
While studying poller-wrapper.py, I came accross the following code snippet:
if str(memc.get(master_tag)) == config['distributed_poller_name']:
print "This system is already joined as the poller master."
sys.exit(2)
if memc_alive():
if memc.get(master_tag) is None:
print "Registered as Master"
memc.set(master_tag, config['distributed_poller_name'], 10)
memc.set(nodes_tag, 0, step)
This made me wonder: isn’t this a race condition? If a group of pollers start simultaniously at */5, there is a chance that multiple pollers reach “memc.get(master_tag) is None”; then use “memc.set” to set their own name? So the 2nd poller would overwrite the 1st master?
So shouldn’t this be something like:
if memc_alive():
if not memc.add(master_tag)
print "Someone else is the master"
if str(memc.get(master_tag)) == config['distributed_poller_name']:
print "Oops, it was me!"
sys.exit(2)
IsNode = True
memc.incr(nodes_tag)
Or am I mistaken?
Answering to myself: yes indeed, it is a race condition.
If the condition triggers, the “faulty” master will end with:
Traceback (most recent call last):
File “/opt/librenms/poller-wrapper.py”, line 383, in
memc.decr(nodes_tag)
File “/usr/lib/python2.7/site-packages/memcache.py”, line 478, in decr
return self._incrdecr(“decr”, key, delta)
File “/usr/lib/python2.7/site-packages/memcache.py”, line 491, in _incrdecr
return int(line)
ValueError: invalid literal for int() with base 10: ‘CLIENT_ERROR cannot increment or decrement non-numeric value’
… and the poller database entry will not update - although the devices will have their entries updated.
I’m in the process of rewriting the poller-wrapper in order to:
- stop being a master/slave construction (because the “master” serves no need anymore, since the name space change for different runs)
- use “step” for max running time - stop running after “step” time
- also, the current sort order is “longest run” first, but IMHO this should be “oldest update first”, so I’m fixing that as well.
I welcome improvements, but don’t go two far, or you’ll re-implement https://github.com/librenms/librenms/blob/master/LibreNMS/service.py
And I think it is ok for the poller to go a little over the poller time. Just so long as they don’t keep stacking up and backlog the server.