Poller-wrapper.py race condition?


#1

Hello list,

While studying poller-wrapper.py, I came accross the following code snippet:

        if str(memc.get(master_tag)) == config['distributed_poller_name']:
            print "This system is already joined as the poller master."
            sys.exit(2)
        if memc_alive():
            if memc.get(master_tag) is None:
                print "Registered as Master"
                memc.set(master_tag, config['distributed_poller_name'], 10)
                memc.set(nodes_tag, 0, step)

This made me wonder: isn’t this a race condition? If a group of pollers start simultaniously at */5, there is a chance that multiple pollers reach “memc.get(master_tag) is None”; then use “memc.set” to set their own name? So the 2nd poller would overwrite the 1st master?
So shouldn’t this be something like:

if memc_alive():
   if not memc.add(master_tag)
       print "Someone else is the master"
       if str(memc.get(master_tag)) == config['distributed_poller_name']:
            print "Oops, it was me!"
            sys.exit(2)
       IsNode = True
       memc.incr(nodes_tag)

Or am I mistaken?


#2

Answering to myself: yes indeed, it is a race condition.

If the condition triggers, the “faulty” master will end with:

Traceback (most recent call last):
File “/opt/librenms/poller-wrapper.py”, line 383, in
memc.decr(nodes_tag)
File “/usr/lib/python2.7/site-packages/memcache.py”, line 478, in decr
return self._incrdecr(“decr”, key, delta)
File “/usr/lib/python2.7/site-packages/memcache.py”, line 491, in _incrdecr
return int(line)
ValueError: invalid literal for int() with base 10: ‘CLIENT_ERROR cannot increment or decrement non-numeric value’

… and the poller database entry will not update - although the devices will have their entries updated.

I’m in the process of rewriting the poller-wrapper in order to:

  • stop being a master/slave construction (because the “master” serves no need anymore, since the name space change for different runs)
  • use “step” for max running time - stop running after “step” time
  • also, the current sort order is “longest run” first, but IMHO this should be “oldest update first”, so I’m fixing that as well.

#3

I welcome improvements, but don’t go two far, or you’ll re-implement https://github.com/librenms/librenms/blob/master/LibreNMS/service.py

And I think it is ok for the poller to go a little over the poller time. Just so long as they don’t keep stacking up and backlog the server.