During one of the recent updates, it seems my distributed pollers started polling the same device multiple times. Put differently, we monitor about 1,300 devices, but each poller is trying to poll all 1,300 of them.
A few months ago, the system seemed to divvy up the workload automatically. As we added devices, we deployed new pollers with a completely standard config, all devices in distributed_poller_group zero, and it worked flawlessly. Trying to balance our load manually is going to be a headache and a half.
I believe I have checked all the obvious things: rrdcached and memcached are accessible from each of the pollers, the codebase is current and matching on all the pollers and the main server, and the app_key matches on all the pollers and the main server. The app_key was new to me and I had hoped that was the fix, but alas, it was not.
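For anyone who wants to repeat the memcached reachability check from each poller, here is a minimal sketch of one way to do it. It is not part of LibreNMS; the function name `memcached_reachable` is my own, and the default port 11211 is an assumption (the usual memcached default). It only confirms that something answering the memcached text protocol is listening, not that the pollers are actually sharing locks through it.

```python
import socket


def memcached_reachable(host, port=11211, timeout=3.0):
    """Return True if a server at host:port answers memcached's 'version' command.

    Assumption: the server speaks the memcached text protocol on this port.
    """
    try:
        # Open a TCP connection; refused connections and timeouts raise OSError.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"version\r\n")
            reply = sock.recv(64)
    except OSError:
        return False
    # A memcached server replies with a line such as "VERSION 1.6.x".
    return reply.startswith(b"VERSION")
```

Run on each poller, a call such as `memcached_reachable("memcached.example.com")` (a placeholder host name) should return the same result everywhere if all pollers can see the same memcached instance.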
Any thoughts on why the behavior changed and how to get back to the old behavior?
Thank you for any guidance!
All systems have the same validate.php output:
DB Schema | 2020_04_13_150500_add_last_error_fields_to_bgp_peers (164)
[OK] Composer Version: 1.10.6
[OK] Dependencies up-to-date.
[OK] Database connection successful
[OK] Database schema correct
[WARN] Your install is over 24 hours out of date, last update: Wed, 20 May 2020 13:25:27 +0000
Screenshot of all the pollers trying to poll each device: