Seem to be running into scaling issues of some sort.
I have set up an environment for a company as follows:
1x main LNMS server handling only memcached, RRDCached (>1.7) and Apache: 16 cores, 32 GB memory.
1x MariaDB server (10.2.2) running ONLY MySQL: 16 cores, 16 GB memory.
7x pollers, each with 16 cores and 16 GB memory. I have set the poller threads on each to 64 in the crontab.
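For reference, the cron entry on each poller looks roughly like this (paths assume a standard /opt/librenms install; the thread count is the only thing I changed from the default):

```
# /etc/cron.d/librenms on each poller - poll every 5 minutes with 64 threads
*/5 * * * * librenms /opt/librenms/poller-wrapper.py 64 >> /dev/null 2>&1
```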
So my problem is this:
I add devices to poller_group 0 to be polled by Poller 1. This goes well; I leave the system running for roughly an hour, and the average polling time in the web UI registers as ~95 seconds for its 210-odd devices.
Now I add devices to poller_group 1 to be polled by Poller 2. This also goes well; again I leave the system running for half an hour to an hour, without any errors reported for any of the devices. Poller 1’s time goes up a bit, to about ~110 seconds, and Poller 2’s time comes in at about 80 seconds for its 190-odd devices.
Then the issues start. When I add the third poller and assign the 200 devices that need to be polled by it to poller_group 2, devices that are supposed to be polled by Poller 1 (poller_group 0) suddenly start reporting that they have not been polled in the last 15 minutes. That does not make sense, since those two pollers were running fine before Poller 3 was added. We are hitting these issues with only ~600 devices added, and in the end we need to monitor 4500.
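This is how I spot which devices are falling behind — a query against the standard LibreNMS schema, run on the MariaDB host (table and column names as they appear in my install):

```
-- devices that have not completed a poll in the last 15 minutes
SELECT device_id, hostname, poller_group, last_polled
  FROM devices
 WHERE last_polled < NOW() - INTERVAL 15 MINUTE
 ORDER BY last_polled;
```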
My question is: does this sound like the database not keeping up? Or is there something I am missing? I am running a distributed setup in my own environment, ~700 devices, 3 pollers, and all is well there, even on the old 5.5 version of MariaDB.
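To try to rule the database in or out, I have been eyeballing connection and slow-query counters on the MariaDB box, roughly like this (these are standard MySQL/MariaDB status variables, nothing LibreNMS-specific):

```
# run on the MariaDB server; -p will prompt for the password
mysql -u root -p -e "SHOW GLOBAL VARIABLES LIKE 'max_connections';"
mysql -u root -p -e "SHOW GLOBAL STATUS WHERE Variable_name IN
  ('Threads_connected','Threads_running','Max_used_connections',
   'Slow_queries','Aborted_connects');"
```

So far nothing in there has jumped out at me, which is part of why I am asking here.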
All of the virtual machines in the LNMS environment are running from the same physical server. Disk inside the machine is SSD, so I don’t think latency is an issue in the server environment.
Any ideas will be GREATLY appreciated!