Scaling to 5000 devices

Hi everyone

I’m looking at a project to monitor 5000 or more devices. I plan to use multiple pollers, a separate DB server, Redis, RRDcached etc. I’m not clear on two things.

Firstly, I have no idea how the hardware will perform under the workload, so it’s difficult to define specs and the number of pollers, though adding pollers over time should be no issue. Having searched, the closest I can find to recommendations are these:

https://www.reddit.com/r/LibreNMS/comments/5nz09f/librenms_specspoller_performance_for_larger/dcjnab5/

https://community.librenms.org/t/can-i-use-librenms-for-more-than-20-000-equipments-how-instances-work/18166

https://docs.librenms.org/Support/Example-Hardware-Setup/#requirements-for-distributed-polling

The Reddit comment written by @laf suggests the web UI, RRDcached and DB servers should be bare metal with 32 cores, 32 GB RAM. Is there any chance VMs would cut it for the web UI/RRDcached and DB servers with this number of devices? Since more pollers can always be added I guess it doesn’t matter if they’re VMs.

The other consideration is HA. Polling isn’t a problem but there are presumably 3 other roles to consider:

Web UI/API
RRDcached
DB

MariaDB/MySQL has its own master-master replication, so we just need two or more DB hosts.
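A classic master-master pair can be sketched with a few my.cnf lines. The server IDs and values below are placeholders, and you’d still need to create replication users and configure replication on each side:

```ini
# my.cnf on host A (host B mirrors this with server_id = 2, offset = 2)
[mysqld]
server_id                = 1
log_bin                  = mysql-bin
# avoid auto-increment collisions between the two masters
auto_increment_increment = 2
auto_increment_offset    = 1
```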

The documentation says the web UI and RRDcached can each only run on one host. Is it possible to have two web UIs running independently of each other without causing issues? Presumably you’d have to turn off a load of crons on one of them and enable them manually if the other host failed.

For RRDcached, @murrant suggests it’s possible to proxy to multiple RRDcached hosts using Nginx.

Alternatively, I was thinking of an active/passive pair proxied by Nginx, with the RRDs sitting on shared storage over NFS which is replicated to a backup host (e.g. using ZFS or similar). If the active RRDcached host fails, I fire up the passive one manually and everything carries on. If the RRD storage fails, we restore the latest snapshot to something capable and carry on, or use some kind of HA clustered storage.
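As a rough sketch of that active/passive idea, Nginx’s stream module can proxy rrdcached’s TCP port with a backup upstream. The hostnames here are placeholders, and the standby would of course need to see the same (replicated) RRD files:

```nginx
stream {
    upstream rrdcached_backend {
        server rrd-primary.example.com:42217;
        # only used when the primary is unreachable
        server rrd-standby.example.com:42217 backup;
    }
    server {
        listen 42217;
        proxy_pass rrdcached_backend;
    }
}
```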

Thoughts on specs and HA?

You might need to be a bit more specific about what your goals are with the HA component.

We run two separate instances in two different locations and then use OpsGenie’s alert de-duplication to handle the duplicated alarms.

This means we can run updates/maintenance on an instance and the other instance is still operational.

We use k8s for all the pollers and UI nodes, and run bare metal for the SQL/rrdcached. The bare metal box has some crazy striped NVMe array and plenty of memory, CPU, etc.

The question of ‘how many pollers = how many devices’ is very difficult to calculate.

For us, due to geographic dispersion, some of our poller runtimes are fairly long for a single device, as there is latency on every request/response. Some of our devices are also very large (virtual chassis with lots of ports/sub-interfaces, large quantities of BGP peers, etc.).

Some of our devices are pretty simple but we have thousands of them.

We basically keep an eye on the amount of used poller seconds vs available poller seconds, and the number of devices left ‘unpolled’ in a poll run, and when this gets close to ‘full’ we add another pod to k8s.
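The “used vs available poller seconds” check above boils down to simple arithmetic; here is a hedged sketch (the helper function and the numbers are illustrative, not LibreNMS code):

```python
# Rough poller-capacity check: used vs available poller seconds per cycle.
# The helper and the example numbers are illustrative only.

def poller_capacity_used(workers: int, avg_device_seconds: float,
                         device_count: int, interval: int = 300) -> float:
    """Fraction of one polling cycle's worker-seconds that are consumed."""
    available = workers * interval              # worker-seconds per cycle
    used = avg_device_seconds * device_count    # seconds needed this cycle
    return used / available

# e.g. 96 workers polling 2000 devices averaging 10 s each:
print(f"{poller_capacity_used(96, 10.0, 2000):.0%}")  # prints "69%"
```

When this approaches 100%, or devices start going unpolled, that’s the cue to add another poller pod.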

Each poller thread consumes one persistent connection to the Redis server, so sizing Redis connections and keeping an eye on them is a good idea. The same applies, to a lesser extent, to MySQL.

We run about 96 workers per poller (I would not suggest this as a starting point). We have Redis max connections at about 12,000 but use ~7-8k, and MySQL max_connections at 4k, using about 2k.
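Translated into config, those limits would look something like the following fragments (values taken from the numbers above; tune them to your own load):

```ini
# redis.conf — raise the client limit above your expected poller thread count
maxclients 12000
```

```ini
# my.cnf ([mysqld] section)
max_connections = 4000
```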

I could be wrong, but I think there was more than a gigabit of persistent traffic to/from the DB/Redis host, so a good network might be on the list of things to have.

Probably what I will end up doing is splitting it all up again so we have a couple of instances per continent (for the ‘HA’), so we don’t have things like pollers pushing from one side of the world to the other, reducing the blast radius of failures/link issues.

We run a sidecar instance of telegraf on all the poller pods, buffering back into a central InfluxDB host with a 7-day retention, so ops teams can build pretty dashboards in Grafana specific to their needs.

In terms of pollers, one of our instances has 15 in one zone, 22 in another, 12 in another and 5 in another, so 54 pollers in total. Pods have a 2 GB memory request, 3 GB limit and a 6 CPU allocation; telegraf has a 256 MB limit, 64 MB request and a buffer size of 5000, etc.


Just double checked: network traffic for us is about 150-200 Mbps average out on a bonded set of interfaces, peaking at 480 Mbps at some stage. We run MariaDB, rrdcached, Redis and InfluxDB on this host FWIW, but it’s typically only used by LibreNMS (plus a little gNMI traffic).


Thank you for taking the time to write this fantastically detailed reply and my apologies for taking a while to come back to you.

Regarding your UI nodes, do you have more than one per instance? I was trying to figure out how to separate the UI from being the LibreNMS master node and RRD storage. I guess in your case you have your RRDs on your bare metal, presumably shared over NFS or something, and all your nodes write to its rrdcached. I guess I could put more than one UI node in a dedicated poller group which has no devices to poll.

All access to RRD data is via rrdcached (from both pollers and UI nodes), so you don’t need an NFS export or anything like that.
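In practice that means pointing every poller and UI node at the same rrdcached instance in config.php, something like the following (the hostname is a placeholder; 42217 is rrdcached’s usual TCP port):

```php
// config.php on every poller and UI node (hostname is a placeholder)
$config['rrdcached'] = 'rrd-host.example.com:42217';
// path as seen by the rrdcached host; no NFS export of this is needed
$config['rrd_dir']   = '/opt/librenms/rrd';
```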

We run two nodes in a Deployment on Kubernetes for the UI handling. They basically run nginx and php-fpm and no polling processes, so they don’t try to run any poller/discovery tasks.
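A minimal sketch of such a UI-only vhost, loosely following the usual LibreNMS nginx + php-fpm setup (hostname, paths and socket location are assumptions):

```nginx
server {
    listen 80;
    server_name librenms.example.com;
    root /opt/librenms/html;
    index index.php;

    location / {
        try_files $uri $uri/ /index.php?$query_string;
    }
    location ~ \.php$ {
        fastcgi_pass unix:/run/php-fpm.sock;
        fastcgi_split_path_info ^(.+\.php)(/.+)$;
        include fastcgi.conf;
    }
}
```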

The poller nodes just run librenms-service.py. The master node election stuff only includes nodes running librenms-service.py.

Thanks for taking the time to reply and again, my apologies for taking a while to come back.