Hi everyone
I’m looking at a project to monitor 5000 or more devices. I plan to use multiple pollers, a separate DB server, Redis, RRDcached etc. I’m not clear on two things.
Firstly, I have no idea how the hardware will perform with the workload so it’s difficult to define specs and numbers of pollers, though adding pollers over time should be no issue. Having searched, the closest I can find to recommendations are these:
https://docs.librenms.org/Support/Example-Hardware-Setup/#requirements-for-distributed-polling
The Reddit comment written by @laf suggests the web UI, RRDcached and DB servers should be bare metal with 32 cores and 32 GB RAM. Is there any chance VMs would cut it for the web UI/RRDcached and DB servers with this number of devices? Since more pollers can always be added, I guess it doesn’t matter if they’re VMs.
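For what it's worth, here is how I'd rough out the poller count. This is only back-of-envelope arithmetic, not anything from the LibreNMS docs. The 300-second interval is the standard 5-minute polling cycle; the average poll time per device, the workers per poller and the headroom factor are all made-up placeholders that would need replacing with measurements from a pilot poller.

```python
import math

def pollers_needed(devices, avg_poll_seconds, workers_per_poller,
                   interval_seconds=300, headroom=0.7):
    """Estimate how many pollers are needed to poll every device once per interval.

    All inputs other than the device count are assumptions to be measured.
    """
    # Total seconds of polling work generated in one interval.
    work = devices * avg_poll_seconds
    # Seconds of work one poller can absorb per interval, keeping some
    # spare capacity (headroom < 1.0) for slow or timing-out devices.
    capacity = workers_per_poller * interval_seconds * headroom
    return math.ceil(work / capacity)

# Hypothetical figures: 5000 devices, 10 s per device, 16 workers per poller.
print(pollers_needed(5000, 10, 16))
```

With those placeholder figures it comes out at 15 pollers, but the result swings a lot with the per-device poll time, so I'd treat it as a starting point only.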
The other consideration is HA. Polling isn’t a problem but there are presumably 3 other roles to consider:
Web UI/API
RRDcached
DB
MariaDB/MySQL has its own master-master replication so we just need two or more DB hosts.
This says the web UI and RRDcached can only run on one host. Is it possible to have two web UIs running independently of each other without causing issues? Presumably you’d have to turn off a load of crons on one of them and enable them manually if the other host failed.
For RRDcached, @murrant suggests it’s possible to proxy to multiple RRDcached hosts using Nginx.
Alternatively, I was thinking of an active/passive pair proxied by Nginx with the RRDs sitting on some shared storage over NFS which is replicated to a backup host (e.g. using ZFS or similar). If the active RRDcached host fails, I fire up the passive one manually and everything carries on. If the RRD storage fails, we restore the latest snapshot to something capable and carry on, or use some kind of HA clustered storage.
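To make the active/passive idea concrete, this is a minimal sketch of the check I'd run before failing over by hand. It only tests whether something accepts TCP connections on the RRDcached port, so it says nothing about whether the daemon is healthy or the NFS mount is present. The hostnames are hypothetical, and 42217 is the default RRDcached network port, which may differ in a given setup.

```python
import socket

def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def pick_active(hosts, port=42217, check=is_listening):
    """Return the first host in priority order that passes the check, else None."""
    for host in hosts:
        if check(host, port):
            return host
    return None

# Demo with a stubbed check so it runs without a network:
# pretend the primary (hypothetical name) is down and the standby is up.
stub = lambda host, port: host == "rrd-b.example"
print(pick_active(["rrd-a.example", "rrd-b.example"], check=stub))
```

In practice I'd call `pick_active` with the real check and point the Nginx upstream at whichever host it returns, while making sure only one RRDcached instance ever writes to the shared RRDs at a time.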
Thoughts on specs and HA?