LibreNMS redundancy question

sneak2k2 · 27 August 2019 22:37

Team,

I have successfully deployed LibreNMS to monitor some of our more critical locations for my company. The team really likes what this tool provides and has asked me to take it past the proof-of-concept stage. With that in mind, I am curious what the best way to set this up would be.

Currently I have one server for around 120 network devices, and it is using approximately 33 GB of disk space. It has 16 GB of memory and 4 virtual CPUs (Intel Xeon Gold 2148 clocked at 2.4 GHz).
With this load, and polling intervals set to 5 minutes, htop shows the CPU going nuts, consistently between 90% and 100%. I am ignoring polling/alerting on interfaces that I don’t care about, i.e. end-user switchports. I am also concerned about redundancy: if this server goes down, we lose visibility into everything.

I was thinking of standing up another instance on a VM in another data center. I could split the polling of the 120 devices between the two data centers. We also want to add around 80 customer edge routers, so it would essentially be 200 devices split between the two instances.

What I’m not sure of is the following:

Considering the number of devices we are adding, are two instances enough to support them? Even if I put 100 on each, I’m at almost the same processing load as I am now with 118 devices, so I’m not sure it buys me much from a performance perspective. Also, I can quite possibly see us adding 2 more instances in a more secure part of the network, and those instances would report back to the server I have today. I’m thinking around 300-400 routers/switches/load balancers is the highest this could go within the next couple of years.

With that being said, if that one server goes kaboom, we lose everything. My thought is to have two full-blown head-end instances (web UI, memcache/DB, alerting), distribute the polling between the two, and receive polling data from the other 2 servers in our DMZ. I imagine I would have to turn off email alerting on the 2nd head-end instance, so as not to annoy the hell out of everyone with double alerts, and manually enable it on the secondary if the primary fails. Does this sound like a good approach, or would you make any other recommendations? I don’t want to re-invent the wheel if this has been done by folks already :)

garysteers · 29 August 2019 22:31

Hi @sneak2k2,

Assuming you have read the following:
https://docs.librenms.org/Extensions/Distributed-Poller/

This talks about running the distributed pollers etc.

Alerting-wise, you could, as you say, disable it on the backup, or even wrap it in an alive check against the primary host (a quick PHP script that reads something from the database, plus a remote check using curl, could help here).

That way if the primary server stops working then you will automatically get an alert.
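
A minimal sketch of the remote half of such an alive check, written in Python rather than PHP/curl purely for illustration. The function names and the three-failure threshold are assumptions, not LibreNMS features, and the URL you pass in would be whatever health page you expose on the primary. The idea is that the backup only takes over alerting after several consecutive failed checks, so a single network blip does not cause double alerts.

```python
import urllib.request
import urllib.error


def primary_is_alive(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the primary answers with an HTTP 2xx within the timeout.

    `url` is assumed to point at a small health page on the primary that
    reads something from the database before answering.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, malformed URL, etc.
        return False


def should_alert_from_backup(check_results, failures_needed=3):
    """Decide whether the backup should take over alerting.

    `check_results` is a list of booleans, oldest first, one per check.
    Returns True only if the last `failures_needed` checks all failed.
    """
    recent = check_results[-failures_needed:]
    return len(recent) == failures_needed and not any(recent)
```

Run from cron on the backup, something like this would record each result and act only once `should_alert_from_backup` returns True. How alerting is actually switched on is left to whatever mechanism you already use.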

Also, with the remote pollers, remember to set the group ID on them (and on the ping checks); you can always add more pollers and use the -h odd and -h even flags to share the load.
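
To illustrate how the odd/even flags share the load, here is a small Python sketch. It assumes the split is made on device ID parity; `split_odd_even` is a hypothetical helper for illustration, not part of LibreNMS.

```python
def split_odd_even(device_ids):
    """Partition device IDs by parity, mimicking an odd/even poller split."""
    odd = [d for d in device_ids if d % 2 == 1]
    even = [d for d in device_ids if d % 2 == 0]
    return {"odd": odd, "even": even}


# Example: ten devices with consecutive IDs end up five per poller.
halves = split_odd_even(range(1, 11))
```

Note that the halves are only as balanced as the IDs themselves: gaps left by deleted devices can skew the split.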

Also bear in mind access from the UI servers to the end devices.

sneak2k2 · 30 August 2019 22:10

Thx! Yes, I did read the doc. I figured there should be a way to script the alerting as you mention; I just have not explored it yet. We are going to discuss as a team soon how many servers we need. I was just hoping to come across someone who has done this and can provide any lessons learned or gotchas :). I’m leaning toward having 3 more servers: one as a backup head end, and 2 more distributed pollers for different portions of the network.

TheGreatDoc · 1 September 2019 08:55

Hi @sneak2k2

You are talking about 2 things (if I understand correctly).

Performance issues:
A single poller like the one you described is very capable of handling 200 devices.

If your CPU is at 90%-100% all the time, take a look at: https://docs.librenms.org/Support/Performance/

Especially the rrdcached part.

Redundancy/HA

You can run an almost fully HA environment. Almost, because sadly rrdtool doesn’t support HA, but you can get very close with distributed polling, external redundant storage, etc.

Hope that helps!