Need some design help

jhartlov · 3 June 2020 23:17

Need a little bit of design help.

Currently, I have LibreNMS installed along with MariaDB Server, Nginx and Oxidized, all on one VM with 16 cores and 32GB of RAM. When we first designed it, it was supposed to be for our Cisco network devices only. We later added our firewalls, and recently added all of our servers, VMware hosts and the like. The systems were not part of the original design as we were planning on a managed service migration that never happened.

Long story short, we were originally at about 325 network devices. That has grown to over 500 in about a year's time. Once we added all of our servers and VMware hosts we quickly jumped to about 750 host entries and about 15,000 interfaces. We have a couple of projects where I can see us adding another 50 or 100 by the end of 2020.

Our network is spread out mainly across the east coast. We have a large presence in Baltimore, DC, several in Northern Virginia, New York City, and Orlando. We have smaller presences in Atlanta and Las Vegas, with two more POPs planned and a number of customer sites where we monitor routers and/or switches for circuit handoffs. 90% of our network is connected via private leased circuits, while one full project (about 10%) is connected via FlexVPN.

As of right now, the current box works fairly reasonably. The main issue we have is constant erroneous node down alerts from both ICMP and SNMP. I have also noticed some interesting irregularities with pseudowire and BGP reports that I am able to work around. After talking to a number of people in the Discord we started to plan out a replacement system that could scale a little better. We took the advice of the “scaling” document on the LibreNMS website and planned out the following.

Standalone SQL Server. 8-16 cores. 32-64GB of RAM.
Standalone Web Server. 8 cores. 16-32GB of RAM.
Standalone API only Webserver. 4 cores and 8-16GB of RAM (this would also be used for Oxidized).

Then, one standalone poller in each datacenter / specific large scale project (maybe at a limit of 100-200 hosts per poller)
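
As a rough sanity check on the 100-200 hosts per poller limit, here is a minimal Python sketch of the arithmetic. It uses the device counts quoted above (about 750 now, possibly 850 by the end of 2020). The function name `pollers_needed` is made up for illustration, and the result is only a floor: with one poller per datacenter, the number of sites may well drive the final count more than the device total does.

```python
import math


def pollers_needed(devices: int, max_per_poller: int) -> int:
    """Smallest number of pollers that keeps each one at or under the cap."""
    if devices < 0 or max_per_poller <= 0:
        raise ValueError("devices must be >= 0 and max_per_poller must be > 0")
    return math.ceil(devices / max_per_poller)


# Device counts taken from the post; the caps are the 100-200 range above.
for devices in (750, 850):
    for cap in (100, 200):
        count = pollers_needed(devices, cap)
        print(f"{devices} devices at {cap} per poller: at least {count} pollers")
```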

We had originally discussed a standalone RRDCache server as the scaling document had suggested but ultimately decided to punt that idea in favor of putting it on the web server. We are still open to having this as a standalone server as the document lays out, but understand that it could clearly be overkill.

When we built our original box we kept thinking it would never get “huge” and a year later we were all proven wrong. I feel like now that we have determined this is going to be a huge part of our business, we want to build something this time around that will scale well and not lead us to performance issues a year down the road. I feel like this also would be a good chance to fix any mistakes we made during the original install.

Please bear with me for any stupid questions I may have along the way. I am a network engineer by trade and this project fell to me. If it isn’t Cisco, sometimes I get myself spun up. Any bit of $0.02 on how this could be best laid out would be appreciated. Many thanks in advance to anyone willing to help. Cheers!
-john

Hans_Erasmus · 4 June 2020 08:33

Hi John

First off, always remember the following design principle when someone says 'it will never grow that big':
"Take whatever you plan on having in an environment, times that by 2, and then add a zero. Only then will it be ‘enough’ ".

So a couple of ideas from my side.
I have implemented LNMS to monitor an 1,800-device, ~45,000-port network, and I used the following.

Hardware
Fujitsu Primergy RX2530 M5
2 x Intel XEON 4210 (Total of 20 Cores and 40 Logical cores)
256 GB memory
SSD storage in RAID10

I know immediately people will say “this is not redundant, you only have one piece of hardware.” I know, unfortunately we only have space for one server in the hosted location, so it will have to do.
For this deployment the initial idea was to divide the devices into device groups based on pre-defined regions set up by the customer, with each poller polling only its own region. (More on that later.) So my setup looks like this:

1 x LNMS Core server with 8 cores and 32GB of RAM running:
LNMS
Apache
Memcached Server
RRDcached Server

1 x SQL server with 4 cores and 24GB of RAM
MySQL
Redis Server (For dispatcher service)

7 x Poller servers with 4 Cores and 16GB RAM each.
I don’t have an NFS server running.
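
To illustrate the region-based split that was the initial idea for this deployment, here is a minimal Python sketch that buckets devices by region and gives each region a numeric group ID that a poller could be pinned to. The hostnames, region names and both helper functions are hypothetical, and this only shows the planning step, not how LibreNMS itself stores or applies poller groups.

```python
from collections import defaultdict


def group_by_region(devices):
    """Map each region to a sorted list of the hostnames in it."""
    groups = defaultdict(list)
    for hostname, region in devices:
        groups[region].append(hostname)
    return {region: sorted(hosts) for region, hosts in sorted(groups.items())}


def assign_group_ids(regions):
    """Give every region a stable numeric ID, starting at 1, in name order."""
    return {region: idx for idx, region in enumerate(sorted(regions), start=1)}


# Hypothetical inventory: (hostname, customer-defined region)
devices = [
    ("north-core-sw1", "north"),
    ("south-edge-rtr1", "south"),
    ("north-fw1", "north"),
    ("coastal-core-sw1", "coastal"),
]

by_region = group_by_region(devices)
group_ids = assign_group_ids(by_region)

for region, hosts in by_region.items():
    print(f"group {group_ids[region]} ({region}): {', '.join(hosts)}")
```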

I ran into some scaling issues in the beginning when setting it up the standard way, so I decided to use the Dispatcher service. I know this is still an RC version, but that was my choice. I may have just been lucky, but it is working well for me so far.

From what you have said, I think having a poller in each DC makes sense (but if you can, maybe have 2, for redundancy). I had a scenario where 2 of my pollers locked up and stopped polling, but because of my setup, nothing was missed.

Hope this 2c gives you some ideas?

jhartlov · 4 June 2020 16:50

Very thankful for your detailed reply and approach. I can see from this that I may not really need a separate API server, and that Oxidized can run effectively on the web server. I hadn’t thought about the dispatcher service, but especially with what I was reading this morning it may be a good direction to run in. The only thing I may do is increase the amount of RAM in the SQL server, just because I have it to spare.

Hans_Erasmus · 5 June 2020 10:17

Yeah, the API stuff for me is running on the Apache server itself, and I basically added ALL my devices via API, and have possibly made about 6000-10000 API calls already without it giving me a second of trouble. The Dispatcher Service is a point of ongoing discussion, so I just again want to stress that it is working for me, but using it was completely my choice.
I too have a lot of RAM to spare, but I am not seeing issues so far.
One more thing I did do however, was run the mysql tuner script the docs refer to. That also helped quite a bit.
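
For anyone wanting to add devices through the API as described above, here is a minimal Python sketch of what one such call could look like. It assumes the LibreNMS add-device endpoint (`POST /api/v0/devices`, authenticated with an `X-Auth-Token` header); the server URL, token, hostname and community string are placeholders, and `build_add_device_request` is a made-up helper name. The sketch only builds the request so it can be inspected without touching a live server.

```python
import json
import urllib.request


def build_add_device_request(base_url, token, hostname, community, snmp_version="v2c"):
    """Build (but do not send) a POST request that adds one SNMP device."""
    payload = {
        "hostname": hostname,
        "version": snmp_version,
        "community": community,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v0/devices",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Auth-Token": token, "Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values; replace with your own server, token and device details.
req = build_add_device_request(
    base_url="https://librenms.example.com",
    token="YOUR_API_TOKEN",
    hostname="core-sw1.example.com",
    community="public",
)
print(req.get_method(), req.full_url)

# To actually send it, something like the following would be needed:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status, resp.read().decode("utf-8"))
```

Looping this over an inventory file is the obvious next step, ideally with a short pause between calls and a check of each response before moving on.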

A last point from me, more of a general point I want to make. There is no “one size fits all” recipe, and there are all kinds of tweaks one can do to enhance the performance of your system. All the little tweaks will amount to a large gain in the end. Luckily the software is so well written that you can get gains from various areas of tweaking; you just have to find the ones that work for you.

jhartlov · 7 July 2020 22:16

In the middle of my new install. Wish me luck. I’m gonna need it!