Need a little bit of design help.
Currently, I have LibreNMS installed along with MariaDB Server, Nginx, and Oxidized, all on one VM with 16 cores and 32 GB of RAM. When we first designed it, it was supposed to be for our Cisco network devices only. We later added our firewalls, and recently added all of our servers, VMware hosts, and the like. Those systems were not part of the original design, as we were planning on a managed service migration that never happened.
Long story short, we were originally at about 325 network devices. That has grown to over 500 in about a year's time. Once we added all of our servers and VMware hosts, we quickly jumped to about 750 host entries and about 15,000 interfaces. We have a couple of projects where I can see us adding another 50 to 100 by the end of 2020.
Our network is spread out mainly across the East Coast. We have a large presence in Baltimore, DC, several in Northern Virginia, New York City, and Orlando. We have smaller presences in Atlanta and Las Vegas, with two more POPs planned, and a number of customer sites where we monitor routers and/or switches for circuit handoffs. 90% of our network is connected via private leased circuits, while one full project (about 10%) is connected via FlexVPN.
As of right now, the current box works reasonably well. The main issue we have is constant erroneous node-down alerts from both ICMP and SNMP. I have also noticed some irregularities with pseudowire and BGP reports that I am able to work around. After talking to a number of people in the Discord, we started to plan out a replacement system that could scale a little better. We took the advice of the "scaling" document on the LibreNMS website and planned out the following.
Standalone SQL Server. 8-16 cores. 32-64 GB of RAM.
Standalone Web Server. 8 cores. 16-32 GB of RAM.
Standalone API-only Web Server. 4 cores. 8-16 GB of RAM. (This would also be used for Oxidized.)
Then, one standalone poller in each datacenter / specific large-scale project (maybe at a limit of 100-200 hosts per poller).
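To put rough numbers on the poller side of that plan, here is a back-of-the-envelope sketch in Python. It only does the arithmetic on figures from this post (about 750 hosts today, up to 100 more by the end of 2020, and a cap of 100-200 hosts per poller). The function name `pollers_needed` is made up for illustration, and it assumes hosts could be spread evenly across pollers, which will not hold exactly since pollers are tied to sites.

```python
import math


def pollers_needed(hosts: int, hosts_per_poller: int) -> int:
    """Minimum number of pollers if hosts were spread evenly (an assumption)."""
    return math.ceil(hosts / hosts_per_poller)


current_hosts = 750                    # host entries today, per this post
projected_hosts = current_hosts + 100  # upper end of the expected growth

# Compare both ends of the proposed 100-200 hosts-per-poller limit.
for limit in (100, 200):
    now = pollers_needed(current_hosts, limit)
    later = pollers_needed(projected_hosts, limit)
    print(f"limit {limit}/poller: {now} pollers now, {later} after growth")
```

Since the plan is one poller per datacenter or large project, the real count is driven by the number of sites as much as by these totals. The sketch is mainly a check that no single site's poller would need to exceed the per-poller limit.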
We had originally discussed a standalone RRDCached server, as the scaling document suggested, but ultimately decided to punt on that idea in favor of putting it on the web server. We are still open to having this as a standalone server as the document lays out, but understand that it could clearly be overkill.
When we built our original box we kept thinking it would never get "huge", and a year later we were all proven wrong. Now that we have determined this is going to be a huge part of our business, we want to build something this time around that will scale well and not lead us into performance issues a year down the road. I feel like this would also be a good chance to fix any mistakes we made during the original install.
Please bear with me for any stupid questions I may have along the way. I am a network engineer by trade and this project fell to me. If it isn't Cisco, sometimes I get myself spun up. Any bit of $0.02 on how this could be best laid out would be appreciated. Many thanks in advance to anyone willing to help. Cheers!
-john