Latest version on Ubuntu 16.04 server

I just spun up a new test instance of LibreNMS on an Ubuntu 16.04 server and noticed that, unlike the version I had running on a VM on my desktop for a couple of weeks, the instance on the server is missing information when it polls devices. In particular, CPU and memory utilization graphs are no longer being polled or shown, and the inventory page is entirely blank. Is there something I missed during setup, or is this a glitch with the newest version?

Please validate your install: run ./validate.php and post the output.
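For reference, a typical way to run it is from the install directory as the librenms user (the /opt/librenms path and the librenms user are the usual defaults; adjust if your install differs):

    cd /opt/librenms
    sudo -u librenms ./validate.php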

====================================

Component   Version
LibreNMS    1.32-94-ge968e37
DB Schema   212
PHP         7.0.22-0ubuntu0.16.04.1
MySQL       10.0.31-MariaDB-0ubuntu0.16.04.2
RRDTool     1.5.5
SNMP        NET-SNMP 5.7.3

====================================

[OK] Database connection successful
[OK] Database schema correct
[FAIL] Discovery has never run. Check the cron job.

To clarify a little more: I have a Catalyst VSS that polled in just fine with full inventory and health sensors (temperature, optical power levels/voltages), but the Catalyst 2960-X stacks and Catalyst compact switches are not having their CPU or memory utilization graphs drawn, nor are their serial number(s) and model info being populated into inventory.

You need to check cron as per the output; discovery having never run is an issue.
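For comparison, the distributed cron file (usually /etc/cron.d/librenms) normally contains entries along these lines; the minute fields and wrapper names vary between releases, so compare against the librenms.nonroot.cron file shipped with your version rather than copying this verbatim:

    # discovery of all devices every 6 hours, plus a frequent pass for newly added ones
    33 */6 * * *   librenms    /opt/librenms/discovery.php -h all >> /dev/null 2>&1
    */5 * * * *    librenms    /opt/librenms/discovery.php -h new >> /dev/null 2>&1
    # poller wrapper every 5 minutes; the trailing 16 is the thread count
    */5 * * * *    librenms    /opt/librenms/poller-wrapper.py 16 >> /dev/null 2>&1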

I did check cron, and I ran discovery shortly after running the validate script. Even after doing so, the validate script still reports a [FAIL] for discovery; however, polling now appears to be happening as it should, and all expected health/interface monitors as well as inventory info are showing up successfully.
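For reference, discovery can also be run by hand against a single device with debug output to confirm it completes, along the lines of the following (where <hostname> is a placeholder and the path/user are the usual defaults):

    sudo -u librenms /opt/librenms/discovery.php -h <hostname> -d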

The original issue has been worked out. However, are there any guidelines for sizing hardware for a VM? I have about 570 devices in one local instance with no distributed pollers and roughly 46,000 interfaces (both active and inactive) polled, and I am beginning to see a lot of gaps in graphs. Originally when I saw this issue, installing and enabling rrdcached seemed to help a bit. I just installed and enabled memcached and it doesn’t seem to do much. Currently the VM has 16GB RAM and 6 vCPUs allocated to it. ./validate.php returns OK for everything except a long list of devices that have not been polled in 5 minutes (the list changes every time I re-run it) and the error “Fail: The poller (libre-nms) has not completed within the last 5 minutes, check the cron job.” Where should I check the cron job? I checked the documentation, and my poller wrapper is currently set at 16 threads.
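For anyone following along, hooking LibreNMS up to rrdcached is a single config.php entry once the daemon is running; the socket path below is a common default and may differ on your system:

    $config['rrdcached'] = "unix:/var/run/rrdcached.sock";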

I have a similar scale to yours with a single poller. Initially I could only scrounge 8 vCPUs with 8GB RAM from our server team, and I struggled to complete a poll of the entire network within the 5-minute window.

I’ve since managed to upgrade to 32GB / 12 vCPUs and complete the polling run in a shade over 2 minutes; much better. Newer vHost hardware obviously has a positive effect.

If you can’t get more hardware, in addition to what you have already done with rrdcached, look at what is taking time on the poller modules, and disable those which are irrelevant to your setup, either per-device or per-device type (the .yaml files in /opt/librenms/includes/definitions).
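As a concrete example, modules can also be switched off globally in config.php when nothing in the environment needs them; the module names below are only illustrations, so check which ones your devices actually use:

    // in config.php - disable poller modules you don't need (names are examples)
    $config['poller_modules']['entity-physical'] = false;
    $config['poller_modules']['applications'] = false;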

HTH

Thanks for the tip; it looks like boosting the vCPU and RAM amounts helped out. Out of curiosity, how large is the disk that you have all of the graph data stored on? I had this machine set up about 3 or 4 months ago with either a 500GB or 600GB disk, and I’m about half full already.

Alan,

Are you running clean-up options? https://docs.librenms.org/#Support/Cleanup-options/

My installation has been running since ~February 2017 but I only have 70GB of disk in use. Following on from Kevin’s comments, the only change I’ve made to the purge options is to extend the authlog purge to 1 year:

$config['authlog_purge'] = 366;

I’m not purging RRD files.
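For anyone comparing notes, the other retention settings from the Cleanup-options doc follow the same pattern in config.php; the values below are illustrative rather than recommendations:

    $config['syslog_purge']      = 30;   // days of syslog entries to keep
    $config['eventlog_purge']    = 30;   // days of eventlog entries to keep
    $config['authlog_purge']     = 366;  // as above, roughly 1 year of auth history
    $config['device_perf_purge'] = 7;    // days of device ping/performance data to keep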

Kevin,
I just enabled all cleanup options in that doc except for RRD files. After the hardware additions it appeared all was running smoothly, though I’m still seeing a couple of device graphs that showed a gap this evening. What other logs should I be looking at, and/or what other optimizations should I try?

For logs, yeah, you could check the poller log and see which device is taking longest to poll, then go from there.
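If it helps, the last poll duration per device is also stored in the database, so a quick query will list the slowest devices without digging through logs (column names are as in the stock schema; double-check against your version):

    SELECT hostname, last_polled, last_polled_timetaken
    FROM devices
    ORDER BY last_polled_timetaken DESC
    LIMIT 10;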

I have a couple of devices that will spike up in polling times to ~100-150 seconds and no higher. Interestingly enough, some of the devices I am seeing gaps in the graphs on are polling just fine within 5-15 seconds. Would you say any of the devices that are going above 100 seconds could be problematic? One such device routinely ends up around the 100-second mark but has not had a single gap in its graph. If it helps at all, our environment has a lot of Catalyst 2960-X and Catalyst 3850 stacks, which consist of 3 or more switches in a stack. The 5 devices with probably the most interfaces on our campus are our core (a VSS pair of 6807-XLs and one standalone unit), an 8540 wireless controller that has ~1600 active APs on it, and our datacenter vPC pair of Nexus 5500s that have 14 FEXes dual-homed to them.

Go through the performance docs. Lots of info in there which is applicable to this scenario.

laf,

Thanks for the advice. After a large amount of tweaking, here are my findings:

  • I had the hardware on the system doubled (went from 6 vCPUs to 12, and from 8GB RAM to 24GB) – this helped at the onset, but gaps started showing up after about a week

  • I enabled rrdcached and memcached – this again seemed to alleviate things for a little while and then gaps started appearing again

  • A few days ago I disabled several devices that had polling times in the 200+ second range and also tweaked polling parameters so that not every single service is polled on every device

  • This morning I re-read the optimization guide, ran the MySQL optimization script, and added a couple of things to my my.cnf file (a quick check that the new values took effect is shown right after this list):
    [mysqld]
    innodb_flush_log_at_trx_commit = 0
    innodb_buffer_pool_size = 12000M
    innodb_buffer_pool_instances = 6
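As a sanity check, the new values can be confirmed from the MariaDB client after a restart (the buffer pool size is reported in bytes):

    SHOW VARIABLES LIKE 'innodb_buffer_pool%';
    SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';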

For the time being this seems to have eased the issues I have been having with sporadic (sometimes fairly large) gaps in a good majority of my graphs.

Aside from everything above, if I see some gaps appear again, what should I be on the lookout for?

Just the performance doc.