[Resolved] Gaps In Data for VM interfaces - Tried Everything!

Hi LibreNMS users.

My Issue
I have an issue with gaps in my data for most of my VMs. I have researched this thoroughly, Googled, and read through threads here, but with no luck.

I get missing data, and when I use the realtime graphs at intervals under 15 s I see huge spikes, and negative values too.

One thing that is interesting is that no port speed is detected for these interfaces, as they are VMs. ethtool shows no port speed, which is expected, so it may be that RRDTune won’t work properly because no port speed is reported.
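To confirm what the VM actually reports over SNMP, the speed OIDs can be queried directly (a quick check with net-snmp; <vm-host> and the community string are placeholders):

# ifSpeed (bits/s) and ifHighSpeed (Mbit/s) for ifIndex 2
snmpget -v2c -c public <vm-host> IF-MIB::ifSpeed.2 IF-MIB::ifHighSpeed.2

A zero or nonsense value here would explain RRDTune having nothing to work with.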

[Screenshot: example of gaps in data for VMs]

There are no gaps whatsoever on a Juniper device, for example. (I can only post one image as a new user…)

My Goal
My aim is to have a very, very simple traffic bill which aggregates the eth0 ports of several VMs.
All I need is to accurately poll eth0 on a set of Linux VMs running Debian.

LibreNMS Debug

====================================
Component | Version
--------- | -------
LibreNMS  | 1.60-58-g8a2ce01dc
DB Schema | 2020_02_10_223323_create_alert_location_map_table (159)
PHP       | 7.2.19-0ubuntu0.18.04.1
MySQL     | 10.1.40-MariaDB-0ubuntu0.18.04.1
RRDTool   | 1.7.0
SNMP      | NET-SNMP 5.7.3
====================================

[OK]    Composer Version: 1.9.3
[OK]    Dependencies up-to-date.
[OK]    Database connection successful
[OK]    Database schema correct
[WARN]  Your install is over 24 hours out of date, last update: Mon, 24 Feb 2020 08:49:42 +0000
	[FIX]:
	Make sure your daily.sh cron is running and run ./daily.sh by hand to see if there are any errors.
[WARN]  Your local git contains modified files, this could prevent automatic updates.
	[FIX]:
	You can fix this with ./scripts/github-remove
	Modified Files:
	 bootstrap/cache/.gitignore
	 html/js/lang/de.js
	 html/js/lang/en.js
	 html/js/lang/fr.js
	 html/js/lang/ru.js
	 html/js/lang/uk.js
	 html/js/lang/zh-TW.js
	 includes/definitions/linux.yaml
	 logs/.gitignore
	 rrd/.gitignore
	 storage/app/.gitignore
	 storage/app/public/.gitignore
	 storage/debugbar/.gitignore
	 storage/framework/cache/.gitignore
	 storage/framework/cache/data/.gitignore
	  and 4 more...

The modified-files warning above is due to some changes I made to the linux.yaml definition, to remove the graphs at the top of the device page and show only the device_bits graph.

What I’ve Done To Try and Fix The Data Issue

  • Installed rrdcached (it’s working)
  • Changed from 5-minute to 1-minute polling and adjusted rrdstep to match
  • Tried tune_port, enabling RRD Tune globally and on each port (see the sketch after this list)
  • Checked logs, health, and the resources available to my LibreNMS server
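Since RRD Tune sets the DS maximum from the detected port speed, and none is reported for these VM interfaces, one workaround is to cap the RRD by hand. A minimal sketch, assuming a 1 Gbit/s ceiling (125000000 bytes/s) and the INOCTETS/OUTOCTETS DS names used by LibreNMS port RRDs; the path and port id are placeholders:

# Hypothetical manual cap when no ifSpeed is detected
rrdtool tune /opt/librenms/rrd/<vm-hostname>/port-id713.rrd \
  --maximum INOCTETS:125000000 --maximum OUTOCTETS:125000000

This only discards impossibly high rates, though; it does not explain the negative values.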

I have also turned off most discovery and polling modules for Linux servers, and even tuned the snmpd config on the VMs to expose only specific MIBs, keeping polling time down (~1.5 s per host); my total poller time is around 20 seconds.

From snmpd.conf

view 	libre-mibs 	included	.1.3.6.1.2.1.2
view 	libre-mibs 	included	.1.3.6.1.2.1.1
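For completeness, these view lines are bound to a community elsewhere in the file; a minimal sketch of that wiring, with a placeholder community string and source network:

rocommunity public 10.0.0.0/24 -V libre-mibs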

More Info

I have tried polling manually, and this is where it gets very strange: on different polling runs I see either eth0bps or eth0negative results, and the values fluctuate massively. The traffic on the VM is steady, as shown below.

I gathered this data using:
while true ; do date ; ./poller.php -h 28 | grep eth0 ; sleep 10 ; done

The results:

Wed 26 Feb 16:55:52 UTC 2020
Port eth0: eth0 (2 / #713) VLAN =  eth0bps(46.23 Mbps/249.49 Mbps)bytes(242.5 MB/1.28 GB)pkts(78.5 kpps/97.45 kpps)
Wed 26 Feb 16:56:03 UTC 2020
Port eth0: eth0 (2 / #713) VLAN =  eth0negative ifOutOctetsbps(85.8 Mbps/0 bps)bytes(122.74 MB/-2.37 MB)pkts(82.6 kpps/102.34 kpps)
Wed 26 Feb 16:56:15 UTC 2020
Port eth0: eth0 (2 / #713) VLAN =  eth0bps(35.01 Mbps/2.87 Gbps)bytes(12.52 MB/1 GB)pkts(77.45 kpps/95.71 kpps)
Wed 26 Feb 16:56:26 UTC 2020
Port eth0: eth0 (2 / #713) VLAN =  eth0negative ifOutOctetsbps(43.73 Mbps/0 bps)bytes(62.55 MB/-904.03 MB)pkts(62.33 kpps/77.81 kpps)
Wed 26 Feb 16:56:38 UTC 2020
Port eth0: eth0 (2 / #713) VLAN =  eth0negative ifOutOctetsbps(41.01 Mbps/0 bps)bytes(53.78 MB/-105.9 MB)pkts(91.58 kpps/114.34 kpps)
Wed 26 Feb 16:56:49 UTC 2020
Port eth0: eth0 (2 / #713) VLAN =  eth0negative ifOutOctetsbps(41.42 Mbps/0 bps)bytes(54.32 MB/-238.28 MB)pkts(86.55 kpps/103.01 kpps)
Wed 26 Feb 16:57:00 UTC 2020
Port eth0: eth0 (2 / #713) VLAN =  eth0negative ifOutOctetsbps(143.01 Mbps/0 bps)bytes(187.53 MB/-564.53 MB)pkts(84.39 kpps/95.57 kpps)
Wed 26 Feb 16:57:11 UTC 2020
Port eth0: eth0 (2 / #713) VLAN =  eth0bps(51.03 Mbps/2.04 Gbps)bytes(66.91 MB/2.61 GB)pkts(57.5 kpps/71.77 kpps)

And the corresponding results from the same VM, using ifstat:

  Time           eth0
HH:MM:SS   KB/s in  KB/s out
16:56:02  11642.53  346103.5
16:56:12   4582.17  341816.0
16:56:22   6712.92  360460.1
16:56:32   4597.14  339000.6
16:56:42   4516.98  332766.2
16:56:52   4904.77  320262.0
16:57:02  17968.14  304737.7
16:57:12   7309.22  308829.2

Any advice whatsoever would be appreciated.

Thanks

Manually run ./poller.php -d -h host_id -m ports and you will also see the value returned by SNMP.

If the value returned by SNMP for the interface is wrong, there is nothing we can do.

Thanks for the reply. I have attached the grepped output of the poller above. I am really not sure why I sometimes get eth0bps and other times eth0negative.

I have run the poller every 5 seconds with debug, grepped the eth0 interface in/out octets, and calculated the difference between each 5-second run for a specific VM.
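A reconstructed sketch of that loop (host id 28 as earlier; the grep pattern depends on the exact debug output format):

while true ; do
  ./poller.php -d -h 28 -m ports 2>&1 | grep -E 'if(In|Out)Octets\.2 '
  sleep 5
done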

OID			Count		Difference
ifInOctets.2	=	3451622675	
ifInOctets.2	=	3470338065	18715390
ifInOctets.2	=	3491320446	20982381
ifInOctets.2	=	3510886710	19566264
ifInOctets.2	=	3533200359	22313649
ifInOctets.2	=	3554484595	21284236
ifInOctets.2	=	3605333884	50849289
ifInOctets.2	=	3625814384	20480500
ifInOctets.2	=	3657292211	31477827
ifInOctets.2	=	3679268370	21976159
ifInOctets.2	=	3700787418	21519048
			
			
OID			Count		Difference
ifOutOctets.2	=	1777841236	
ifOutOctets.2	=	3459858673	1682017437
ifOutOctets.2	=	803618875	-2656239798
ifOutOctets.2	=	2382803055	1579184180
ifOutOctets.2	=	4028147832	1645344777
ifOutOctets.2	=	1460210559	-2567937273
ifOutOctets.2	=	3147358843	1687148284
ifOutOctets.2	=	562007710	-2585351133
ifOutOctets.2	=	3343395107	2781387397
ifOutOctets.2	=	934626935	-2408768172
ifOutOctets.2	=	2757740986	1823114051

It seems that the ifOutOctets count is pretty weird: I get negative differences between polls. Has anyone seen this before?

Strangely, I have a PRTG installation which reports this data accurately and smoothly. I am unsure what LibreNMS is doing differently here; both poll eth0 via SNMP.

  • Does anyone have any suggestions on how I can investigate why SNMP is producing this data?
  • Does the missing port speed, as these are VM interfaces, make any difference to RRDTune?

Thanks

Further info:

It seems snmpd is sending the wrong data:

iso.3.6.1.2.1.2.2.1.16.2 = Counter32: 1750347208
iso.3.6.1.2.1.2.2.1.16.2 = Counter32: 213197334
iso.3.6.1.2.1.2.2.1.16.2 = Counter32: 1881851181
iso.3.6.1.2.1.2.2.1.16.2 = Counter32: 172947614
iso.3.6.1.2.1.2.2.1.16.2 = Counter32: 2271055706
iso.3.6.1.2.1.2.2.1.16.2 = Counter32: 3092938172
iso.3.6.1.2.1.2.2.1.16.2 = Counter32: 790367169

I think it may be related to the use of a 32-bit counter. Is there a way I can force LibreNMS to use a specific OID, i.e. the 64-bit counter ifHCOutOctets? e.g.
http://oidref.com/1.3.6.1.2.1.31.1.1.1.6
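That would fit the numbers above. A Counter32 wraps at 2^32 = 4294967296, and the negative deltas in the earlier table look like wraps in disguise; for example:

# wrap-corrected delta: (new - old + 2^32) mod 2^32
-2656239798 + 4294967296 = 1638727498

which is right in line with the positive samples around it. At the ~340 MB/s of outbound traffic shown by ifstat, the counter wraps roughly every 4294967296 / 340000000 ≈ 13 seconds, faster than any polling interval can keep up with.

Also worth noting: the snmpd view earlier in this thread only includes .1.3.6.1.2.1.1 (system) and .1.3.6.1.2.1.2 (ifTable), while the 64-bit ifHC counters live in ifXTable under .1.3.6.1.2.1.31, so a view restricted like that would hide them. LibreNMS normally prefers the ifHC counters automatically when the agent exposes them over v2c/v3. A quick reachability check (placeholders as before; a "No Such Object" reply means the view is hiding it):

snmpget -v2c -c public <vm-host> IF-MIB::ifOutOctets.2 IF-MIB::ifHCOutOctets.2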

Thank you

See the Counter64 For Linux in / out thread for the fix.
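For anyone landing on this later: given the restricted snmpd view above, the likely change on the VMs is to also expose ifXTable so the 64-bit counters become visible, e.g. one extra line in snmpd.conf (an assumption based on the view config quoted earlier, not a confirmed quote from that thread):

view 	libre-mibs 	included	.1.3.6.1.2.1.31

followed by an snmpd restart and a fresh poll.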