Currently we monitor around 1,500 active devices with a distributed setup, and it has been running flawlessly for almost a year!
Here is our system diagram:
Storage server running
rrdcached and memcached
DB server running
MariaDB 10.1.44 with Galera Cluster
Load balancer using
HAProxy (single-write, multi-read setup)
Our distributed pollers are running healthy.
Here is the validate output from one of the pollers:
The problem started on 05 February 2020: several ports with traffic over 4G stopped graphing, especially on Cisco ASR9K devices.
Here is an example:
I've tried deleting the RRD file, and it graphs again above 4G, but the history is lost (of course).
Have you looked here?
Look at RRD tune and spikes.
Thanks for your reply.
RRDtool tune is enabled globally in our LibreNMS.
I've tried the CLI as well,
waited for 10 minutes, and it's still the same.
Found an interesting case:
I changed the device folder and RRD files to permission 777, and the graphs started working again.
Yes, I know it's dangerous. Maybe a bug in LibreNMS?
I suspect it's to do with Counter32 overflowing.
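For anyone wondering why the cutoff sits right at 4G, here is a quick back-of-the-envelope in Python (just arithmetic, nothing LibreNMS-specific):

```python
# A Counter32 wraps back to 0 after 2**32 octets (~4.29 GB). If it wraps
# more than once inside a polling interval, the poller cannot distinguish
# a wrap from a counter reset, so the sample is unusable and the graph gaps.
WRAP = 2**32  # 4294967296 octets

def time_to_wrap(rate_bps: float) -> float:
    """Seconds for an octet Counter32 to wrap at a sustained bit rate."""
    return WRAP * 8 / rate_bps

# At 10 Gbit/s the counter wraps roughly every 3.4 seconds:
print(round(time_to_wrap(10e9), 1))      # → 3.4
# The rate at which it wraps exactly once per 300 s poll cycle:
print(round(WRAP * 8 / 300 / 1e6, 1))    # → 114.5 (Mbit/s)
```

So anything faster than roughly 114 Mbit/s needs 64-bit counters (or a shorter polling interval) to be safe.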
See my fix here:
I have managed to fix this using an SNMPd config on my Linux VMs.
I basically exclude the counter32 values for the interface MIBs.
Example from snmpd.conf
## LibreNMS MIBs (restricted for faster polling)
# Inc. System Info (name/loc) OIDs
view libre-mibs included .1.3.6.1.2.1.1
# Include Interface MIBs
view libre-mibs included .1.3.6.1.2.1.2
# Include Interface MIBs w/ 64 bit counters for high traffic…
And for more debug info / how I found this issue
Make sure LibreNMS is set to poll your device with SNMP v2c (or v3), not SNMP v1. SNMP v1 is limited to 32-bit counters only, and they roll over at 4G.
Any issues with permissions on the RRDs according to validate.php? Any change on the shared volume of the RRDs between the pollers and the web server? This really looks like an issue on the storage side; we haven't had any reports of RRD issues recently.
Hi all, sorry for the very late reply.
It was solved after running chmod 777 on all RRDs and running tune_port for all ports; after that, I changed all the RRD permissions back.
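For others landing here: rather than leaving anything at 777, the usual cleanup is to restore ownership and group-writable permissions. A sketch, assuming the standard librenms user/group and the /opt/librenms/rrd path (both assumptions, adjust to your install):

```shell
# Hypothetical user/group and path - adjust to your install.
RRD_DIR=/opt/librenms/rrd
chown -R librenms:librenms "$RRD_DIR"
# Directories need the execute bit to be traversed; files only need rw.
find "$RRD_DIR" -type d -exec chmod 775 {} +
find "$RRD_DIR" -type f -name '*.rrd' -exec chmod 664 {} +
```

This keeps the RRDs writable by the poller and web server (via the shared group) without making them world-writable.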