Today I noticed that some devices have had most of their RRD traffic graphs wiped clean because their port IDs were re-enumerated to new IDs. This suspiciously correlates with the recovery from a long site outage. It’s isolated to some devices at one site; the rest of the sites and devices are not affected.
It’s caused me significant problems with the loss of historical data, but the priority for now is the fresh data, so I’m not restoring from backups (I may attempt a merge one day). It’s the first time I’ve seen it happen, so I want to see if I can find a root cause to address and help others avoid the pain.
Has anyone else experienced this, or have any clues as to what conditions may cause it?
The devices were added several months ago and have had no changes since. The system has been through various upgrades without issues, and has been on 21.8.0 since release with no other issues.
The outage is shown here on a device at the same site which kept all its graphs - recovery around 1:35AM
The start of the poller logs happens shortly after the device comes back online, as you’d expect:
A device which was unaffected at the same site shows nothing different in its logs at the start:
Across the site, about 1/3 of the devices have this issue - here’s a sample of unaffected and problematic ones (two on left, and top right).
librenms.log for two identical devices, one affected and one not, around the time of the recovery:
/opt/librenms/discovery.php 59 2021-09-02 00:33:05 - 0 devices discovered in 4.033 secs
/opt/librenms/poller.php 59 2021-09-02 00:35:06 - 1 devices polled in 4.129 secs
...
/opt/librenms/poller.php 59 2021-09-02 01:30:06 - 1 devices polled in 4.107 secs
/opt/librenms/poller.php 59 2021-09-02 01:36:47 - 1 devices polled in 105.1 secs
/opt/librenms/poller.php 59 2021-09-02 01:41:44 - 1 devices polled in 101.5 secs
/opt/librenms/discovery.php 58 2021-09-02 00:33:14 - 0 devices discovered in 4.028 secs
/opt/librenms/poller.php 58 2021-09-02 00:35:06 - 1 devices polled in 4.137 secs
...
/opt/librenms/poller.php 58 2021-09-02 01:30:06 - 1 devices polled in 4.113 secs
/opt/librenms/poller.php 58 2021-09-02 01:36:34 - 1 devices polled in 92.17 secs
/opt/librenms/poller.php 58 2021-09-02 01:41:34 - 1 devices polled in 92.37 secs
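In case it helps anyone check their own install for the same pattern, here is a rough Python sketch that scans librenms.log lines in the format shown above and flags polls that took far longer than the fastest poll seen for that device. The function name `slow_polls` and the `factor` threshold are my own inventions for illustration, not part of LibreNMS, and the regex only assumes the line layout visible in the excerpts above.

```python
import re

# Matches poller lines in the layout shown above, e.g.
# /opt/librenms/poller.php 59 2021-09-02 01:36:47 - 1 devices polled in 105.1 secs
# (assumption: other installs log the same layout)
LINE_RE = re.compile(
    r"poller\.php\s+(?P<device>\d+)\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+-\s+"
    r"\d+ devices polled in (?P<secs>[\d.]+) secs"
)


def slow_polls(lines, factor=5.0):
    """Return (device_id, timestamp, seconds) for every poll that took more
    than `factor` times the fastest poll recorded for the same device.

    `factor` is an arbitrary illustrative threshold, not a LibreNMS setting.
    """
    by_device = {}
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            by_device.setdefault(m.group("device"), []).append(
                (m.group("ts"), float(m.group("secs")))
            )

    flagged = []
    for device, polls in by_device.items():
        baseline = min(secs for _, secs in polls)  # fastest poll for this device
        for ts, secs in polls:
            if secs > factor * baseline:
                flagged.append((device, ts, secs))
    return flagged
```

Passing an open file handle for librenms.log as `lines` should work, since the function only iterates over lines. Discovery lines are ignored because the pattern only matches poller.php entries.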
====================================
Component | Version
--------- | -------
LibreNMS  | 21.8.0
DB Schema | 2021_08_04_102914_add_syslog_indexes (213)
PHP       | 7.3.30-1+ubuntu18.04.1+deb.sury.org+1
Python    | 3.6.9
MySQL     | 10.5.12-MariaDB-1:10.5.12+maria~bionic
RRDTool   | 1.7.0
SNMP      | NET-SNMP 5.7.3
====================================
[OK] Composer Version: 2.1.6
[OK] Dependencies up-to-date.
[OK] Database connection successful
[OK] Database schema correct
Has anyone seen this before, or know what may trigger this port ID change, particularly when outages are involved? Are the two related at all?