Outages page has blank outage end times

blakangel · 6 October 2020 00:36

At a certain point in the past week the outages page got messed up and devices are showing up as down when they come back up, and up when they go down - the Red/Green on the left column is wrong, the IP address link color will indicate “up” if the device is really up.

It started with a certain device that has no “end time” for it’s outage on 9/29/20. More recent outage entries also do not log end times. The outages page is also in UTC instead of my configured timezone that alerts show up in, but I can’t remember if this was always the case.

Another symptom is there is no “uptime” stat in the device page either when they get in this state and I get things like:
Downtime
50 years 291 days 12 minutes 22 seconds

There are a few entries with unix epoch time in the outages as the start time:
1970-01-01 00:00:00
(earlier today the devices with the unix epoch time were different than they are now that I just looked).

I first noticed this on the outages page but it could be a symptom of some other problem.

Anyway thanks for any suggestions with getting this back to normal.

====================================

Component	Version
LibreNMS	1.68-25-gc2db8f3c5
DB Schema	2020_09_22_172321_add_alert_log_index (174)
PHP	7.2.15-1+0~20190209065123.16+stretch~1.gbp3ad8c0
Python	3.5.3
MySQL	10.1.37-MariaDB-0+deb9u1
RRDTool	1.6.0
SNMP	NET-SNMP 5.7.3

====================================

ottorei · 10 October 2020 11:46

I am also seeing this same issue with a distributed poller setup. Some devices have multiple null-values in device_outages table and new outages are added when they are detected. However, old outages never end on these devices. I am running the latest version on all systems and validate.php shows no errors.

blakangel · 10 October 2020 16:09

In my case I can see the point at which the outages information started being recorded wrong. Do you think the solution is manually editing the table to remove all the bad entries and see if it goes back to normal?

SourceDoctor · 11 October 2020 13:36

manual edting the table should work.
I think it’s an issue which comes up with multiple distributed poller, but cannot find a reason for it.

ottorei · 12 October 2020 05:58

In my case, I do have a distributed poller setup but the problematic devices are only polled using one poller and that poller group has only one poller anyway.

ottorei · 12 October 2020 06:02

We were testing this issue last night with SourceDoctor and made an enhancement on two database queries. I could not reproduce this issue anymore with those fixes. However, you should probably remove those strange rows from the database manually or you could try something like this:

Devices that have an ongoing outage but are actually up:

select devices.device_id, devices.hostname, device_outages.going_down, device_outages.up_again
from devices
inner join device_outages on devices.device_id = device_outages.device_id
where devices.status = 1 and device_outages.up_again is null;

Deleting these rows:

delete device_outages
from devices
inner join device_outages on devices.device_id = device_outages.device_id
where devices.status = 1 and device_outages.up_again is null;

I also deleted rows with going_up = 0;

blakangel · 12 October 2020 15:09

Thanks for the queries. I’m planning on installing phpmyadmin or something to do some visual browsing. My SQL skills are very basic.

Are you seeing this same thing:
Here are the oldest rows (by date) of the outages page: https://i.imgur.com/bADuzSE.png
The hosts seem to change every time we have a device go down. One of the devices in that list is new and has never been online.

And here is where the problem started on 9/29/2020:
https://i.imgur.com/CEyStKE.png When that one device never got an end time to the outage recorded. Since then every outage entry has the wrong duration, and almost all of them lack an “End Time”

I’ll let you know what I see when I start querying that table directly. Thanks again!

blakangel · 12 October 2020 20:00

After running the above queries and deleting those rows the outages page looks much better - no rows w/ same device and NULL end times. I didn’t see any devices with going_up = 0

I also manually deleted a few rows from the device_outages table that were duplicate outages for the same device that had different outage start times. (For devices that are currently down).

Well as I was typing this another device went down for a couple minutes and back up and I have a new entry in the outages page with a NULL end time. I do have 1 minute ping polling set up per this page in the docs: https://docs.librenms.org/Extensions/Fast-Ping-Check/

Thanks again.

ottorei · 13 October 2020 12:29

You mean it works now or the null value stays there even when the device is polled again?

Are you on the latest version?

I just tried this again with the following scenario:

The device goes down due to a blocking firewall rule
NMS detects that the device is down and starts an outage. One row is inserted into the database.

MariaDB [librenms]> select * from device_outages where device_id = 2;
±----------±-----------±-----------+
| device_id | going_down | up_again |
±----------±-----------±-----------+
| 2 | 1598598633 | 1598599547 |
| 2 | 1598602274 | 1598602882 |
| 2 | 1601880077 | 1601900140 |
| 2 | 1602139576 | 1602225945 |
| 2 | 1602241278 | 1602241845 |
| 2 | 1602487878 | 1602591046 |
| 2 | 1602591092 | NULL |

I manually insert some bogus data to the outages_table
MariaDB [librenms]> insert into device_outages (device_id,going_down,up_again) values (2,1602591272,NULL);
MariaDB [librenms]> insert into device_outages (device_id,going_down,up_again) values (2,1602591292,NULL);
MariaDB [librenms]> insert into device_outages (device_id,going_down,up_again) values (2,1602591112,NULL);
Lets check the table now:
MariaDB [librenms]> select * from device_outages where device_id = 2;
±----------±-----------±-----------+
| device_id | going_down | up_again |
±----------±-----------±-----------+
| 2 | 1598598633 | 1598599547 |
| 2 | 1598602274 | 1598602882 |
| 2 | 1601880077 | 1601900140 |
| 2 | 1602139576 | 1602225945 |
| 2 | 1602241278 | 1602241845 |
| 2 | 1602487878 | 1602591046 |
| 2 | 1602591092 | NULL |
| 2 | 1602591112 | NULL |
| 2 | 1602591272 | NULL |
| 2 | 1602591292 | NULL |
±----------±-----------±-----------+
I now lift the firewall rule to allow the traffic to the device again.
NMS polls the device and marks it up.
The latest outage ends

MariaDB [librenms]> select * from device_outages where device_id = 2;
±----------±-----------±-----------+
| device_id | going_down | up_again |
±----------±-----------±-----------+
| 2 | 1598598633 | 1598599547 |
| 2 | 1598602274 | 1598602882 |
| 2 | 1601880077 | 1601900140 |
| 2 | 1602139576 | 1602225945 |
| 2 | 1602241278 | 1602241845 |
| 2 | 1602487878 | 1602591046 |
| 2 | 1602591092 | NULL |
| 2 | 1602591112 | NULL |
| 2 | 1602591272 | NULL |
| 2 | 1602591292 | 1602591903 |
±----------±-----------±-----------+

blakangel · 13 October 2020 17:31

Yes I’m still not logging the outage end times after a device get’s polled again. Since the DB cleanup a couple devices have been marked down a couple times and now have duplicate entries with no end time. As I was typing this a device that had a null end time entry from yesterday went down for about 1.5 minutes, alerted, went back up. Nothing was inserted in the device_outages table for that outage.

Yesterday a VPN flapped and 5 devices went down at once for about 1-2 minutes. There is only an outage entry for a single device that has 9 minutes 47 second duration according to outages list, but only 5 minutes between down/up according to device event log. The other 4 devices had < 1 minute downtime according to their event logs and did not generate an alert. They have NULL end time entries in outages. The Start time in Outages is also off by 5 minutes compared to device Event log.

Another thing: Looking at the outages list, I think the minimum outage that logs with an end time is around 10 minutes. If an outage is under ten minutes or so, the end time does not seem to get logged. Almost all of the older (pre-problem) outages have a duration between 9-10 minutes, when the event log shows a 5 minute down/up duration (due to the default 5 minute polling). I monitor a lot of satellite links that have occasional packet loss.

I have a feeling that this problem started when I enabled 1 minute icmp polling on 9/29 (Changing ping_rrd_step to 60 in config.php and adjusting ping.php cronjob).

I did upgrade php from 7.2 to 7.4 yesterday, upgraded debian packages, rebooted, etc.
librenms@ACON-SLC-NMS1:~$ ./validate.php

Component	Version
LibreNMS	1.68-41-ge7f2de79e
DB Schema	2020_09_24_000500_create_cache_locks_table (176)
PHP	7.4.11
Python	3.5.3
MySQL	10.1.45-MariaDB-0+deb9u1
RRDTool	1.6.0
SNMP	NET-SNMP 5.7.3

====================================

[OK] Composer Version: 1.10.15
[OK] Dependencies up-to-date.
[OK] Database connection successful
[OK] Database schema correct

ottorei · 15 October 2020 11:55

I believe it’s not compatible with fast ping yet.

SourceDoctor · 15 October 2020 12:09

with default 5 minutes poller you’ve got no problems?

blakangel · 15 October 2020 20:12

Yes before changing the polling time to 60 seconds I had no issues in the outages page. I didn’t notice the correlation at first but after manually editing the sql database for device_outages I saw that the no outage end times began right after i made the 60 second polling change.

I’ll attempt to reverse the 60 second polling settings and let you know how it goes. I get new rows without end times all the time in the outage page (I poll a lot of lossy satellite modems) so I should be able to see immediately if that fixes the issue.

fuzzbawl · 29 June 2021 16:19

To add a data point here, we are having the same issue and we are also using fast polling of every 60 seconds. I did notice the “start time” appears to be in GMT for the outage, but the Alerts section shows it correctly in localtime.

blakangel · 30 June 2021 18:39

Yeah this ended up making the outages page not useful for us and we ignore it. I do run the below query on occasion to clean out the bad entries in Outages table. I currently have 1975 entries in outages table without end times.

I just changed the polling time to 120 seconds and am running
./scripts/rrdstep.php -h all
to rebuild rrd files with 120 second polling. I’ll see if that makes any differences in the outage page.

fuzzbawl · 21 July 2021 16:35

Just curious what your findings were with that change.

blakangel · 21 July 2021 20:40

No change. I still have a lot of outages with blank end times showing up.

system · 21 July 2023 20:41

This topic was automatically closed 730 days after the last reply. New replies are no longer allowed.