Please help with fast-ping check problems

DR3EVR8u8c · 21 September 2018 06:49

Hello,
I was excited to find the Fast Ping Check extension. However, after following the instructions, fast ping doesn’t work as I expected.

Set ping_rrd_step
$config['ping_rrd_step'] = 60;

Update the rrd files
./scripts/rrdstep.php -h all

Update cron (removing any other ping.php or alert.php entries)
* * * * * librenms /opt/librenms/ping.php >> /dev/null 2>&1

My config.php also contains:

$config['fping'] = "/usr/sbin/fping";
$config['fping_options']['timeout'] = 500;
$config['fping_options']['count']   = 10;
$config['fping_options']['interval'] = 500;
$config['fping_options']['retries'] = 2;

I also set the alert delay to 3m, so I expect to receive an alert 3 minutes after the host is shut down.

But no, the device-down event is only detected during normal polling.
When I ran ping.php in debug mode, this was the result:

-bash-4.2$ ./ping.php -d -v
SQL[select `devices`.`device_id`, `hostname`, `status`, `status_reason`, `last_ping`, `last_ping_timetaken`, `max_depth` from `devices` left join `devices_attribs` on `devices`.`device_id` = `devices_attribs`.`device_id` and `devices_attribs`.`attrib_type` = ? where `disabled` = ? and (`devices_attribs`.`attrib_value` is null or `devices_attribs`.`attrib_value` != ?) order by `max_depth` asc ["override_icmp_disable",0,"true"] 0.68ms]

Tier 0 (3): 10.202.70.148, cloned-librenms01, 10.202.70.134
'fping' '-f' '-' '-e' '-t' '500' '-r' '2'
cloned-librenms01 is alive (0.13 ms)
Attempting to record data for cloned-librenms01... Deferred
10.202.70.134 is alive (1.10 ms)
Attempting to record data for 10.202.70.134... Deferred
10.202.70.148 is unreachable
Attempting to record data for 10.202.70.148... Deferred
Leftover devices, this shouldn't happen: cloned-librenms01, 10.202.70.134, 10.202.70.148
Devices left in tier:
Pinged 3 devices in 2.41s

EDIT 1:
The Deferred problem is resolved. I discovered an outdated PHP dependency; once it was updated, ping.php could record its data successfully.
However, the alert still doesn’t work correctly. Please see the LibreNMS log for your information:

EDIT 2:
I had a quick look at alerts.php and found that it calculates the duration of the incident from the time_logged column in the alert_log table. ping.php doesn’t update that table, which is why the alert isn’t fired based on the real event time. Could you please change ping.php so that it updates time_logged in the alert_log table? Should I raise this as a bug?

EDIT 3:
The previous assumption may not be correct; I don’t know much about PHP coding. But the behaviour of ping.php really doesn’t make sense to me. Similar to the scenario above, where it doesn’t start the alert delay count, ping.php also doesn’t stop the alert when the device comes back up in time. Please see the picture below as an example:

Strangely, I monitored the database and ran the rule query manually from the moment the status changed to down until it changed back to up. The query returned the correct result, as it should:

So why is the alert still triggered? This bug makes the fast-ping check totally unusable.

Please help me troubleshoot this, or at least explain what could be wrong.
Thank you very much.

Daniel_Schmidt · 21 September 2018 15:54

Admittedly, I don’t have any experience with this. However, I am curious: Your cron job runs every minute, your rrd step is 60, but your ping interval is 500. Are you sure that 500 is what you want?

DR3EVR8u8c · 21 September 2018 19:49

I followed the instructions in the LibreNMS docs, which recommend these settings. Besides, ping.php is run every minute by cron.

Daniel_Schmidt · 24 September 2018 17:36

But the interval is 500, so it would not run more often than that, no?

DR3EVR8u8c · 25 September 2018 06:11

According to the fping man page, the fping interval is in milliseconds, so interval = 500 means 500 ms.
And $config['ping_rrd_step'] = 60 is in seconds, so it means 60 s.
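To make the unit difference concrete, here is a minimal sketch that adds up how long fping could wait on one unreachable host with the timeout and retries from my config. It assumes fping's default backoff factor (-B) of 1.5 is in effect; the function name is made up for illustration and is not part of LibreNMS.

```php
<?php
// Values taken from the config.php excerpt earlier in this thread.
$timeoutMs = 500;   // fping -t: initial per-target timeout, in milliseconds
$retries   = 2;     // fping -r: retries after the first attempt
$rrdStepS  = 60;    // $config['ping_rrd_step'], in seconds

// Assumption: fping's default backoff factor (-B) of 1.5 applies,
// so each retry waits 1.5x longer than the previous attempt.
$backoff = 1.5;

/** Worst-case wait for one unreachable host, in milliseconds (hypothetical helper). */
function worstCaseWaitMs(int $timeoutMs, int $retries, float $backoff): float
{
    $total = 0.0;
    $wait = (float) $timeoutMs;
    // One initial attempt plus $retries retries.
    for ($attempt = 0; $attempt <= $retries; $attempt++) {
        $total += $wait;
        $wait *= $backoff;
    }
    return $total;
}

$worstMs = worstCaseWaitMs($timeoutMs, $retries, $backoff);
printf("Worst case per unreachable host: %.0f ms\n", $worstMs);
printf("Fits inside the %d s step: %s\n", $rrdStepS, $worstMs / 1000 < $rrdStepS ? 'yes' : 'no');
```

Under that assumption the total is 500 + 750 + 1125 = 2375 ms, which is roughly in line with the "Pinged 3 devices in 2.41s" line in my debug output and far below the 60 s step.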

Besides, I have now discovered that ping.php does update devices.status and devices.status_reason in the database, which also confirms that the settings are configured correctly.

However, for some unknown reason, alerts.php did not read the status correctly. Or does it use a cache instead? I’m not sure. Hopefully someone can help me sort this out.

murrant · 26 September 2018 13:57

Could it be that they run at the same time and don’t know that the other has updated the status?

All the code for ping.php is in app/Jobs/PingCheck.php, btw.

DR3EVR8u8c · 27 September 2018 11:29

Hi @murrant,
ping.php updated devices.status 2 minutes before the alert was triggered, and that is what is freaking me out: the alerts were still generated even though devices.status was up!? I assume that alerts.php queries the database live to get the latest information about the device. Do you know if that is right?

We have memcache and distributed pollers running on our LibreNMS system. Would that confuse alerts.php? Which script should I check for this?

Thanks,
Roger

murrant · 27 September 2018 13:07

poller.php and ping.php run alert rules and update alerts db. alerts.php checks db and issues alerts as needed.

DR3EVR8u8c · 3 October 2018 01:11

Hi @murrant,
Please forgive me if I’m wrong. I found that the RunAlerts function in alerts.inc.php, which is called by alerts.php, calculates the delay time from $alert['time_logged'], which is loaded from the alert_log.time_logged column. But I could not find where that table is updated by poller.php and ping.php. Could you please point me in the right direction?

// This is the new way
if (!empty($rextra['delay']) && (time() - strtotime($alert['time_logged']) + $config['alert']['tolerance_window']) < $rextra['delay']) {
    continue;
}
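To check my reading of that condition, I pulled it into a standalone function and ran it with fixed timestamps. This is only a sketch: the function name, the injected $now parameter, and the time_logged value are made up, and the tolerance window of 5 seconds is an assumed value for $config['alert']['tolerance_window'], not something I verified.

```php
<?php
/**
 * Mirrors the condition quoted above: returns true while the alert
 * is still held back (the "continue" branch).
 * $now is injected so the check can be run with fixed timestamps.
 */
function isStillDelayed(string $timeLogged, int $delay, int $toleranceWindow, int $now): bool
{
    if (empty($delay)) {
        return false; // no delay configured, nothing to hold back
    }
    return ($now - strtotime($timeLogged) + $toleranceWindow) < $delay;
}

$delay     = 180;                    // 3m alert delay, as in my first post
$tolerance = 5;                      // ASSUMED tolerance_window value
$logged    = '2018-10-03 01:00:00';  // hypothetical alert_log.time_logged
$base      = strtotime($logged);

foreach ([60, 174, 175, 180] as $elapsed) {
    $held = isStillDelayed($logged, $delay, $tolerance, $base + $elapsed);
    printf("%3d s after time_logged: %s\n", $elapsed, $held ? 'held back' : 'eligible to send');
}
```

With these numbers the alert is held back at 60 s and 174 s and becomes eligible from 175 s onward, so the delay is counted purely from time_logged, less the tolerance window.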

From what I observed in the LibreNMS database, ping.php does not seem to update time_logged, and that causes the problem we are facing.
Thanks for your help.
Roger

murrant · 3 October 2018 15:05

time_logged is set to the current timestamp when the alert_log entry is created. It is never updated.

DR3EVR8u8c · 4 October 2018 02:10

Hi @murrant,
I have found the problem. PingCheck.php saves the device status after it runs the rules. Therefore, the rules cannot evaluate correctly, because the devices table hasn’t been updated yet. poller.php, by contrast, updates the device status in the database before it runs the rules.

I have fixed the problem in a nasty way by simply adding $device->save() before the RunRules call. Please see the code below for reference:

if ($device->isDirty('status')) {
    // if changed, update reason
    $device->status_reason = $device->status ? '' : 'icmp';
    $type = $device->status ? 'up' : 'down';
    log_event('Device status changed to ' . ucfirst($type) . " from icmp check.", $device->toArray(), $type);
    // save first, so the rules read the new status from the devices table
    $device->save();
    echo "Device $device->hostname changed status to $type, running alerts\n";
    RunRules($device->device_id);
}
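The ordering issue can be shown with a toy model. This is not LibreNMS code: the array stands in for the devices table, and both function names are made up. It only illustrates why a rule that reads the database sees stale data when the save happens after the rules run.

```php
<?php
/** Hypothetical rule check: reads the status from the "table", as a SQL-based rule would. */
function ruleSeesDeviceDown(array $devicesTable, int $deviceId): bool
{
    return $devicesTable[$deviceId] === 0; // 0 = down, 1 = up
}

/** One ping cycle in which the device is found down; returns what the rule saw. */
function pingCycle(array &$devicesTable, int $deviceId, bool $saveBeforeRules): bool
{
    $newStatus = 0; // in-memory result of the ICMP check

    if ($saveBeforeRules) {
        $devicesTable[$deviceId] = $newStatus;             // save first (the fix)
        return ruleSeesDeviceDown($devicesTable, $deviceId);
    }

    $seen = ruleSeesDeviceDown($devicesTable, $deviceId);  // rule reads the stale row
    $devicesTable[$deviceId] = $newStatus;                 // save afterwards
    return $seen;
}

$rulesFirst = [42 => 1]; // device 42 starts as up
$saveFirst  = [42 => 1];

var_dump(pingCycle($rulesFirst, 42, false)); // bool(false): the down state is missed
var_dump(pingCycle($saveFirst, 42, true));   // bool(true): the down state is seen
```

In both cases the table ends up with the device marked down, which matches what I saw in the database; the only difference is what the rule observed at the moment it ran.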

So far, it is doing what we want, but you may want to review it and code it better.
Thanks very much for your help.
Cheers,
Roger

murrant · 8 October 2018 18:23

That fix looks good to me. Nice work. Can you create a pull request on GitHub to get it merged upstream?