Smokeping failures on reload

I have smokeping integrated with LibreNMS running in the packer VM and there is a cron task to regenerate smokeping configs and reload the systemd service.

root@librenms:~# cat /etc/cron.hourly/librenms-smokeping
#! /usr/bin/env bash

sudo -u librenms /opt/librenms/lnms smokeping:generate --targets > /etc/smokeping/config.d/librenms-targets.conf
sudo -u librenms /opt/librenms/lnms smokeping:generate --probes  > /etc/smokeping/config.d/librenms-probes.conf

systemctl reload smokeping > /dev/null 2>&1

I found that occasionally, when the cron task fires, the parent process (864410 below) exits uncleanly. The systemd unit then goes into a failed state, and every subsequent hourly reload fails because the unit is inactive.

Mar 03 22:17:01 librenms systemd[1]: Reloading Latency Logging and Graphing System.
Mar 03 22:17:01 librenms systemd[1]: Reloaded Latency Logging and Graphing System.
Mar 03 22:17:01 librenms smokeping[864410]: Reloading configuration.
Mar 03 22:17:02 librenms smokeping[1609673]: Got HUP signal, exiting gracefully.
Mar 03 22:17:02 librenms smokeping[1609673]: Exiting due to HUP signal.
Mar 03 22:17:02 librenms smokeping[1609674]: Got HUP signal, exiting gracefully.
Mar 03 22:17:02 librenms smokeping[1609674]: Exiting due to HUP signal.
Mar 03 22:17:02 librenms smokeping[864410]: Waiting for child processes to terminate.
Mar 03 22:17:02 librenms smokeping[864410]: Child processes terminated, restarting with new configuration.
Mar 03 22:17:02 librenms smokeping[864410]: Entering multiprocess mode.
Mar 03 22:17:02 librenms smokeping[864410]: No targets defined for probe FPing6, skipping.
Mar 03 22:17:02 librenms smokeping[864410]: No targets defined for probe lnmsFPing6-0, skipping.
Mar 03 22:17:02 librenms smokeping[864410]: No targets defined for probe FPing, skipping.
Mar 03 22:17:02 librenms smokeping[864410]: Child process 1615858 started for probe lnmsFPing-0.
Mar 03 22:17:02 librenms smokeping[864410]: Child process 1615859 started for probe lnmsFPing-1.
Mar 03 22:17:02 librenms smokeping[864410]: No targets defined for probe lnmsFPing6-1, skipping.
Mar 03 22:17:02 librenms smokeping[864410]: All probe processes started successfully.
Mar 03 22:17:02 librenms smokeping[1615859]: lnmsFPing-1: probing 4 targets with step 300 s and offset 191 s.
Mar 03 22:17:02 librenms smokeping[1615858]: lnmsFPing-0: probing 5 targets with step 300 s and offset 169 s.
Mar 03 23:17:02 librenms systemd[1]: Reloading Latency Logging and Graphing System.
Mar 03 23:17:02 librenms smokeping[864410]: Reloading configuration.
Mar 03 23:17:02 librenms systemd[1]: Reloaded Latency Logging and Graphing System.
Mar 03 23:17:03 librenms smokeping[1615858]: Got HUP signal, exiting gracefully.
Mar 03 23:17:03 librenms smokeping[1615858]: Exiting due to HUP signal.
Mar 03 23:17:03 librenms smokeping[1615859]: Got HUP signal, exiting gracefully.
Mar 03 23:17:03 librenms smokeping[1615859]: Exiting due to HUP signal.
Mar 03 23:17:03 librenms smokeping[864410]: Waiting for child processes to terminate.
Mar 03 23:17:03 librenms smokeping[864410]: Can't call method "step" on an undefined value at /usr/share/perl5/Smokeping.pm line 4406.
Mar 03 23:17:03 librenms systemd[1]: smokeping.service: Main process exited, code=exited, status=1/FAILURE
Mar 03 23:17:03 librenms systemd[1]: smokeping.service: Failed with result 'exit-code'.
Mar 04 00:17:02 librenms systemd[1]: smokeping.service: Unit cannot be reloaded because it is inactive.
Mar 04 01:17:01 librenms systemd[1]: smokeping.service: Unit cannot be reloaded because it is inactive.
Mar 04 02:17:01 librenms systemd[1]: smokeping.service: Unit cannot be reloaded because it is inactive.
Mar 04 03:17:01 librenms systemd[1]: smokeping.service: Unit cannot be reloaded because it is inactive.

To remedy this, I modified the script to run systemctl restart if the systemctl reload fails:

(systemctl reload smokeping || systemctl restart smokeping) > /dev/null 2>&1

I wasn’t sure whether this should be raised as an issue, but hopefully it will help someone in the future.

Thanks for posting this. I was experiencing the same thing: it runs fine for a random amount of time, then stops and isn't heard from again until someone intervenes manually. This certainly helped me!

I had exactly the same error. This is a noob question, but where exactly do I put the line to restart smokeping automatically?

Thanks in advance

Change the last line of /etc/cron.hourly/librenms-smokeping (the systemctl reload line) to the version with the restart fallback from the first post.

So, this is a bug in the perl module, but I’m not exactly sure why (as I only know enough perl to be mildly dangerous).
What’s happening is that the smokeping daemon crashes while trying to kill off the child probe process(es). Smokeping is fully dead at that point. An hour later cron runs the script again, and systemctl reload fails because the unit is no longer active.

Could probably change
(systemctl reload smokeping || systemctl restart smokeping)
into
systemctl reload-or-restart smokeping
which should effectively do the same thing. You’re still going to have an hour of missing ping info though, since the crash doesn’t get “resolved” until the next time cron fires the script (at which point it will try to reload, see the systemd unit is “inactive” and then restart instead).

[edit] this would probably work slightly better:

systemctl reload smokeping > /dev/null 2>&1
(systemctl is-active --quiet smokeping || systemctl start smokeping) > /dev/null 2>&1

The first line still tries to reload as normal.
The second line checks whether the unit is active. If it is, is-active returns zero and nothing else happens; if it isn’t, is-active returns non-zero, the || kicks in, and systemctl start is attempted (which should handle the case where the daemon crashes during reload).
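
One caveat, going by the journal output in the first post: systemd logs "Reloaded" at 23:17:02 but the daemon only dies at 23:17:03, so an is-active check that runs immediately after the reload may still see the unit as active. A minimal sketch that waits a little before checking could look like the following. The function name reload_and_verify_smokeping and the 5 second default are assumptions for illustration, not anything from smokeping or LibreNMS, and the right delay is untested.

```shell
# Hypothetical helper for /etc/cron.hourly/librenms-smokeping.
# $1 = seconds to wait before checking the unit (assumed default: 5).
reload_and_verify_smokeping() {
    local settle_seconds="${1:-5}"

    # Ask the daemon to reload. A failure here is ignored on purpose,
    # because the is-active check below covers it.
    systemctl reload smokeping > /dev/null 2>&1

    # Give a crash during the reload time to show up in the unit state.
    sleep "$settle_seconds"

    # is-active --quiet exits 0 only when the unit is active;
    # otherwise fall through to starting it again.
    systemctl is-active --quiet smokeping || systemctl start smokeping > /dev/null 2>&1
}
```

Calling reload_and_verify_smokeping as the last line of the cron script would replace the two lines above. If the wait is too short, the crash is still only picked up on the next hourly run.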

Again, this is all semi-bad if you have manually taken the daemon offline to do some sort of work; the hourly script will keep restarting it. You’d need to temporarily mask the unit so it can’t be restarted by cron.
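
For that maintenance case, a rough sketch of wrapping the work in a mask/unmask pair. The two function names are hypothetical; stop, mask, unmask and start are standard systemctl verbs. While the unit is masked, systemctl start refuses to run it, so the hourly script can't bring it back.

```shell
# Hypothetical helpers, meant to be run as root around manual maintenance.

# Stop smokeping and mask the unit so nothing can start it.
smokeping_maintenance_begin() {
    systemctl stop smokeping && systemctl mask smokeping
}

# Unmask the unit and bring smokeping back up.
smokeping_maintenance_end() {
    systemctl unmask smokeping && systemctl start smokeping
}
```

Run smokeping_maintenance_begin before the work and smokeping_maintenance_end afterwards. The hourly job will still log a failed reload or start attempt while the unit is masked.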

Would be nice if someone with more perl experience could figure out why the crash is actually happening and the root cause could be fixed…
