Rrdtool race condition

dennypage · 14 May 2020 00:08

Hello, I’m trying to find out if a bug reported a while back (six months or so) has been fixed or not. The bug pertains to failing graph images, particularly when using rrdcached. I took a quick look at the code and didn’t find an obvious fix, but it was a convoluted race condition in the core so it may have been fixed and I just don’t see it. I’ve included the original analysis below in case it triggers anyone’s memory.

Thanks,
Denny

Originally from https://github.com/librenms/docker/issues/51#issuecomment-564363192:

Okay, I think I’ve tracked it down. It appears to be a generic race condition in LibreNMS.

In the html / php code for graphs [includes/html/graphs/graph.inc.php:128 (current git pull)], we call rrdtool_graph() to create a graph file in /tmp. Immediately thereafter, we call is_file() to confirm that the file we requested exists. This call is occasionally failing because the file does not exist (yet).

Why does this happen?

In function rrdtool_graph() [includes/rrdtool.inc.php:100], we call rrdtool_build_command() to build the rrd command, and then invoke $rrd_sync_process->sendCommand() to execute the command and return its output. We expect that when sendCommand() returns, the rrd command has completed, the requested graph file has been written, and that we have the full output of the rrd command.

sendCommand() [LibreNMS/Proc.php:117] uses getOutput() [LibreNMS/Proc.php:141] to read the output from the rrd graph command. getOutput() uses stream_select() to determine when bytes are available to read, followed immediately by stream_get_contents() to read the bytes. The problem is that stream_select() will return as soon as any bytes are available to be read, and when stream_get_contents() is called, it returns only the currently available bytes. Unfortunately, in the error case some initial bytes are available for read, but the graph command had not actually completed yet. We read the initially available bytes and then return. This occurs before rrdtool has actually created the graph file, which results with the failure of is_file().

Hitting this condition on a local install, with local rrd files has to be virtually impossible in normal operation. However, when talking to the rrdcached daemon, rrdtool sends a flush command prior to retrieving the data for the graph. This increases the possibility that there will be a partial read ever so slightly. Even so, hitting the condition with a local unix domain socket would be extremely rare. However, when rrdcached is being accessed via tcp, either in a different container or on a different host, the likelihood grows significantly.

In my particular case, a 15ms sleep prior to the is_file() call prevents the issue from being seen. This obviously isn’t the correct solution. The correct solution is to ensure that rrdtool_graph() doesn’t return until the rrdtool command has actually finished operation, but I would leave this to the LibreNMS team to determine how to best implement.

dennypage · 14 May 2020 17:22

I think I see why nobody from the LibreNMS team has picked this up. I went and looked at the old Observium code that LibreNMS was forked from. The bug predates the fork. Wow.

murrant · 9 April 2021 00:35

You probably know more about this than anyone else.

Is there a way to avoid writing the graph to a file? Seems like a dumb step.

murrant · 9 April 2021 04:02

Well, you made me ask an interesting question: https://github.com/librenms/librenms/pull/12735

dennypage · 9 April 2021 06:54

Hey @murrant, It’s been a while since I did the analysis… I appreciate you looking into it and I will take a look at it again when I can, but it will take a bit of time to get back to the level of understanding I had before.