Poller process hangs

Hello!

Starting from previous week (around Tuesday evening and Wednesday night UTC) poller service has started to behave in a strange way - it does not close the connection to the MySQL database upon finishing the polling job and even if MySQL connection session is dropped due to timeout, poller service process is still not closed/finished. Here is what we had found so far:

  1. Polling process finishes successfully. This can be seen in debug output, where polling stats are listed in the end.
  2. Polling process remains with one open connection to the database with “Sleep” command, for example:
+---------+--------+-----------------------------------------+----------+---------+------+-------+------+
| ID      | USER   | HOST                                    | DB       | COMMAND | TIME | STATE | INFO |
+---------+--------+-----------------------------------------+----------+---------+------+-------+------+
|   18901 | poller | <poller hostname>:46948                 | librenms | Sleep   | 3541 |       | NULL |
+---------+--------+-----------------------------------------+----------+---------+------+-------+------+
  1. As long as this session remains in “Sleep”, the “devices.last_polled” field is not updated - therefore, a device remains marked as unpolled.
  2. Extremely confusing is the fact that this behavior is suspect to completely random devices on different pollers. Among affected devices there are Ubiquiti Unifi AC-LR, Linux OS hosts, HP 2530-48G switches. And again, such devices are polled on different pollers, so there is no real correlation. Only certain hosts are affected - meaning, this error affects only certain ~20 hosts and not the others.
  3. We did not mention any resource consumption anomalies on both LibreNMS central node and poller nodes - CPU, RAM, Inodes are not exceeding the respective thresholds of 70%, 80% and 50%.
  4. Once the the MySQL session is dropped as per wait_timeout setting, the polling process is closed in majority of cases and “devices.last_polled” field is updated. The process can remain in S state (interruptible sleep) forever until it is killed. For example:
librenms  3202  0.0  0.0   4336   760 ?        S    Sep09   0:00 /bin/sh -c /usr/bin/env php /opt/librenms/poller.php -h 1246 >> /dev/null 2>&1
librenms  3203  0.0  0.4 294092 35088 ?        S    Sep09   0:00 php /opt/librenms/poller.php -h 1246
librenms  4600  0.0  0.0   4336   760 ?        S    Sep09   0:00 /bin/sh -c /usr/bin/env php /opt/librenms/poller.php -h 1246 >> /dev/null 2>&1
librenms  4601  0.0  0.4 294092 35164 ?        S    Sep09   0:00 php /opt/librenms/poller.php -h 1246
librenms  7917  0.0  0.0   4336   760 ?        S    08:05   0:00 /bin/sh -c /usr/bin/env php /opt/librenms/poller.php -h 1246 >> /dev/null 2>&1
librenms  7918  0.1  0.4 294092 35132 ?        S    08:05   0:00 php /opt/librenms/poller.php -h 1246
librenms 11867  0.0  0.0   4336   804 ?        S    01:24   0:00 /bin/sh -c /usr/bin/env php /opt/librenms/poller.php -h 1246 >> /dev/null 2>&1
librenms 11868  0.0  0.4 294092 35008 ?        S    01:24   0:00 php /opt/librenms/poller.php -h 1246
librenms 14928  0.0  0.0   4336   756 ?        S    Sep09   0:00 /bin/sh -c /usr/bin/env php /opt/librenms/poller.php -h 1246 >> /dev/null 2>&1
librenms 14930  0.0  0.4 294092 35016 ?        S    Sep09   0:00 php /opt/librenms/poller.php -h 1246
librenms 18001  0.0  0.0   4336   720 ?        S    06:24   0:00 /bin/sh -c /usr/bin/env php /opt/librenms/poller.php -h 1246 >> /dev/null 2>&1
librenms 18002  0.0  0.4 294092 35212 ?        S    06:24   0:00 php /opt/librenms/poller.php -h 1246
librenms 20896  0.0  0.0   4336   804 ?        S    03:04   0:00 /bin/sh -c /usr/bin/env php /opt/librenms/poller.php -h 1246 >> /dev/null 2>&1
librenms 20897  0.0  0.4 294092 35052 ?        S    03:04   0:00 php /opt/librenms/poller.php -h 1246
librenms 23061  0.0  0.0   4336   708 ?        S    04:44   0:00 /bin/sh -c /usr/bin/env php /opt/librenms/poller.php -h 1246 >> /dev/null 2>&1
librenms 23062  0.0  0.4 294092 35472 ?        S    04:44   0:00 php /opt/librenms/poller.php -h 1246
librenms 26494  0.0  0.0   4336   700 ?        S    Sep09   0:00 /bin/sh -c /usr/bin/env php /opt/librenms/poller.php -h 1246 >> /dev/null 2>&1
librenms 26495  0.0  0.4 294092 35024 ?        S    Sep09   0:00 php /opt/librenms/poller.php -h 1246
librenms 31979  0.0  0.0   4336   796 ?        S    Sep09   0:00 /bin/sh -c /usr/bin/env php /opt/librenms/poller.php -h 1246 >> /dev/null 2>&1
librenms 31980  0.0  0.4 294092 35008 ?        S    Sep09   0:00 php /opt/librenms/poller.php -h 1246
  1. There are two type of errors in MySQL log, they are popping out quite frequently:
2018-09-10T08:21:52.784036Z 1310707 [Note] Aborted connection 1310707 to db: 'librenms' user: 'poller' host: '<poller IP address>' (Got timeout reading communication packets)
2018-09-10T08:22:20.699118Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 13673ms. The settings might not be optimal. (flushed=292 and evicted=0, during the time.)
  1. We could not find anything suspicious entries in LibreNMS logs on both LibreNMS central node and poller nodes.

Here is another inspection of poller via lsof:
Before MySQL connection is killed:

php     3203 librenms    3u  unix 0xffff8800d6837840      0t0 800698324 socket
php     3203 librenms    4u  IPv4          800698332      0t0       TCP 192.168.11.9:60295->nms-central:mysql (ESTABLISHED)
php     3203 librenms    5u  IPv4          800696730      0t0       TCP 192.168.11.9:60296->nms-central:mysql (ESTABLISHED)
php     3203 librenms    6u  IPv4          800696731      0t0       TCP 192.168.11.9:60297->nms-central:mysql (ESTABLISHED)
php     3203 librenms    7u  IPv4          800698376      0t0       TCP 192.168.11.9:60298->nms-central:mysql (ESTABLISHED)
php     3203 librenms    8u  IPv4          800696732      0t0       TCP 192.168.11.9:60299->nms-central:mysql (ESTABLISHED)
php     3203 librenms    9u  sock                0,7      0t0 800698377 can't identify protocol
php     3203 librenms   10u  IPv4          800696733      0t0       TCP 192.168.11.9:60301->nms-central:mysql (CLOSE_WAIT)
php     3203 librenms   11u  sock                0,7      0t0 800698378 can't identify protocol
php     3203 librenms   12u  IPv4          800698379      0t0       TCP 192.168.11.9:60303->nms-central:mysql (ESTABLISHED)
php     3203 librenms   13u  IPv4          800698380      0t0       TCP 192.168.11.9:60304->nms-central:mysql (ESTABLISHED)
php     3203 librenms   14u  IPv4          800698381      0t0       TCP 192.168.11.9:60305->nms-central:mysql (ESTABLISHED)
php     3203 librenms   15u  sock                0,7      0t0 800698382 can't identify protocol
php     3203 librenms   16u  sock                0,7      0t0 800698383 can't identify protocol
php     3203 librenms   17u  sock                0,7      0t0 800696734 can't identify protocol
php     3203 librenms   18u  sock                0,7      0t0 800698384 can't identify protocol
php     3203 librenms   19u  IPv4          800696735      0t0       TCP 192.168.11.9:60310->nms-central:mysql (ESTABLISHED)
php     3203 librenms   20u  IPv4          800698385      0t0       TCP 192.168.11.9:60311->nms-central:mysql (ESTABLISHED)
php     3203 librenms   21r   CHR                1,9      0t0        14 /dev/urandom
php     3203 librenms   22u  sock                0,7      0t0 801505213 can't identify protocol

After MySQL connection is killed:

php     3203 librenms    4u  IPv4          800698332      0t0       TCP 192.168.11.9:60295->nms-central:mysql (ESTABLISHED)
php     3203 librenms    5u  IPv4          800696730      0t0       TCP 192.168.11.9:60296->nms-central:mysql (ESTABLISHED)
php     3203 librenms    6u  IPv4          800696731      0t0       TCP 192.168.11.9:60297->nms-central:mysql (ESTABLISHED)
php     3203 librenms    7u  IPv4          800698376      0t0       TCP 192.168.11.9:60298->nms-central:mysql (ESTABLISHED)
php     3203 librenms    8u  IPv4          800696732      0t0       TCP 192.168.11.9:60299->nms-central:mysql (ESTABLISHED)
php     3203 librenms    9u  sock                0,7      0t0 800698377 can't identify protocol
php     3203 librenms   10u  IPv4          800696733      0t0       TCP 192.168.11.9:60301->nms-central:mysql (CLOSE_WAIT)
php     3203 librenms   11u  sock                0,7      0t0 800698378 can't identify protocol
php     3203 librenms   12u  IPv4          800698379      0t0       TCP 192.168.11.9:60303->nms-central:mysql (ESTABLISHED)
php     3203 librenms   13u  IPv4          800698380      0t0       TCP 192.168.11.9:60304->nms-central:mysql (ESTABLISHED)
php     3203 librenms   14u  IPv4          800698381      0t0       TCP 192.168.11.9:60305->nms-central:mysql (ESTABLISHED)
php     3203 librenms   15u  sock                0,7      0t0 800698382 can't identify protocol
php     3203 librenms   16u  sock                0,7      0t0 800698383 can't identify protocol
php     3203 librenms   17u  sock                0,7      0t0 800696734 can't identify protocol
php     3203 librenms   18u  sock                0,7      0t0 800698384 can't identify protocol
php     3203 librenms   19u  IPv4          800696735      0t0       TCP 192.168.11.9:60310->nms-central:mysql (ESTABLISHED)
php     3203 librenms   20u  IPv4          800698385      0t0       TCP 192.168.11.9:60311->nms-central:mysql (CLOSE_WAIT)
php     3203 librenms   21r   CHR                1,9      0t0        14 /dev/urandom
php     3203 librenms   22u  sock                0,7      0t0 801505213 can't identify protocol

Poller debug output: https://gist.github.com/angryp/e71cbe5531389ee3e757785258e888c2
Poller debug strace output: https://gist.github.com/angryp/7d69e3108c61289613447e477fe77189 (SIGINT at the end comes from me interrupting the process)
./validate output on LibreNMS central node (poller nodes run the same version of code):

====================================
Component | Version
--------- | -------
LibreNMS  | 1.43-38-g9f0afa2
DB Schema | 265
PHP       | 5.6.36-0+deb8u1
MySQL     | 5.7.20-log
RRDTool   | 1.6.0
SNMP      | NET-SNMP 5.7.2.1
====================================

[OK]    Composer Version: 1.7.2
[OK]    Dependencies up-to-date.
[OK]    Database connection successful
[FAIL]  Database: extra table (plugin_config_rules)
[FAIL]  We have detected that your database schema may be wrong, please report the following to us on IRC or the community site (https://t.libren.ms/5gscd):
        [FIX] Run the following SQL statements to fix.
        SQL Statements:
         DROP TABLE `plugin_config_rules`;
[WARN]  Some devices have not been polled in the last 5 minutes. You may have performance issues.
        [FIX] Check your poll log and see: http://docs.librenms.org/Support/Performance/
        Devices:
         <list of devices>
          and 94 more...
[FAIL]  Some devices have not completed their polling run in 5 minutes, this will create gaps in data.
        [FIX] Check your poll log and see: http://docs.librenms.org/Support/Performance/
        Devices:
         <list of devices>
          and 387 more...
[WARN]  Your local git contains modified files, this could prevent automatic updates.
        [FIX] You can fix this with ./scripts/github-remove
        Modified Files:
         rrd/.gitignore

We are still using poller-service.py for distributed polling and currently run updated librenms-service.py in a test environment. Does the behavior we are experiencing is the one described in documentation as “old poller service has a serious defect under certain versions of mysql/mariadb, and may be inadvertently DoS’ing your devices”? Or is this something else? Maybe someone has experienced this previously and found a solution for this?
We did try to restart poller-service.py, rrdcached, mysqld, nginx, php-fpm and did try to tune MySQL configuration as per mysql_tuner.pl advice with no success.
We will obviously switch over to updated librenms-service.py (and quite possibly switching is going to be forced to carry out ASAP now) and let you know if this solves the issue.

This has been resolved by setting -B -R options in rrdcached settings.

Missed that “Permit recursive subdirectory creation in the base directory” flag. Thanks angryp.