Hello!
Starting from previous week (around Tuesday evening and Wednesday night UTC) poller service has started to behave in a strange way - it does not close the connection to the MySQL database upon finishing the polling job and even if MySQL connection session is dropped due to timeout, poller service process is still not closed/finished. Here is what we had found so far:
- Polling process finishes successfully. This can be seen in debug output, where polling stats are listed in the end.
- Polling process remains with one open connection to the database with “Sleep” command, for example:
+---------+--------+-----------------------------------------+----------+---------+------+-------+------+
| ID | USER | HOST | DB | COMMAND | TIME | STATE | INFO |
+---------+--------+-----------------------------------------+----------+---------+------+-------+------+
| 18901 | poller | <poller hostname>:46948 | librenms | Sleep | 3541 | | NULL |
+---------+--------+-----------------------------------------+----------+---------+------+-------+------+
- As long as this session remains in “Sleep”, the “devices.last_polled” field is not updated - therefore, a device remains marked as unpolled.
- Extremely confusing is the fact that this behavior is suspect to completely random devices on different pollers. Among affected devices there are Ubiquiti Unifi AC-LR, Linux OS hosts, HP 2530-48G switches. And again, such devices are polled on different pollers, so there is no real correlation. Only certain hosts are affected - meaning, this error affects only certain ~20 hosts and not the others.
- We did not mention any resource consumption anomalies on both LibreNMS central node and poller nodes - CPU, RAM, Inodes are not exceeding the respective thresholds of 70%, 80% and 50%.
- Once the the MySQL session is dropped as per wait_timeout setting, the polling process is closed in majority of cases and “devices.last_polled” field is updated. The process can remain in S state (interruptible sleep) forever until it is killed. For example:
librenms 3202 0.0 0.0 4336 760 ? S Sep09 0:00 /bin/sh -c /usr/bin/env php /opt/librenms/poller.php -h 1246 >> /dev/null 2>&1
librenms 3203 0.0 0.4 294092 35088 ? S Sep09 0:00 php /opt/librenms/poller.php -h 1246
librenms 4600 0.0 0.0 4336 760 ? S Sep09 0:00 /bin/sh -c /usr/bin/env php /opt/librenms/poller.php -h 1246 >> /dev/null 2>&1
librenms 4601 0.0 0.4 294092 35164 ? S Sep09 0:00 php /opt/librenms/poller.php -h 1246
librenms 7917 0.0 0.0 4336 760 ? S 08:05 0:00 /bin/sh -c /usr/bin/env php /opt/librenms/poller.php -h 1246 >> /dev/null 2>&1
librenms 7918 0.1 0.4 294092 35132 ? S 08:05 0:00 php /opt/librenms/poller.php -h 1246
librenms 11867 0.0 0.0 4336 804 ? S 01:24 0:00 /bin/sh -c /usr/bin/env php /opt/librenms/poller.php -h 1246 >> /dev/null 2>&1
librenms 11868 0.0 0.4 294092 35008 ? S 01:24 0:00 php /opt/librenms/poller.php -h 1246
librenms 14928 0.0 0.0 4336 756 ? S Sep09 0:00 /bin/sh -c /usr/bin/env php /opt/librenms/poller.php -h 1246 >> /dev/null 2>&1
librenms 14930 0.0 0.4 294092 35016 ? S Sep09 0:00 php /opt/librenms/poller.php -h 1246
librenms 18001 0.0 0.0 4336 720 ? S 06:24 0:00 /bin/sh -c /usr/bin/env php /opt/librenms/poller.php -h 1246 >> /dev/null 2>&1
librenms 18002 0.0 0.4 294092 35212 ? S 06:24 0:00 php /opt/librenms/poller.php -h 1246
librenms 20896 0.0 0.0 4336 804 ? S 03:04 0:00 /bin/sh -c /usr/bin/env php /opt/librenms/poller.php -h 1246 >> /dev/null 2>&1
librenms 20897 0.0 0.4 294092 35052 ? S 03:04 0:00 php /opt/librenms/poller.php -h 1246
librenms 23061 0.0 0.0 4336 708 ? S 04:44 0:00 /bin/sh -c /usr/bin/env php /opt/librenms/poller.php -h 1246 >> /dev/null 2>&1
librenms 23062 0.0 0.4 294092 35472 ? S 04:44 0:00 php /opt/librenms/poller.php -h 1246
librenms 26494 0.0 0.0 4336 700 ? S Sep09 0:00 /bin/sh -c /usr/bin/env php /opt/librenms/poller.php -h 1246 >> /dev/null 2>&1
librenms 26495 0.0 0.4 294092 35024 ? S Sep09 0:00 php /opt/librenms/poller.php -h 1246
librenms 31979 0.0 0.0 4336 796 ? S Sep09 0:00 /bin/sh -c /usr/bin/env php /opt/librenms/poller.php -h 1246 >> /dev/null 2>&1
librenms 31980 0.0 0.4 294092 35008 ? S Sep09 0:00 php /opt/librenms/poller.php -h 1246
- There are two type of errors in MySQL log, they are popping out quite frequently:
2018-09-10T08:21:52.784036Z 1310707 [Note] Aborted connection 1310707 to db: 'librenms' user: 'poller' host: '<poller IP address>' (Got timeout reading communication packets)
2018-09-10T08:22:20.699118Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 13673ms. The settings might not be optimal. (flushed=292 and evicted=0, during the time.)
- We could not find anything suspicious entries in LibreNMS logs on both LibreNMS central node and poller nodes.
Here is another inspection of poller via lsof:
Before MySQL connection is killed:
php 3203 librenms 3u unix 0xffff8800d6837840 0t0 800698324 socket
php 3203 librenms 4u IPv4 800698332 0t0 TCP 192.168.11.9:60295->nms-central:mysql (ESTABLISHED)
php 3203 librenms 5u IPv4 800696730 0t0 TCP 192.168.11.9:60296->nms-central:mysql (ESTABLISHED)
php 3203 librenms 6u IPv4 800696731 0t0 TCP 192.168.11.9:60297->nms-central:mysql (ESTABLISHED)
php 3203 librenms 7u IPv4 800698376 0t0 TCP 192.168.11.9:60298->nms-central:mysql (ESTABLISHED)
php 3203 librenms 8u IPv4 800696732 0t0 TCP 192.168.11.9:60299->nms-central:mysql (ESTABLISHED)
php 3203 librenms 9u sock 0,7 0t0 800698377 can't identify protocol
php 3203 librenms 10u IPv4 800696733 0t0 TCP 192.168.11.9:60301->nms-central:mysql (CLOSE_WAIT)
php 3203 librenms 11u sock 0,7 0t0 800698378 can't identify protocol
php 3203 librenms 12u IPv4 800698379 0t0 TCP 192.168.11.9:60303->nms-central:mysql (ESTABLISHED)
php 3203 librenms 13u IPv4 800698380 0t0 TCP 192.168.11.9:60304->nms-central:mysql (ESTABLISHED)
php 3203 librenms 14u IPv4 800698381 0t0 TCP 192.168.11.9:60305->nms-central:mysql (ESTABLISHED)
php 3203 librenms 15u sock 0,7 0t0 800698382 can't identify protocol
php 3203 librenms 16u sock 0,7 0t0 800698383 can't identify protocol
php 3203 librenms 17u sock 0,7 0t0 800696734 can't identify protocol
php 3203 librenms 18u sock 0,7 0t0 800698384 can't identify protocol
php 3203 librenms 19u IPv4 800696735 0t0 TCP 192.168.11.9:60310->nms-central:mysql (ESTABLISHED)
php 3203 librenms 20u IPv4 800698385 0t0 TCP 192.168.11.9:60311->nms-central:mysql (ESTABLISHED)
php 3203 librenms 21r CHR 1,9 0t0 14 /dev/urandom
php 3203 librenms 22u sock 0,7 0t0 801505213 can't identify protocol
After MySQL connection is killed:
php 3203 librenms 4u IPv4 800698332 0t0 TCP 192.168.11.9:60295->nms-central:mysql (ESTABLISHED)
php 3203 librenms 5u IPv4 800696730 0t0 TCP 192.168.11.9:60296->nms-central:mysql (ESTABLISHED)
php 3203 librenms 6u IPv4 800696731 0t0 TCP 192.168.11.9:60297->nms-central:mysql (ESTABLISHED)
php 3203 librenms 7u IPv4 800698376 0t0 TCP 192.168.11.9:60298->nms-central:mysql (ESTABLISHED)
php 3203 librenms 8u IPv4 800696732 0t0 TCP 192.168.11.9:60299->nms-central:mysql (ESTABLISHED)
php 3203 librenms 9u sock 0,7 0t0 800698377 can't identify protocol
php 3203 librenms 10u IPv4 800696733 0t0 TCP 192.168.11.9:60301->nms-central:mysql (CLOSE_WAIT)
php 3203 librenms 11u sock 0,7 0t0 800698378 can't identify protocol
php 3203 librenms 12u IPv4 800698379 0t0 TCP 192.168.11.9:60303->nms-central:mysql (ESTABLISHED)
php 3203 librenms 13u IPv4 800698380 0t0 TCP 192.168.11.9:60304->nms-central:mysql (ESTABLISHED)
php 3203 librenms 14u IPv4 800698381 0t0 TCP 192.168.11.9:60305->nms-central:mysql (ESTABLISHED)
php 3203 librenms 15u sock 0,7 0t0 800698382 can't identify protocol
php 3203 librenms 16u sock 0,7 0t0 800698383 can't identify protocol
php 3203 librenms 17u sock 0,7 0t0 800696734 can't identify protocol
php 3203 librenms 18u sock 0,7 0t0 800698384 can't identify protocol
php 3203 librenms 19u IPv4 800696735 0t0 TCP 192.168.11.9:60310->nms-central:mysql (ESTABLISHED)
php 3203 librenms 20u IPv4 800698385 0t0 TCP 192.168.11.9:60311->nms-central:mysql (CLOSE_WAIT)
php 3203 librenms 21r CHR 1,9 0t0 14 /dev/urandom
php 3203 librenms 22u sock 0,7 0t0 801505213 can't identify protocol
Poller debug output: https://gist.github.com/angryp/e71cbe5531389ee3e757785258e888c2
Poller debug strace output: https://gist.github.com/angryp/7d69e3108c61289613447e477fe77189 (SIGINT at the end comes from me interrupting the process)
./validate output on LibreNMS central node (poller nodes run the same version of code):
====================================
Component | Version
--------- | -------
LibreNMS | 1.43-38-g9f0afa2
DB Schema | 265
PHP | 5.6.36-0+deb8u1
MySQL | 5.7.20-log
RRDTool | 1.6.0
SNMP | NET-SNMP 5.7.2.1
====================================
[OK] Composer Version: 1.7.2
[OK] Dependencies up-to-date.
[OK] Database connection successful
[FAIL] Database: extra table (plugin_config_rules)
[FAIL] We have detected that your database schema may be wrong, please report the following to us on IRC or the community site (https://t.libren.ms/5gscd):
[FIX] Run the following SQL statements to fix.
SQL Statements:
DROP TABLE `plugin_config_rules`;
[WARN] Some devices have not been polled in the last 5 minutes. You may have performance issues.
[FIX] Check your poll log and see: http://docs.librenms.org/Support/Performance/
Devices:
<list of devices>
and 94 more...
[FAIL] Some devices have not completed their polling run in 5 minutes, this will create gaps in data.
[FIX] Check your poll log and see: http://docs.librenms.org/Support/Performance/
Devices:
<list of devices>
and 387 more...
[WARN] Your local git contains modified files, this could prevent automatic updates.
[FIX] You can fix this with ./scripts/github-remove
Modified Files:
rrd/.gitignore
We are still using poller-service.py
for distributed polling and currently run updated librenms-service.py
in a test environment. Does the behavior we are experiencing is the one described in documentation as “old poller service has a serious defect under certain versions of mysql/mariadb, and may be inadvertently DoS’ing your devices”? Or is this something else? Maybe someone has experienced this previously and found a solution for this?
We did try to restart poller-service.py
, rrdcached, mysqld, nginx, php-fpm and did try to tune MySQL configuration as per mysql_tuner.pl advice with no success.
We will obviously switch over to updated librenms-service.py
(and quite possibly switching is going to be forced to carry out ASAP now) and let you know if this solves the issue.