After migrating the DB to a MariaDB Galera Cluster, which is shared by many other applications and services, we are seeing a number of timeout errors, but only from the LibreNMS application:
```
Next Doctrine\DBAL\Driver\PDO\Exception: SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction in /opt/librenms/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/PDO/Exception.php:18
Next Illuminate\Database\QueryException: SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction (SQL: update `config` set `config_value` = "fping6" where `config_id` = 796) in /opt/librenms/vendor/laravel/framework/src/Illuminate/Database/Connection.php:671
[2020-12-09 20:14:15] production.ERROR: PDOException: SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction in /opt/librenms/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/PDOStatement.php:115
```
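The traceback's own advice ("try restarting transaction") is the standard remedy for error 1205, and the exception reaching the log suggests nothing in the stack is retrying it here. As a rough illustration of that retry pattern (a toy sketch, not LibreNMS or Laravel code; the function name, attempt count, and backoff are made up), it might look like:

```python
import time

LOCK_WAIT_TIMEOUT = 1205  # MySQL/MariaDB error code from the log above

def retry_on_lock_timeout(func, attempts=3, delay=0.1):
    """Re-run func, retrying only when it fails with error code 1205."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:
            code = exc.args[0] if exc.args else None
            if code != LOCK_WAIT_TIMEOUT or attempt == attempts - 1:
                raise  # a different error, or out of retries
            time.sleep(delay * (attempt + 1))  # simple linear backoff
```

Any other error code is re-raised immediately; only 1205 triggers a retry.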
This is a distributed poller config, with:

- 1x Web/RRDCache/MEMCache Server
- 3x Poller Servers
- 2x ProxySQL Servers
- 3x MariaDB Galera Servers
Output of `./validate.php`:

| Component | Version |
|---|---|
| LibreNMS | 1.70.1-1-ga3635d0b7 |
| DB Schema | 2020_10_12_095504_mempools_add_oids (191) |
| PHP | 7.4.13 |
| Python | 3.6.8 |
| MySQL | 10.4.14-MariaDB-log |
| RRDTool | 1.7.0 |
| SNMP | NET-SNMP 5.8 |
```
[OK]   Composer Version: 2.0.8
[OK]   Dependencies up-to-date.
[OK]   Database connection successful
[OK]   Database schema correct
[INFO] Detected Python Wrapper
[OK]   Connection to memcached is ok
```
The ProxySQL servers are configured with the proper read/write split rules and Galera Cluster monitoring, which makes the primary DB node the writer and the other two DB nodes the readers. The MariaDB cluster is not very busy, but it does have other applications hitting it (Nagios, in-house developed tools, etc.). One of the in-house apps is a MUCH busier application than LibreNMS, and it is not having any timeout issues; by busier, I mean it is a Python script dumping over 10,000 rows of (simple) data every 5 minutes.
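For reference, the effect of those split rules is roughly the following (a toy sketch of the routing behavior, not ProxySQL code; the host names `db1`–`db3` are placeholders):

```python
import itertools

WRITER = "db1"                              # current Galera primary (placeholder)
READERS = itertools.cycle(["db2", "db3"])   # the two reader nodes, round-robin

def route(query: str) -> str:
    """Send SELECTs to the readers in turn; everything else goes to the writer."""
    if query.lstrip().upper().startswith("SELECT"):
        return next(READERS)
    return WRITER
```

Under rules like these, the failing `update config ...` statement from the log would always land on the primary node, so only one Galera node should ever be taking LibreNMS writes.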
LibreNMS is monitoring around 365 devices, mainly Cisco switches/routers, Palo Alto firewalls, and Linux servers. The pollers are not reporting these 1205 lock wait timeout errors; only the Web/RRDCache/MEMCache server is. Are there additional tweaks needed to make this a little more stable with the DB cluster? We have tuned the DB as much as possible, but since only LibreNMS is seeing an issue, I’m hoping someone else has an idea on the LibreNMS side.