After migrating the DB to a MariaDB Galera Cluster, which is shared by many other applications and services, we are seeing a number of timeout errors, but only from the LibreNMS application:
```
Next Doctrine\DBAL\Driver\PDO\Exception: SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction in /opt/librenms/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/PDO/Exception.php:18
Next Illuminate\Database\QueryException: SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction (SQL: update `config` set `config_value` = "fping6" where `config_id` = 796) in /opt/librenms/vendor/laravel/framework/src/Illuminate/Database/Connection.php:671
[2020-12-09 20:14:15] production.ERROR: PDOException: SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction in /opt/librenms/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/PDOStatement.php:115
```
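The traceback's own advice ("try restarting transaction") is the standard remedy for error 1205, and the exception reaching the log suggests nothing in the stack is retrying it here. As a rough illustration of that retry pattern (a toy sketch, not LibreNMS or Laravel code; the function name, attempt count, and backoff are made up), it might look like:

```python
import time

LOCK_WAIT_TIMEOUT = 1205  # MySQL/MariaDB error code from the log above

def retry_on_lock_timeout(func, attempts=3, delay=0.1):
    """Re-run func, retrying only when it fails with error code 1205."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:
            code = exc.args[0] if exc.args else None
            if code != LOCK_WAIT_TIMEOUT or attempt == attempts - 1:
                raise  # a different error, or out of retries
            time.sleep(delay * (attempt + 1))  # simple linear backoff
```

Any other error code is re-raised immediately; only 1205 triggers a retry.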
This is a distributed poller config, with:

- 1x Web/RRDCache/MEMCache Server
- 3x Poller Servers
- 2x ProxySQL Servers
- 3x MariaDB Galera Servers
Output of `./validate.php`:

| Component | Version |
|---|---|
| LibreNMS | 1.70.1-1-ga3635d0b7 |
| DB Schema | 2020_10_12_095504_mempools_add_oids (191) |
| PHP | 7.4.13 |
| Python | 3.6.8 |
| MySQL | 10.4.14-MariaDB-log |
| RRDTool | 1.7.0 |
| SNMP | NET-SNMP 5.8 |
```
[OK]   Composer Version: 2.0.8
[OK]   Dependencies up-to-date.
[OK]   Database connection successful
[OK]   Database schema correct
[INFO] Detected Python Wrapper
[OK]   Connection to memcached is ok
```
The ProxySQL servers are configured with the proper read/write split rules and Galera Cluster monitoring, which makes the primary DB node the writer and the other two DB nodes the readers. The MariaDB cluster is not very busy, but it does have other applications hitting it (Nagios, in-house developed tools, etc.). One of the in-house apps is a MUCH busier application than LibreNMS, and it is not having any timeout issues; by busier, I mean it is a Python script dumping over 10,000 rows of (simple) data every 5 minutes.
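For reference, the effect of those split rules is roughly the following (a toy sketch of the routing behavior, not ProxySQL code; the host names `db1`–`db3` are placeholders):

```python
import itertools

WRITER = "db1"                              # current Galera primary (placeholder)
READERS = itertools.cycle(["db2", "db3"])   # the two reader nodes, round-robin

def route(query: str) -> str:
    """Send SELECTs to the readers in turn; everything else goes to the writer."""
    if query.lstrip().upper().startswith("SELECT"):
        return next(READERS)
    return WRITER
```

Under rules like these, the failing `update config ...` statement from the log would always land on the primary node, so only one Galera node should ever be taking LibreNMS writes.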
LibreNMS is monitoring around 365 devices, mainly Cisco switches/routers, Palo Alto firewalls, and Linux servers. The pollers are not reporting these 1205 lock wait timeout errors; only the Web/RRDCache/MEMCache server is. Are there additional tweaks needed to make this a little more stable with the DB cluster? We have tuned the DB as much as possible, but since only LibreNMS is seeing an issue, I’m hoping someone else has an idea on the LibreNMS side.