Do you have a specific question?
We have about 2000 devices and have recently been having a lot of issues with polling and keeping the poller running.
When I check poller_wrapper.log I am seeing lots of "disk quota exceeded" errors and sometimes MySQL errors (1040, 'Too many connections').
We have already increased the MySQL max_connections from the default.
The VM has 32 GB of RAM, 2 sockets, and 12 cores.
Any recommendations on what is happening here, or whether we need to change more settings to handle this many devices?
Have you enabled exporting of poller data to an external InfluxDB system? Maybe try disabling that first to see if the poller keeps up on its own with just the internal MySQL and RRD data stores. If so, you know the InfluxDB warnings are the cause and you can work on addressing the disk quota and ingest settings on that side.
We are not exporting poller data to an external InfluxDB; we are running it locally.
Thank you for the idea.
Meaning you installed InfluxDB on your LibreNMS server and then configured something like this to ship metrics to Influx?
If so, I would still disable that temporarily to see if the poller works without metrics shipping first and go from there.
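For reference, the InfluxDB (v1) export in LibreNMS is driven from config.php, so disabling it temporarily is a one-line change. A minimal sketch, assuming the standard influxdb keys from the LibreNMS docs (host/port/db values below are placeholders):

```php
<?php
// config.php -- LibreNMS InfluxDB (v1) export settings.
// Setting 'enable' to false stops the pollers from shipping metrics to Influx
// while leaving the MySQL and RRD data stores untouched.
$config['influxdb']['enable']    = false;       // flip back to true once Influx is healthy
$config['influxdb']['transport'] = 'http';
$config['influxdb']['host']      = '127.0.0.1'; // placeholder: local InfluxDB instance
$config['influxdb']['port']      = '8086';
$config['influxdb']['db']        = 'librenms';
```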
You need to get a handle on what is happening with your InfluxDB and why it is generating disk quota exceeded messages (check the InfluxDB logs, disk usage, and whether you have quotas set on your filesystems, etc.).
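Assuming a stock InfluxDB 1.x install with data under /var/lib/influxdb (adjust paths and service names to your layout), a quick triage could look like:

```bash
# Recent InfluxDB log entries (systemd-based installs)
journalctl -u influxdb --since "1 hour ago" | tail -n 50

# Space used by the Influx data/WAL directories, and what is left on that filesystem
du -sh /var/lib/influxdb/data /var/lib/influxdb/wal
df -h /var/lib/influxdb

# If filesystem quotas are enabled, check whether the influxdb user is hitting one
quota -u influxdb 2>/dev/null || echo "no user quotas configured"
```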
From an operational availability perspective I would put a Telegraf instance between LibreNMS and InfluxDB: Telegraf takes the InfluxDB writes and batches them for sending on to InfluxDB. That way, if InfluxDB is slow or unavailable it won't stop your pollers from working. We have a Telegraf instance co-deployed with each poller.
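A rough sketch of that pattern in telegraf.conf: Telegraf listens on an InfluxDB line-protocol endpoint that LibreNMS writes to, buffers the points in memory, and forwards them in batches to the real InfluxDB. Hostnames, ports and buffer sizes below are placeholders you would size for your own load:

```toml
[agent]
  interval = "10s"
  flush_interval = "10s"
  metric_batch_size = 5000       # points sent per write to InfluxDB
  metric_buffer_limit = 500000   # points held in memory if InfluxDB is slow or down

# Accept InfluxDB v1 line-protocol writes from LibreNMS (point LibreNMS at this port)
[[inputs.influxdb_listener]]
  service_address = ":8186"

# Forward the buffered metrics to the real InfluxDB instance
[[outputs.influxdb]]
  urls = ["http://influxdb.example.com:8086"]   # placeholder hostname
  database = "librenms"
```

The key point is the buffer: if InfluxDB stalls, Telegraf holds the poller output (up to metric_buffer_limit points) instead of blocking the pollers.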
You can set up MySQL monitoring in LibreNMS and keep an eye on 'max_connections', 'Max_used_connections', etc. We have ours set to 4,000 and are using about 2.95k connections. From memory you get one connection per poller thread, so if you have 10 pollers running 90 threads each that is roughly 900 connections.
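A hedged example of checking those numbers by hand (standard MySQL/MariaDB variables; the 4,000 figure is just the limit mentioned above):

```sql
-- Configured limit and the high-water mark of connections actually used
SHOW VARIABLES LIKE 'max_connections';
SHOW GLOBAL STATUS LIKE 'Max_used_connections';

-- What is connected right now (handy for spotting per-poller connection counts)
SHOW PROCESSLIST;
```

To raise the limit, `SET GLOBAL max_connections = 4000;` takes effect immediately, and `max_connections = 4000` under `[mysqld]` in my.cnf makes it survive restarts.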
If you have the whole deployment on a single device you may be running into various limits.
If you are using Redis: we needed to raise our connection limit to 12k (currently 8k connected) and adjust the /etc/security/limits.d/redis.conf file.
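For reference, a sketch of the sort of changes involved; the numbers are illustrative and should match your own connection counts, and a systemd-managed Redis may also need LimitNOFILE raised in its unit file:

```
# /etc/security/limits.d/redis.conf
# Raise the open-file limit for the redis user; every client connection uses a file descriptor.
redis  soft  nofile  65535
redis  hard  nofile  65535

# /etc/redis/redis.conf
# Allow more simultaneous client connections (Redis caps this via maxclients and the fd limit).
maxclients 12000
```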
We use distributed pollers (54) and a high-spec central server to run rrdcached/MySQL/InfluxDB, with all its I/O backed by a large multi-NVMe array.
We are doing ~6k devices, ~480k ports