Distributed poller - Only 1 poller can write to rrdcached server

I have configured distributed polling following the documentation. I currently have one main server acting as the GUI and MySQL server, a dedicated rrdcached server, a dedicated Redis server, and 4 pollers each on separate VMs with identical specs (4vCPU/8G RAM).

Each poller has an identical config.php

<?php
$config['user'] = 'librenms';
$config['distributed_poller'] = true;
$config['distributed_poller_name']           = php_uname('n');
$config['distributed_poller_group']          = '1';
$config['rrdtool_version'] = '1.7.2';
$config['rrdcached']    = "$rrdcached_server:42217";

And identical .env file except for the NODE_ID and VAPID keys

DB_HOST=$libre-master
DB_DATABASE=librenms
DB_USERNAME=$db-User
DB_PASSWORD=$db-Pass
APP_KEY=$APP-KEY
REDIS_HOST=$redis_host
REDIS_PORT=6379
REDIS_DB=0
REDIS_TIMEOUT=60
CACHE_DRIVER=redis
INSTALL=true
NODE_ID=$node_id
VAPID_PUBLIC_KEY=$public_key
VAPID_PRIVATE_KEY=$private_key
LIBRENMS_USER=librenms

validate.php output for both of the currently active pollers looks identical as well

===========================================
Component | Version
--------- | -------
LibreNMS  | 23.11.0-24-gf6e7795ca (2023-12-07T11:58:38+00:00)
DB Schema | 2023_11_21_172239_increase_vminfo.vmwvmguestos_column_length (274)
PHP       | 8.2.13
Python    | 3.8.10
Database  | MariaDB 10.4.22-MariaDB-1:10.4.22+maria~focal-log
RRDTool   | 1.7.2
SNMP      | 5.8
===========================================

[OK]    Composer Version: 2.6.5
[OK]    Dependencies up-to-date.
[OK]    Database connection successful
[OK]    Database Schema is current
[OK]    SQL Server meets minimum requirements
[OK]    lower_case_table_names is enabled
[OK]    MySQL engine is optimal
[OK]    Database and column collations are correct
[OK]    Database schema correct
[OK]    MySQl and PHP time match
[OK]    Distributed Polling setting is enabled globally
[OK]    Connected to rrdcached
[OK]    Active pollers found
[FAIL]  Some dispatcher nodes have not checked in recently
  Inactive Nodes:
   libre-poller01 (node is shut down)
   libre-poller02 (node is shut down)
[OK]    Locks are functional
[OK]    Python wrapper cron entry is not present
[OK]    Redis is functional
[WARN]  IPv6 is disabled on your server, you will not be able to add IPv6 devices.
[OK]    rrdtool version ok
[OK]    Connected to rrdcached

On the non-working poller I can see that it completes a poll for a given device, but no data is actually added to the RRD files, just -nan for every value. The other poller, however, inserts values without an issue.
So I did a manual poll from the non-working poller

lnms device:poll -vvv 696

And I can see that it gets values and sends the rrd update statements and gets an OK response

...
RRD[last $target_host/$rrdcached_host/netstats-ip.rrd  --daemon $rrdcached_host:42217]
RRDtool Output: 1702062900
OK u:0.00 s:0.01 r:37.14
RRD[update $target_host/$rrdcached_host/netstats-ip.rrd N:2538396:381353314:1601681208:510048788:0:0:51739749:0:0:0:4130:0:8260:18486:0:0 --daemon $rrdcached_host:42217]
RRDtool Output: OK u:0.00 s:0.01 r:29.81
...
SNMP [40/34.57s]: Snmpget[21/11.62s] Snmpwalk[19/22.96s]
SQL [106/342.85s]: Select[77/250.11s] Insert[1/3.20s] Update[26/83.13s] Delete[2/6.41s]
RRD [110/8.88s]: Other[55/8.87s] Update[55/0.01s]
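
To put the summary lines above into perspective, here is a minimal Python sketch that turns one of them into an average time per call. It assumes the bracket format is [count/total seconds], which is how the output reads but is not confirmed anywhere in this thread; parse_summary is a hypothetical helper name, not part of LibreNMS.

```python
import re


def parse_summary(line):
    """Parse a poller summary line such as 'SQL [106/342.85s]: Select[...]'.

    Assumes the leading bracket means [call count / total seconds].
    Returns (label, count, total_seconds, average_seconds_per_call).
    """
    match = re.match(r"\s*(\w+)\s*\[(\d+)/([\d.]+)s\]", line)
    if match is None:
        raise ValueError(f"unrecognised summary line: {line!r}")
    label = match.group(1)
    count = int(match.group(2))
    seconds = float(match.group(3))
    # Guard against a zero count so an idle category does not divide by zero.
    average = seconds / count if count else 0.0
    return label, count, seconds, average


if __name__ == "__main__":
    for text in (
        "SNMP [40/34.57s]: Snmpget[21/11.62s] Snmpwalk[19/22.96s]",
        "SQL [106/342.85s]: Select[77/250.11s] Insert[1/3.20s] Update[26/83.13s] Delete[2/6.41s]",
        "RRD [110/8.88s]: Other[55/8.87s] Update[55/0.01s]",
    ):
        label, count, seconds, average = parse_summary(text)
        print(f"{label}: {count} calls, {seconds:.2f}s total, {average:.2f}s per call")
```

Under that assumption the SQL line works out to roughly 3.2 seconds per query, far more than the RRD updates themselves, so database latency from this poller may be worth checking alongside rrdcached.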

But then the data never shows up in the RRD file; I just see -nan, like so

1702061700: -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan
1702062000: -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan
1702062300: -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan
1702062600: -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan
1702062900: -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan
1702063200: -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan
1702063500: -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan

But, doing a manual update using rrdtool seems to work

1702061700: -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan
1702062000: -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan
1702062300: -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan
1702062600: -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan
1702062900: -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan
1702063200: 0.0000000000e+00 1.4316551541e+07 1.4316518839e+07 1.4316548976e+07 0.0000000000e+00 0.0000000000e+00 1.4316556583e+07 0.0000000000e+00 0.0000000000e+00 0.0000000000e+00 0.0000000000e+00 0.0000000000e+00 0.0000000000e+00 1.4316557649e+07 0.0000000000e+00 0.0000000000e+00
1702063500: -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan
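
Checking dumps like the two above by eye gets tedious across many files. The following is a minimal Python sketch that reads rows in the "timestamp: value value ..." layout shown here and lists the timestamps that hold at least one real value. The function names are hypothetical, and it assumes the text was captured from rrdtool fetch or a similar command that prints this layout.

```python
import math


def parse_fetch_rows(text):
    """Map each timestamp to its list of float values.

    Lines without a numeric timestamp before the colon (for example a
    header line of data source names) are skipped.
    """
    rows = {}
    for line in text.splitlines():
        timestamp, separator, values = line.partition(":")
        timestamp = timestamp.strip()
        if not separator or not timestamp.isdigit():
            continue
        # float() accepts 'nan' and '-nan', so unknown values become NaN.
        rows[int(timestamp)] = [float(value) for value in values.split()]
    return rows


def timestamps_with_data(rows):
    """Return the sorted timestamps where at least one value is not NaN."""
    return [
        timestamp
        for timestamp, values in sorted(rows.items())
        if any(not math.isnan(value) for value in values)
    ]


if __name__ == "__main__":
    sample = """\
1702062900: -nan -nan -nan
1702063200: 0.0000000000e+00 1.4316551541e+07 1.4316518839e+07
1702063500: -nan -nan -nan
"""
    print(timestamps_with_data(parse_fetch_rows(sample)))
```

Run against the second dump, this would report only 1702063200, the slot filled by the manual update.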

I can’t see anything in any log that points to what the issue might be. I’ve created new pollers and ensured that they can all connect to the rrdcached server using netcat on the necessary port and get answers back

librenms@libre-poller02:~$ nc $rrdcached_host 42217
stats
9 Statistics follow
QueueLength: 0
UpdatesReceived: 78294875
FlushesReceived: 0
UpdatesWritten: 55355747
DataSetsWritten: 77622361
TreeNodesNumber: 198490
TreeDepth: 21
JournalBytes: 0
JournalRotate: 0
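
The same check can be scripted so the counters can be compared before and after a manual poll. This is a minimal Python sketch, not an official client: it assumes the daemon answers the STATS command in the form shown above (a status line starting with the number of lines that follow, then "Key: value" lines), and the host name in the usage note is a placeholder.

```python
import socket


def parse_stats(lines):
    """Parse a STATS reply given as a list of lines.

    The first line is expected to look like '9 Statistics follow';
    a negative leading number signals an error from the daemon.
    """
    count = int(lines[0].split()[0])
    if count < 0:
        raise RuntimeError(f"rrdcached returned an error: {lines[0]}")
    stats = {}
    for line in lines[1:1 + count]:
        key, _, value = line.partition(":")
        stats[key.strip()] = int(value)
    return stats


def query_stats(host, port=42217, timeout=5.0):
    """Send STATS to rrdcached over TCP and return the counters as a dict."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"STATS\n")
        reader = sock.makefile("r", encoding="ascii", newline="\n")
        first = reader.readline().rstrip("\n")
        count = int(first.split()[0])
        # Read exactly as many lines as the status line announces.
        lines = [first] + [reader.readline().rstrip("\n") for _ in range(max(count, 0))]
    return parse_stats(lines)
```

Calling query_stats("rrdcached.example.net") before and after running lnms device:poll on the affected poller, and comparing UpdatesReceived, may show whether that poller's updates reach the daemon at all. On a busy daemon other pollers also increase the counter, so the comparison is only a rough indication.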

Also, in the GUI the working poller seems to be polling ~550 devices efficiently, whereas the non-working poller seems to be struggling with just over 60 devices

Any help anybody could provide would be much appreciated.

Are you redacting the actual rrdcached_server address or have you literally used a variable for it?

Sorry, should have mentioned, anything in this post that looks like a variable is just a redacted name I didn’t want to include.

Is the time synced on all nodes? That is all I can think of off the top of my head.

Are all the pollers in the same poller group?

Thanks for the reply. I did check the time on all nodes and all are synced. At present I have two poller groups; the poller that seems to work properly is in Group 1, while the non-working poller is in Group 2.
The question about poller groups made me wonder whether the groups themselves are the issue, so I ended up setting all my hosts to poller group 1 and set all 4 pollers to service that group. I see data flowing into my rrd files now; time will tell if that’s the solution.
