Distributed polling graphing oddly

$ ./validate.php 
===========================================
Component | Version
--------- | -------
LibreNMS  | 23.7.0-65-g6ad3ff9b9 (2023-08-08T05:36:23-04:00)
DB Schema | 2023_08_02_120455_vendor_ouis_unique_index (255)
PHP       | 8.2.8
Python    | 3.11.2
Database  | MariaDB 10.11.3-MariaDB-1-log
RRDTool   | 1.7.2
SNMP      | 5.9.3
===========================================

[OK]    Composer Version: 2.5.8
[OK]    Dependencies up-to-date.
[OK]    Database connection successful
[OK]    Database Schema is current
[OK]    SQL Server meets minimum requirements
[OK]    lower_case_table_names is enabled
[OK]    MySQL engine is optimal
[OK]    Database and column collations are correct
[OK]    Database schema correct
[OK]    MySQl and PHP time match
[OK]    Distributed Polling setting is enabled globally
[OK]    Connected to rrdcached
[OK]    Active pollers found
[OK]    Dispatcher Service is enabled
[OK]    Locks are functional
[OK]    Python wrapper cron entry is not present
[OK]    Redis is functional
[OK]    rrdtool version ok
[OK]    Connected to rrdcached

poller01:~$ ./validate.php 
===========================================
Component | Version
--------- | -------
LibreNMS  | 23.7.0-65-g6ad3ff9b9 (2023-08-08T05:36:23-04:00)
DB Schema | 2023_08_02_120455_vendor_ouis_unique_index (255)
PHP       | 8.2.8
Python    | 3.11.2
Database  | MariaDB 10.11.3-MariaDB-1-log
RRDTool   | 1.7.2
SNMP      | 5.9.3
===========================================

[OK]    Composer Version: 2.5.8
[OK]    Dependencies up-to-date.
[OK]    Database connection successful
[OK]    Database Schema is current
[OK]    SQL Server meets minimum requirements
[OK]    lower_case_table_names is enabled
[OK]    MySQL engine is optimal
[OK]    Database and column collations are correct
[OK]    Database schema correct
[OK]    MySQl and PHP time match
[OK]    Distributed Polling setting is enabled globally
[OK]    Connected to rrdcached
[OK]    Active pollers found
[OK]    Dispatcher Service is enabled
[OK]    Locks are functional
[OK]    Python wrapper cron entry is not present
[OK]    Redis is functional
[OK]    rrdtool version ok
[OK]    Connected to rrdcached

New install with distributed polling: one poller on the same VM as the web GUI and one on another VM.
Both pollers are on the same version. When I poll from the default group (which contains both pollers), graphing works fine. I created two groups with only one poller in each, added a device to one of the groups, and noticed it was no longer graphing. I moved it to the other group and still no graphing. Switched it back to the default group and it graphs.

If I tail the rrd.journal file I can see it stop getting data when I make the change to the single-poller group; when I move the device back to the default group, it graphs again.
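For reference, I'm watching the journal with something like the below (the path depends on where rrdcached's journal option points, so adjust for your own setup):

tail -f /var/lib/rrdcached/journal/rrd.journal.*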

I checked rrdcached and it looks fine:

$ lnms config:get rrdcached 
10.2.1.43:42217

That's the same IP as the host running rrdcached and the web UI.
I see connections into rrdcached, and it works fine with the default poller group, just not with a group containing a single poller.
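For example, running this on the rrdcached host shows which pollers hold open connections to it (assuming nothing else is using port 42217):

ss -tnp | grep 42217 | grep -i estab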

Hey Steve,

Have you tested (from the second poller) that your rrdcached port is in fact accessible, e.g.

nc -v 10.2.1.43 42217

You might also want to check that it's listening as well. I forgot to include the stanza in my rrdcached default conf file and ended up with a very similar issue, although my symptom was gaps. The second poller wasn't able to connect to the rrdcached… Check you have:

NETWORK_OPTIONS="-l x.x.x.x -L 42217"

in the /etc/default/rrdcached file, and check once the service is running that you can see the port:

ss -pan | grep 42217 | grep -i listen

(I’m using Ubuntu 22.04)
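For reference, the relevant bits of my /etc/default/rrdcached look roughly like the below — treat it as a sketch rather than a drop-in file, since the variable names and paths can differ between distro packages and your own layout:

# /etc/default/rrdcached (Debian/Ubuntu-style defaults file; names may vary)
BASE_PATH=/opt/librenms/rrd/
JOURNAL_PATH=/var/lib/rrdcached/journal/
# listen on the LAN address so the remote poller can reach it
NETWORK_OPTIONS="-l x.x.x.x -L 42217"

and remember to restart rrdcached (e.g. systemctl restart rrdcached) after editing it.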

I hope that is some help
Chris

# ss -pan | grep 42217 | grep -i listen
tcp   LISTEN    0      511                                                     0.0.0.0:42217            0.0.0.0:*         users:(("rrdcached",pid=4020505,fd=4))                                                                                                                                                                                                                                                                                                                                     
tcp   LISTEN    0      511                                                        [::]:42217               [::]:*         users:(("rrdcached",pid=4020505,fd=5))            

Yes, it is listening.

Funny thing is, even if I make a group using the main LibreNMS server (the one running rrdcached), I still get no graphs.

So it is listening… can we check that the right output is observed for the config? Run the following as the librenms user:

su - librenms
lnms config:get rrdcached

I get something along the lines of

librenms@libre01:~$ lnms config:get rrdcached
10.2.2.2:42217
librenms@libre01:~$

I would try that from your other poller as well… just to be sure. As you know the port is listening, can your boxes talk to it? On the main box and the poller box, try the following, which sends the ASCII string STATS to rrdcached and hopefully brings back something whilst testing the socket:

echo STATS|nc -vN 10.x.x.x 42217

For example I see:

librenms@libre01:~$ echo STATS|nc -vN 10.2.2.2 42217
Connection to 10.2.2.2 42217 port [tcp/*] succeeded!
9 Statistics follow
QueueLength: 0
UpdatesReceived: 2258071
FlushesReceived: 0
UpdatesWritten: 49406
DataSetsWritten: 2183723
TreeNodesNumber: 2553
TreeDepth: 13
JournalBytes: 203239629
JournalRotate: 22
librenms@libre01:~$

If that works, that means the socket is of course accessible and rrdcached is behind it too… If all is good, try and enumerate the RRD files by sending a LIST RECURSIVE /:

echo LIST RECURSIVE /|nc -vN 10.x.x.x 42217

You should see a shed load of filenames returned that match the graphs for your devices…
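If that list is huge, you can always pipe it through grep for one of your device hostnames to confirm its files are actually there, e.g. something like:

echo LIST RECURSIVE /|nc -vN 10.x.x.x 42217 | grep -i <hostname>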

Let us know what you find!

hope that helps a bit

$ echo STATS|nc -vv 10.2.1.43 42217
librenms.truestream.us [10.2.1.43] 42217 (?) open
9 Statistics follow
QueueLength: 0
UpdatesReceived: 41563081
FlushesReceived: 0
UpdatesWritten: 2570311
DataSetsWritten: 41272949
TreeNodesNumber: 26854
TreeDepth: 18
JournalBytes: 3988972433
JournalRotate: 75

Also did the same from the poller:

librenms-poller01:~$ echo STATS|nc -vv 10.2.1.43 42217
librenms.truestream.us [10.2.1.43] 42217 (?) open
9 Statistics follow
QueueLength: 0
UpdatesReceived: 41644154
FlushesReceived: 0
UpdatesWritten: 2575260
DataSetsWritten: 41354010
TreeNodesNumber: 26854
TreeDepth: 18
JournalBytes: 3996752732
JournalRotate: 75
sent 6, rcvd 210

Working as expected.
Again, I get graphs when both pollers are in a group together (the one on the same server as the web UI and the remote one), but when I pick one or the other on its own I get no graphs.

I'm not totally sure I understand. When you mention you "pick" one of them, are you putting that poller specifically into a group? Are there devices associated with that group, and does it affect just those devices? I would like to try and see if I can replicate that…

2 pollers. I created 2 new polling groups, one for the main server and the other for the remote poller. The default group (containing both) graphs; the other 2 do not.

So you have 2 additional pollers, making 3 in total if you count your main web GUI, and 2 additional groups, so 3 groups in total: default, one and two?

And each poller (including the main web GUI node) is looking at just one group? Could you cat the config.php from each of your nodes please? I'm interested in the $config['distributed_poller_group'] param.

Also, are you using the dispatcher service rather than the regular pollers?
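(A quick way to check, assuming the dispatcher was installed with the standard systemd unit name, is:

systemctl status librenms.service

on each node.)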

Just 2 pollers.
It graphs fine with the default group.

I have removed the other groups because they were not working as expected, and I do not have $config['distributed_poller_group'] in my config.php. On this install I have only been using lnms to configure the service, to keep it cleaner.
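For what it's worth, this is how I've been checking the distributed-poller settings from the CLI (run as the librenms user; these are just the standard config keys, nothing custom):

lnms config:get distributed_poller
lnms config:get distributed_poller_group
lnms config:get distributed_poller_name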

So I modified my testbed and came up with a similar issue with graphs not updating, but the issue was the devices simply not being polled.

my env is comprised of three libre instances
01 - libre poller and webgui
02 - libre poller and webgui (keepalived backup)
03 - libre only

All three nodes have the dispatcher service running and are visible on the pollers page in a cluster with a master and two slaves, and by default they were all in group 0. So far so good: if I take a node out by stopping the service, the remaining two pollers pick up the slack.

I tried what you indicated, which was to put one of the pollers in its own group. So I set the

$config['distributed_poller_group'] = '6';

on that poller and thought it would be best to restart the service. I also put one of the devices in that group. I was expecting to see that node polling just that one device, and I ran a packet capture on the device (the capture command is below the list) to make sure that the other nodes weren't talking to it. What I saw instead was:

  • poller 03 continuing to poll lots of things
  • the device in its own group NEVER getting polled.
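For reference, the capture I ran was something along these lines, with the interface and device IP as placeholders for your own values:

tcpdump -nn -i eth0 host 10.x.x.x and udp port 161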

I had a look back at the docs, and the Dispatcher Service (RC) page in the LibreNMS Docs does indicate (or at least purports) that $config['distributed_poller_group'] is the way to go, although I noticed on the poller settings there is a dropdown to configure the groups. "Maybe it's that?" I said. So I tried changing it in the UI, waited for the green update to fire, and it looked like it was OK.

Navigating away from the page made it look like it was back to default again, though. Back to square one.

I decided to take a peek in the database. There is a table called poller_cluster; when I queried it, this is what came up:

MariaDB [librenms]> select id,node_id,poller_name,poller_groups from poller_cluster;
+------+---------------+-------------+---------------+
| id   | node_id       | poller_name | poller_groups |
+------+---------------+-------------+---------------+
|    1 | 64d4e59f74a36 | libre01     | 0             |
|   43 | 64d50ea364cd4 | libre02     | 0             |
| 1213 | 64d5326ba0ef0 | libre03     | 0             |
+------+---------------+-------------+---------------+
3 rows in set (0.000 sec)

MariaDB [librenms]>

I was brave and made an update in there to force the group for libre03 to be 6:

MariaDB [librenms]> update poller_cluster set poller_groups=6 where id =1213;
Query OK, 1 row affected (0.001 sec)
Rows matched: 1  Changed: 1  Warnings: 0

MariaDB [librenms]> select id,node_id,poller_name,poller_groups from poller_cluster;
+------+---------------+-------------+---------------+
| id   | node_id       | poller_name | poller_groups |
+------+---------------+-------------+---------------+
|    1 | 64d4e59f74a36 | libre01     | 0             |
|   43 | 64d50ea364cd4 | libre02     | 0             |
| 1213 | 64d5326ba0ef0 | libre03     | 6             |
+------+---------------+-------------+---------------+
3 rows in set (0.000 sec)

And this followed through to the UI on the poller page… for five minutes… and then it went back to default again, and the database went back to default too. I retried, but this time I stopped ALL the poller services, updated the DB, and then restarted the services. Now we have more success: I can see just this device being polled by libre03 and nothing else, and the graphs are updating (because it's being polled)…
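Roughly the sequence that worked for me, assuming the dispatcher runs under the standard librenms systemd unit on every node:

# on every poller node
systemctl stop librenms.service
# run the UPDATE on poller_cluster (as above) while everything is stopped
# then on every poller node again
systemctl start librenms.service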

So, in summary, it looks like (a) the config.php value is ignored when using the dispatcher service, and (b) there seems to be some problem with making a change like that through the UI (which I would think should be simpler). I don't believe I have anything wrong in my setup, but at least I know this is a fix that works.

let me know if that helps…

A little bit more testing: whether this is a problem (or not) seems to toggle on the very existence of

$config['distributed_poller_group']

in config.php. I don't believe I misunderstood the documentation, which suggests that is supposed to be in there, but if it is present (and set to, say, 0), it appears to override the desired poller group specified in the poller settings. Even if the DB is updated manually, it is overridden again. I have mine commented out, and restarting the librenms dispatcher service resolves the issue.
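In other words, on each node config.php now has something like

// $config['distributed_poller_group'] = 0;

followed by a systemctl restart librenms.service (or however your dispatcher service is managed), and the group set in the poller settings then sticks.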

Just checking whether that was the core issue you were having, or is there still something else up with your graphing?
