Multiple errors in ./validate.php

Hi @kalamchi75

I have finally got my CentOS instance and I did all the skeletal configs from the install documents.

Now I will start with editing the .env and also install the memcached and the rrdcached on this instance.

I’ll put in the configs like you showed me and then come back with what I have.

Thanks for all the help, Sir. :)

Santosh Kotla

You will need to enable distributed pollers on your master server (inside /opt/librenms/config.php):

# Enable Distributed Pollers
$config['distributed_poller'] = true;
$config['rrdcached']    = "localhost:42217";
$config['distributed_poller_memcached_host'] = 'localhost';
$config['distributed_poller_memcached_port'] = '11211';

And don’t forget to allow the remote poller access in your master server’s MySQL:

  GRANT ALL ON librenms.* TO 'librenms'@'<Remote-IP>' IDENTIFIED BY 'SomePassword';

If everything is OK, the remote pollers will pop up automatically in your master’s GUI.

Good luck
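
Before waiting for the poller to show up in the GUI, it may help to confirm from the remote poller that the master's services are reachable at all. This is only a minimal sketch: it assumes an `nc` (netcat) that supports `-z` and `-w` is installed on the poller, and `host_of`, `port_of` and `check_endpoint` are made-up helper names. Ports 11211 and 42217 come from the config above; 3306 is assumed as the MySQL port because the thread does not state it.

```shell
# Hypothetical helpers: split "host:port" using POSIX parameter expansion.
host_of() { printf '%s\n' "${1%:*}"; }
port_of() { printf '%s\n' "${1##*:}"; }

# Try a plain TCP connect and print OK or FAIL for the endpoint.
# Assumes an nc that supports -z (connect only) and -w (timeout in seconds).
check_endpoint() {
    if nc -z -w 3 "$(host_of "$1")" "$(port_of "$1")" 2>/dev/null; then
        echo "OK   $1"
    else
        echo "FAIL $1"
    fi
}
```

On the remote poller you could then run `check_endpoint <Master-IP>:3306`, `check_endpoint <Master-IP>:11211` and `check_endpoint <Master-IP>:42217`. A FAIL could point at a firewall rule or at a service that is bound to localhost only.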

Hi @kalamchi75

I have put in the configs as suggested and I see that the poller is showing up in the central server poller list. But when I made the suggested changes to rrdcached, it broke the graphs. I get an error message saying they are not able to connect to rrdcached.

Do you know where I could be going wrong?

===========

Enable the in-built services support (Nagios plugins)

$config['show_services'] = 1;
#$config["rrdcached"] = "unix:/var/run/rrdcached/rrdcached.sock";
$config["update_channel"] = "release";
$config["enable_syslog"] = 1;

#Syslog requirements
$config['enable_syslog'] = 1;

// Distributed Poller-Settings
$config['distributed_poller'] = true;
// optional: defaults to hostname
$config['distributed_poller_name'] = php_uname('n');
$config['distributed_poller_group'] = '0';
$config['rrdcached'] = "localhost:42217";
$config['distributed_poller_memcached_host'] = 'localhost';
$config['distributed_poller_memcached_port'] = '11211';

rrdcached config

DAEMON=/usr/bin/rrdcached
DAEMON_USER=librenms
DAEMON_GROUP=librenms
WRITE_THREADS=4
WRITE_TIMEOUT=1800
WRITE_JITTER=1800
BASE_PATH=/opt/librenms/rrd/
JOURNAL_PATH=/var/lib/rrdcached/journal/
PIDFILE=/run/rrdcached.pid
SOCKFILE=/run/rrdcached.sock
SOCKGROUP=librenms
BASE_OPTIONS="-B -F -R"

BASE_OPTIONS="-l 0:42217"
BASE_OPTIONS="$BASE_OPTIONS -R -j /var/lib/rrdcached/journal/ -F"
BASE_OPTIONS="$BASE_OPTIONS -b /opt/librenms/rrd -B"
BASE_OPTIONS="$BASE_OPTIONS -w 1800 -z 900"

Poller:

Also, the directory on the poller seems to have filled up pretty quickly…

[root@<hostname> opt]# df -h
Filesystem           Size  Used Avail Use% Mounted on
devtmpfs             5.9G     0  5.9G   0% /dev
tmpfs                5.9G     0  5.9G   0% /dev/shm
tmpfs                5.9G   25M  5.9G   1% /run
tmpfs                5.9G     0  5.9G   0% /sys/fs/cgroup
/dev/mapper/cs-root   40G  1.9G   39G   5% /
/dev/sda1            3.0G  198M  2.8G   7% /boot
/dev/mapper/cs-home  5.0G   68M  5.0G   2% /home
/dev/mapper/cs-opt    15G   15G   20K 100% /opt
/dev/mapper/cs-var    15G  445M   15G   3% /var
tmpfs                1.2G     0  1.2G   0% /run/user/0
tmpfs                1.2G     0  1.2G   0% /run/user/993

Thanks,
Santosh Kotla

Hi @kalamchi75

Whenever I set my rrdcached statement to localhost:42217, the graphs break with an ERROR: Unable to connect to rrdcached.

I have put in the right configs in the poller instance and then I did a validate but that wiped out my whole config file for some reason. Also I see that my storage got 50% full overnight. I have 300+GB of storage space on the server.

[root@<hostname> librenms]# df -h
Filesystem           Size  Used Avail Use% Mounted on
devtmpfs             5.9G     0  5.9G   0% /dev
tmpfs                5.9G     0  5.9G   0% /dev/shm
tmpfs                5.9G  210M  5.7G   4% /run
tmpfs                5.9G     0  5.9G   0% /sys/fs/cgroup
/dev/mapper/cs-root   40G  1.9G   39G   5% /
/dev/mapper/cs-home  5.0G   68M  5.0G   2% /home
/dev/sda1            3.0G  198M  2.8G   7% /boot
/dev/mapper/cs-var    15G  584M   15G   4% /var
/dev/mapper/cs-opt   314G  149G  166G  48% /opt
tmpfs                1.2G     0  1.2G   0% /run/user/0
tmpfs                1.2G     0  1.2G   0% /run/user/993

The server was sluggish and I had to force delete the folder librenms for the server to respond again and for the storage utilization to go down. I don’t know what’s causing this issue and I have tried to check all the things that I configured.

Please let me know how to proceed Sir.

Thanks,
Santosh Kotla

Hi Santosh,

First, run:
systemctl status rrdcached
and make sure the service is up and running.

Then, try to replace the localhost with the server’s actual IP address in your configuration lines and see if that would solve the connection issue to the rrdcached service.
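
To see whether anything is actually listening on the TCP port your config points to, the output of `ss -ltn` can be filtered. A rough sketch only: `is_listening` is a made-up helper name, and it assumes the usual `ss -ltn` layout where the first column is the socket state and the fourth column is the local address and port.

```shell
# Reads `ss -ltn` output on stdin; succeeds if a LISTEN socket uses the given port.
# Assumption: column 1 is the state and column 4 is "address:port".
is_listening() {
    awk -v port="$1" '
        $1 == "LISTEN" {
            n = split($4, parts, ":")      # the port is the last ":"-separated piece
            if (parts[n] == port) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}
```

For example: `ss -ltn | is_listening 42217 && echo "listening" || echo "nothing on 42217"`. If nothing is listening there, changing localhost to an IP address in config.php is unlikely to help on its own.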

also, the last few lines in my rrdcached config look like this:

# Any other options not specifically supported by the script (-P, -f,
# -F, -B).
BASE_OPTIONS="-B -F -R"
OPTS="-l 0:42217"
OPTS="$OPTS -R -j /var/lib/rrdcached/journal/ -F"
OPTS="$OPTS -b /opt/librenms/rrd -B"
OPTS="$OPTS -w 1800 -z 900"

You might want to use them.
Tip: don’t break your config file, make a backup copy of it before editing, so you have a rollback file should things go wrong.
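
Because each `OPTS="$OPTS ..."` line appends to the previous value, while a plain `VAR="..."` replaces whatever was there before, the order of these lines matters (two plain assignments to the same variable leave only the second one). The small sketch below just builds the string from the lines above and prints it, so you can compare it with the `ExecStart` line that `systemctl status rrdcached` reports. `has_flag` is a made-up helper for illustration.

```shell
# Build the option string exactly as in the config lines above.
OPTS="-l 0:42217"
OPTS="$OPTS -R -j /var/lib/rrdcached/journal/ -F"
OPTS="$OPTS -b /opt/librenms/rrd -B"
OPTS="$OPTS -w 1800 -z 900"
echo "rrdcached would be started with: $OPTS"

# Hypothetical helper: succeeds if the given flag appears as its own word.
has_flag() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

has_flag -l "$OPTS" && echo "TCP listener (-l) is part of the options"
```

If the running process does not show the `-l` option, the daemon is probably not reading these lines, and a TCP address in config.php would have nothing to connect to.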

Once done, please share your ./validate.php result from the master server.

Let’s fix this one first, and then we can look at the load and disk usage issues.

Hi @kalamchi75

This is what my rrdcached file looks like on the master.

DAEMON=/usr/bin/rrdcached
DAEMON_USER=librenms
DAEMON_GROUP=librenms
WRITE_THREADS=4
WRITE_TIMEOUT=1800
WRITE_JITTER=1800
BASE_PATH=/opt/librenms/rrd/
JOURNAL_PATH=/var/lib/rrdcached/journal/
PIDFILE=/run/rrdcached.pid
SOCKFILE=/run/rrdcached.sock
SOCKGROUP=librenms
BASE_OPTIONS="-B -F -R"

BASE_OPTIONS="-l 0:42217"
BASE_OPTIONS="$BASE_OPTIONS -R -j /var/lib/rrdcached/journal/ -F"
BASE_OPTIONS="$BASE_OPTIONS -b /opt/librenms/rrd -B"
BASE_OPTIONS="$BASE_OPTIONS -w 1800 -z 900"

I have tried to replace the localhost with the IP address as well as the hostname.companydomain.com for which I have a DNS entry. Both of them didn’t work for me.

This is with the IP address of the master server:

root@dc5up-vlibrenms01:~# systemctl status rrdcached
● rrdcached.service - Data caching daemon for rrdtool
Loaded: loaded (/etc/systemd/system/rrdcached.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2021-09-08 16:55:32 UTC; 16h ago
Process: 17142 ExecStart=/usr/bin/rrdcached -w 1800 -z 1800 -f 3600 -s librenms -U librenms -G librenms -B -R -j /
Main PID: 17148 (rrdcached)
Tasks: 525 (limit: 4915)
CGroup: /system.slice/rrdcached.service
└─17148 /usr/bin/rrdcached -w 1800 -z 1800 -f 3600 -s librenms -U librenms -G librenms -B -R -j /var/tmp

Sep 09 09:37:21 dc5up-vlibrenms01 rrdcached[17148]: handle_request_update: Could not read RRD file.
Sep 09 09:37:22 dc5up-vlibrenms01 rrdcached[17148]: handle_request_update: Could not read RRD file.
Sep 09 09:37:22 dc5up-vlibrenms01 rrdcached[17148]: handle_request_update: Could not read RRD file.
Sep 09 09:37:56 dc5up-vlibrenms01 rrdcached[17148]: handle_request_update: Could not read RRD file.
Sep 09 09:38:01 dc5up-vlibrenms01 rrdcached[17148]: handle_request_update: Could not read RRD file.
Sep 09 09:38:01 dc5up-vlibrenms01 rrdcached[17148]: handle_request_update: Could not read RRD file.
Sep 09 09:38:36 dc5up-vlibrenms01 rrdcached[17148]: handle_request_update: Could not read RRD file.
Sep 09 09:38:36 dc5up-vlibrenms01 rrdcached[17148]: handle_request_update: Could not read RRD file.
Sep 09 09:40:50 dc5up-vlibrenms01 rrdcached[17148]: handle_request_update: Could not read RRD file.
Sep 09 09:40:50 dc5up-vlibrenms01 rrdcached[17148]: handle_request_update: Could not read RRD file.

This is when I replaced it with the hostname.companydomain.com

root@dc5up-vlibrenms01:~# systemctl status rrdcached
● rrdcached.service - Data caching daemon for rrdtool
Loaded: loaded (/etc/systemd/system/rrdcached.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2021-09-09 09:49:24 UTC; 33s ago
Process: 29833 ExecStart=/usr/bin/rrdcached -w 1800 -z 1800 -f 3600 -s librenms -U librenms -G librenms -B -R -j /
Main PID: 29836 (rrdcached)
Tasks: 2 (limit: 4915)
CGroup: /system.slice/rrdcached.service
└─29836 /usr/bin/rrdcached -w 1800 -z 1800 -f 3600 -s librenms -U librenms -G librenms -B -R -j /var/tmp

Sep 09 09:49:24 dc5up-vlibrenms01 systemd[1]: Starting Data caching daemon for rrdtool…
Sep 09 09:49:24 dc5up-vlibrenms01 systemd[1]: Started Data caching daemon for rrdtool.

root@dc5up-vlibrenms01:~# ./validate.php
Do not run validate.php as root
root@dc5up-vlibrenms01:~# su - librenms
librenms@dc5up-vlibrenms01:~$ ./validate.php

Component | Version
--------- | -------
LibreNMS  | 21.8.0
DB Schema | 2021_25_01_0129_isis_adjacencies_nullable (217)
PHP       | 7.3.25-1+ubuntu18.04.1+deb.sury.org+1
Python    | 3.6.9
MySQL     | 10.5.12-MariaDB-1:10.5.12+maria~bionic
RRDTool   | 1.7.0
SNMP      | NET-SNMP 5.7.3

====================================

[OK] Composer Version: 2.1.6
[OK] Dependencies up-to-date.
[OK] Database connection successful
[WARN] Your database schema has extra migrations (2020_06_24_155119_drop_ports_if_high_speed, 2021_08_26_093522_config_value_to_medium_text, 2021_25_01_0128_isis_adjacencies_add_admin_status, 2021_25_01_0129_isis_adjacencies_nullable). If you just switched to the stable release from the daily release, your database is in between releases and this will be resolved with the next release.
[OK] Database schema correct
[INFO] Detected Dispatcher Service
[WARN] IPv6 is disabled on your server, you will not be able to add IPv6 devices.
[FAIL] Cannot connect to rrdcached instance
[FAIL] Cannot connect to rrdcached instance
[WARN] Your local git contains modified files, this could prevent automatic updates.
[FIX]:
You can fix this with ./scripts/github-remove
Modified Files:
config/database.php

Thanks,
Santosh Kotla

Hi @kalamchi75

I reverted changes for the rrdcached and it is pointing to the default value that comes with the config.

My log file expanded to about 450M and I purged it and got the latest logs from it. This is what I see.

[2021-09-09 11:10:32] production.ERROR: Unable to launch a new process. {"exception":"[object] (Symfony\Component\Process\Exception\RuntimeException(code: 0): Unable to launch a new process. at /opt/librenms/vendor/symfony/process/Process.php:350)
[stacktrace]
#0 /opt/librenms/vendor/symfony/process/Process.php(247): Symfony\Component\Process\Process->start(NULL, Array)
#1 /opt/librenms/includes/snmp.inc.php(375): Symfony\Component\Process\Process->run()
#2 /opt/librenms/includes/functions.php(503): snmp_check(Array)
#3 /opt/librenms/includes/functions.php(1791): isSNMPable(Array)
#4 /opt/librenms/includes/polling/functions.inc.php(292): device_is_up(Array, true)
#5 /opt/librenms/poller.php(140): poll_device(Array, false)
#6 {main}

If I run an rrdcached status command, it hangs and never returns anything. The same goes for a restart. So I just tried to peek into the logs, and the above is what I see.

I have run a ./daily.sh update today as well. I also set rrd_purge to 180 days so that I keep about 6 months of data. I don’t know if that’s eating up a good amount of the space.

Do you think this instance needs a reboot? Your help to guide me out of this is much appreciated.

Thanks for your time.

Santosh Kotla

Hi Santosh,

How does the server load look?

Try running top and see if anything is driving the server’s resources to a choking point.

Hi @kalamchi75

This is what I see when I run top:

librenms@dc5up-vlibrenms01:~$ top
top - 11:30:24 up 9 days, 1:46, 2 users, load average: 113.24, 84.64, 64.80
Tasks: 2148 total, 27 running, 1970 sleeping, 0 stopped, 0 zombie
%Cpu(s): 70.0 us, 17.7 sy, 0.0 ni, 7.8 id, 2.5 wa, 0.0 hi, 2.0 si, 0.0 st
KiB Mem : 65967468 total, 18676216 free, 38198880 used, 9092372 buff/cache
KiB Swap: 524284 total, 0 free, 524284 used. 27074436 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31746 librenms 20 0 16.293g 9.008g 19568 R 100.0 14.3 24:49.19 php
21285 librenms 20 0 295812 8268 2856 R 100.0 0.0 3016:10 php
22789 librenms 20 0 295812 8056 2640 R 100.0 0.0 3332:17 php
30086 librenms 20 0 295812 7712 2556 R 100.0 0.0 3201:12 php
11351 librenms 20 0 2404156 1.968g 19588 R 99.7 3.1 5:15.35 php
13291 librenms 20 0 16.293g 0.010t 19476 R 99.7 16.9 30:02.68 php
10159 librenms 20 0 304564 8068 2824 R 99.4 0.0 1900:26 php
25255 librenms 20 0 295812 8048 2632 R 99.4 0.0 6858:52 php
30854 librenms 20 0 4501308 3.645g 19388 R 99.0 5.8 10:14.52 php
24019 librenms 20 0 304564 8376 2852 R 98.7 0.0 1988:28 php
13597 librenms 20 0 438076 114740 19476 R 95.8 0.2 0:12.18 php
31898 librenms 20 0 4364696 46440 9396 S 63.5 0.1 659:16.76 python3
31808 mysql 20 0 10.707g 2.049g 9560 S 40.6 3.3 2613:42 mariadbd
539 root 19 -1 590284 77792 65136 S 16.5 0.1 1862:25 systemd-journal
16884 librenms 20 0 325672 45672 19760 S 14.8 0.1 0:00.46 php
16929 librenms 20 0 325672 45532 19600 S 14.5 0.1 0:00.45 php
16852 librenms 20 0 325672 45808 19872 S 13.2 0.1 0:00.41 php
23410 librenms 20 0 13.363g 145056 2916 S 11.3 0.2 18:24.53 rrdcached
16759 librenms 20 0 325672 46412 19756 S 10.3 0.1 0:00.32 php
943 redis 20 0 62448 9456 2484 R 8.4 0.0 1130:38 redis-server
16585 librenms 20 0 325672 46388 19740 S 6.1 0.1 0:00.35 php
13185 librenms 20 0 327784 47660 19944 S 3.5 0.1 0:00.51 php
14071 librenms 20 0 327784 48092 19908 S 3.5 0.1 0:00.41 php
5747 librenms 20 0 331880 52840 19788 R 3.2 0.1 0:00.83 php
2989 librenms 20 0 333928 53748 19968 R 2.9 0.1 0:01.07 php
6778 librenms 20 0 327784 48796 19772 S 2.9 0.1 0:00.87 php
7720 librenms 20 0 331880 52496 19684 R 2.9 0.1 0:00.63 php
6782 librenms 20 0 327784 48940 19708 S 2.6 0.1 0:00.73 php
9354 librenms 20 0 331880 52100 19940 R 2.6 0.1 0:00.54 php
13414 librenms 20 0 42956 6140 3184 R 2.6 0.0 0:00.54 top
13552 librenms 20 0 325736 46416 19576 R 2.6 0.1 0:00.48 php
838 root 20 0 1465176 123848 7224 S 2.3 0.2 65:35.38 syslog-ng
7860 librenms 20 0 327784 47048 19576 S 2.3 0.1 0:00.59 php
12349 librenms 20 0 327784 47196 19692 S 2.3 0.1 0:00.47 php
1166 librenms 20 0 333928 53616 19832 R 1.9 0.1 0:00.98 php
1192 librenms 20 0 331880 53260 20096 R 1.9 0.1 0:00.64 php
12115 librenms 20 0 327784 47872 19732 S 1.9 0.1 0:00.37 php
6144 librenms 20 0 327784 48868 19820 S 1.6 0.1 0:00.68 php
7769 librenms 20 0 327784 48320 19708 S 1.6 0.1 0:00.69 php
11273 librenms 20 0 325736 46872 19796 S 1.6 0.1 0:00.48 php
11582 librenms 20 0 327784 48608 19844 S 1.6 0.1 0:00.48 php
14072 librenms 20 0 325736 46788 19940 S 1.6 0.1 0:00.35 php
14150 librenms 20 0 325672 45616 19684 S 1.6 0.1 0:00.25 php
2961 librenms 20 0 327784 49204 20088 S 1.3 0.1 0:00.69 php
4190 librenms 20 0 327784 49148 20204 S 1.3 0.1 0:00.70 php
6193 librenms 20 0 327784 49136 20096 S 1.3 0.1 0:00.56 php
7939 librenms 20 0 327784 48168 19808 S 1.3 0.1 0:00.46 php
9134 librenms 20 0 327784 47768 19940 S 1.3 0.1 0:00.41 php
9313 librenms 20 0 327784 48260 19912 S 1.3 0.1 0:00.42 php
10428 librenms 20 0 325736 46632 19740 S 1.3 0.1 0:00.36 php
10444 librenms 20 0 331880 51524 19844 S 1.3 0.1 0:00.39 php
10472 librenms 20 0 327784 47580 19876 R 1.3 0.1 0:00.41 php
11056 librenms 20 0 325736 47232 19892 S 1.3 0.1 0:00.35 php
14022 librenms 20 0 325736 46616 19744 S 1.3 0.1 0:00.33 php
14062 librenms 20 0 325672 45556 19624 S 1.3 0.1 0:00.24 php
16960 librenms 20 0 19060 5572 4192 S 1.3 0.0 0:00.04 snmpbulkwalk
29715 librenms 20 0 335976 57484 19844 S 1.3 0.1 0:01.20 php
30089 root 20 0 0 0 0 D 1.3 0.0 1:06.47 kworker/u41:0

Your load is through the roof.
Do you have htop installed? Run htop, sort it by CPU usage, and see what’s eating up the CPU.
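
If htop is not installed, a similar view can be had from `ps`. A minimal sketch, assuming the procps `ps` found on most Linux systems: `cpu_hogs` is a made-up helper that reads `ps -eo pcpu,pid,etime,args` output and keeps only the processes at or above a CPU threshold.

```shell
# Reads "pcpu pid etime args..." lines on stdin (the header line is skipped)
# and prints those whose %CPU is at or above the threshold given in $1.
cpu_hogs() {
    awk -v min="$1" 'NR > 1 && $1 + 0 >= min + 0 { print }'
}
```

For example: `ps -eo pcpu,pid,etime,args --sort=-pcpu | cpu_hogs 50`. The etime column shows how long each process has been running, and args shows the full command line, which helps tell a stuck script apart from a short-lived poller.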

That’s from Htop…

Looks like the weathermap plugin is eating a lot of cycles for a very long time.

Thanks,
Santosh Kotla

Indeed. Looks like WeatherMaps is choking your system. You really need to find out why.
Here is how htop looks on our master LNMS; mind you, I also have a few WeatherMaps polling:

Here is what you need to do for now. Split your issues into separate ones and troubleshoot them one by one.
Disable any rrdcached config in your LibreNMS config. Leave it for later.
Find out what’s going on with WeatherMaps. You have got to get the load down, otherwise this system will struggle badly.

What’s your poll cycle? 5 minutes or less?
How many polling threads are set to work simultaneously?

Check those, you might be overloading the server with too many polls.

Once you get the load issue fixed and stable, you can move on and test rrdcached again.

Hi @kalamchi75

Will do. I will try working on the weathermaps and see what is causing the load.

As for the polling cycle, I remember it being set to 1 min. It was set up by an earlier member of the team who is not with the company anymore. But weathermaps has always been fine, not sure what is causing it to eat CPU like this…

Rrdcached has been restored to the default string for now.

Thanks,
Santosh Kotla

Here is a thought:
Check if your pollers are actually able to finish polling all the devices within one minute. If not, this will cause load to increase. Just a thought.
My server is set to poll every 5 minutes (I do know that some people want a 1-minute poll). I noticed that my pollers were not able to finish in one minute, but they are happy with 5 minutes.


As you can see, my master poller needs 95 seconds to poll the 240+ devices assigned to it. Mind you, all those devices are in the same network/physical location.
So clearly, in my case, a 1-minute poll won’t cut it.

Investigate that
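
The check boils down to comparing the time a full poll takes with the polling interval. A tiny sketch using the 95-second figure from above: `fits_interval` is a made-up helper, and both arguments are in seconds.

```shell
# Succeeds if a poll run of $1 seconds completes within an interval of $2 seconds.
fits_interval() {
    [ "$1" -le "$2" ]
}

for interval in 60 300; do
    if fits_interval 95 "$interval"; then
        echo "a 95s poll fits into a ${interval}s interval"
    else
        echo "a 95s poll overruns a ${interval}s interval, so runs may overlap"
    fi
done
```

When runs overlap, each new cycle starts while the previous one is still busy, which is one way load can keep climbing.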

Hi @kalamchi75

Thanks for the suggestions. I have the central server polling about 1300 devices. On second thought, I think the server is running on a 5-minute poll. I will check that once I am back at my desk, as I have to go and bring things for the upcoming festival.

Once I have clarified the same, I will post my findings to you. Also I am running on a backup config as I have lost the entire config file during the database restructure. I will try to get that config from a storage backup. Unfortunately I need to depend on another team to get that.

I’ll post my findings to you here when I’m back to my desk.

Thanks,
Santosh Kotla

Hi Santosh,

Sure man. Good luck.
But just as I mentioned before, divide the issues and work on them one by one, starting with stabilizing the load and making sure your LNMS is working/graphing properly. Then work out what to do next with rrdcached, remote pollers, etc.

Check your librenms.cron and inspect the following line

    */5  *    * * *   root    /opt/librenms/cronic /opt/librenms/poller-wrapper.py 16

16 is the number of concurrent poller processes. How many do you have? Perhaps you need to lower that number and see if it reduces the load.

Just a tip!
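
To read the thread count out of the cron file without opening it, something like the following could work. It is only a sketch: `poller_threads` is a made-up helper, and it assumes the count is the last field on the `poller-wrapper.py` line, as in the example above.

```shell
# Reads cron lines on stdin; for each active (uncommented) line that calls
# poller-wrapper.py, prints its last field, i.e. the thread count.
poller_threads() {
    awk '!/^[ \t]*#/ && /poller-wrapper\.py/ { print $NF }'
}
```

For example: `poller_threads < /etc/cron.d/librenms`, assuming that is where your cron file was installed. If the line ends with a redirect such as `>> /dev/null 2>&1`, the last field will not be the thread count and the helper would need adjusting.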

Hi @kalamchi75

Sure thing sir. I will take a look at that as well and get back to you in a day or two.

Thanks,
Santosh Kotla

Hi @kalamchi75

I took a look at the htop after a good 4 days and I see that the weathermap seems to be the culprit for the whole CPU utilization on 20 cores. Is there a way I can disable ONLY that specific PHP?

I tried using the command phpdismod but I got the error stating that it doesn’t see it under the list of mods available in php 7.3

librenms@dc5up-vlibrenms01:~/html/plugins/Weathermap$ phpdismod map-poller.php
WARNING: Module map-poller.php ini file doesn’t exist under /etc/php/7.3/mods-available
WARNING: Module map-poller.php ini file doesn’t exist under /etc/php/7.3/mods-available
WARNING: Module map-poller.php ini file doesn’t exist under /etc/php/7.3/mods-available

These are all that are available under mods-available for me.

Thanks,
Santosh Kotla

Hi Santosh,

You can disable the plugin from the GUI. Overview → Plugins → Plugin Admin

This should disable Weathermaps. You can disable it for some time, say an hour or so and see how load changes.
As for how to change the polling frequency for the weathermaps, I am not really sure you can do it specifically for that plugin, since I think it actually uses the main poller service frequency (in your case one-minute polling, if I remember correctly).

Have you checked your librenms.cron for the number of concurrent poller instances?

    */5  *    * * *   root    /opt/librenms/cronic /opt/librenms/poller-wrapper.py 16

Hi @kalamchi75

I disabled the weathermap for now and this is what it looks like. Doesn’t look like anything changed wrt the load on the cores:

I manually sent SIGTERM to them, and the pollers don’t seem to be coming up again.

I also checked the cron file and it does have 16 at the end. I reduced this to 8, but the comment above the line also says not to use this file.