Can anyone assist here? Any thoughts?
Experiencing the same issue after an OS upgrade, albeit on Ubuntu.
Hi Mattias,
Do you have the same environment, i.e. the same PHP and Python versions?
I have upgraded my servers from CentOS 7.9 to RHEL 8.9. What’s your environment like?
| Component | Version |
|---|---|
| LibreNMS | 24.3.0 (2024-04-01T17:18:44+02:00) |
| DB Schema | 2024_02_07_151845_custom_map_additions (290) |
| PHP | 8.1.10 |
| Python | 3.8.10 |
| Database | MariaDB 10.5.22-MariaDB-1:10.5.22+maria~ubu1804 |
| RRDTool | 1.7.2 |
| SNMP | 5.8 |
Upgraded from Ubuntu 18.04 to 20.04
Same problem here
| Component | Version |
|---|---|
| LibreNMS | 24.3.0 (2024-04-01T15:18:44+00:00) |
| DB Schema | 2024_02_07_151845_custom_map_additions (290) |
| PHP | 8.1.2-1ubuntu2.14 |
| Python | 3.10.12 |
| Database | MariaDB 10.6.16-MariaDB-0ubuntu0.22.04.1-log |
| RRDTool | 1.7.2 |
| SNMP | 5.9.1 |
Ubuntu 22.04.4 LTS, installed from scratch.
Hi all,
Can anyone assist here? I know of four environments, including mine, that are affected by defunct processes after an OS upgrade, and it is not restricted to any particular OS. My servers run RHEL 8.9 and the others mostly run Ubuntu. @ItsmeTelemetry @mattias.agar do you use the Dispatcher service or the cron job for distributed pollers? In my case the librenms service runs on all 15 of my pollers and I am not using the cron job.
You can also go through the article below from another environment. Maybe that helps.
Regards
Vatansha
I am using the service. We have a 5-minute poller for over 250 devices, the same as yours, with smokeping integration and only the monthly stable releases.
I switched back to using the cron job and it resolved all of my issues. No more zombies.
No issues on my Ubuntu server. I wonder what change they made would cause that?
Hi Murrant,
Can you go through the case logs from March 5th? I have mentioned all the information there. I have completed upgrading the WebUI server as well, and now all 15 pollers and the WebUI are running on RHEL 8.9.
Earlier we were on CentOS 7.9, which uses the rh-python36 package. Since it is not available in RHEL 8.9, I edited the librenms service file as below:
ExecStart=/usr/bin/scl enable rh-python36 -- /opt/librenms/librenms-service.py -v
To
ExecStart=/opt/librenms/librenms-service.py -v
I am not using cron for the distributed pollers; instead, the librenms service runs on all my pollers.
We also installed PHP 8.1 by following the steps in Installing LibreNMS - LibreNMS Docs (CentOS 8 and Apache section) and ran the commands below as given in the document:
yum install python3-dotenv
dnf -y install epel-release
dnf -y install dnf-utils http://rpms.remirepo.net/enterprise/remi-release-8.rpm
dnf module reset php
dnf module enable php:remi-8.1
dnf install bash-completion cronie fping gcc git httpd ImageMagick mariadb-server mtr net-snmp net-snmp-utils nmap php-fpm php-cli php-common php-curl php-gd php-gmp php-json php-mbstring php-process php-snmp php-xml php-zip php-mysqlnd python3 python3-devel python3-PyMySQL python3-redis python3-memcached python3-pip python3-systemd rrdtool unzip php-fedora-autoloader-1.0.1-7.el8.noarch
php --version
PHP 8.1.27 (cli) (built: Dec 19 2023 20:35:55) (NTS gcc x86_64)
Copyright (c) The PHP Group
Zend Engine v4.1.27, Copyright (c) Zend Technologies
with Zend OPcache v8.1.27, Copyright (c), by Zend Technologies
I am also attaching the latest validate.php and daily.sh output for you to go through.
It would be great if I could track down what exactly is causing the zombies. They are cleared once we restart the services.
@murrant I completed the upgrade on March 21st, and LibreNMS has just now been upgraded from 24.2.0 to 24.3.0 as well. Sharing the latest logs:
[librenms@myserver ~]$ ./validate.php
| Component | Version |
|---|---|
| LibreNMS | 24.3.0 (2024-04-01T17:18:44+02:00) |
| DB Schema | 2024_02_07_151845_custom_map_additions (290) |
| PHP | 8.1.27 |
| Python | 3.6.8 |
| Database | MariaDB 10.3.39-MariaDB-log |
| RRDTool | 1.7.2 |
| SNMP | 5.8 |
[OK] Composer Version: 2.7.2
[OK] Dependencies up-to-date.
[OK] Database connection successful
[OK] Database Schema is current
[OK] SQL Server meets minimum requirements
[OK] lower_case_table_names is enabled
[OK] MySQL engine is optimal
[OK] Database and column collations are correct
[OK] Database schema correct
[OK] MySQL and PHP time match
[OK] Distributed Polling setting is enabled globally
[OK] Connected to rrdcached
[OK] Active pollers found
[OK] Dispatcher Service is enabled
[OK] Locks are functional
[OK] Python wrapper cron entry is not present
[OK] Redis is functional
[OK] rrdtool version ok
[OK] Connected to rrdcached
[librenms@myserver ~]$ ./daily.sh
Fetching new release information OK
Updating to latest release OK
Updating Composer packages OK
Updating SQL-Schema OK
Updating submodules OK
Cleaning up DB OK
Fetching notifications OK
Caching PeeringDB data OK
It happens around midnight. What can I look for? Let me know if you need any other logs.
It also seems related to the pollers in general: on all my pollers, something runs at midnight that increases the defunct process count, and it keeps piling up until I restart the librenms service. However, I last restarted my WebUI server on April 10th, and since then 107 defunct processes have accumulated on that machine.
On all the pollers, though, I need to restart the librenms service every fourth day.
Regards
Vatansha
Hmm, I’m using Python 3.10. Was there a bug in older Python versions perhaps?
Not sure about this either, as the four of us are running different Python versions.
I am using Python 3.6.8.
@mattias.agar is using Python 3.8.10.
@ItsmeTelemetry is using Python 3.10.12, the same as yours, but he is also having the same issue.
Which logs or processes can I look at further?
Can I go ahead and try upgrading Python to 3.10 on one of the poller machines to see whether that makes any difference? And would that create any issue in the environment?
I am using Python 3.10 and still have the zombie processes. As @Vatanasha said, something is running at midnight that increases the zombie processes. I tried increasing the ‘pm. child’ setting in /etc/php/8.1/fpm/pool.d/www.conf, but that didn’t solve the issue. The zombie processes are children of python3:
|-python3(2148779)-+-php(1318711)
librenms 1318711 0.0 0.0 0 0 ? Zs Apr18 0:02 [php]
Seems like the processes are related to polling, so changing the fpm child limit will not help.
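For anyone who wants to check which parent the defunct processes hang off, here is a rough Python sketch. It assumes a Linux host where ps accepts -eo pid,ppid,stat,comm; the function names (list_processes, find_zombies, zombies_per_parent) are made up for this example and are not part of LibreNMS. It sticks to subprocess.check_output with universal_newlines so it should also run on Python 3.6.

```python
import subprocess
from collections import Counter


def list_processes():
    """Return raw `ps` output, one process per line (Linux procps assumed)."""
    return subprocess.check_output(
        ["ps", "-eo", "pid,ppid,stat,comm"], universal_newlines=True
    )


def find_zombies(ps_output):
    """Return (pid, ppid, command) for every process whose state starts with Z."""
    zombies = []
    for line in ps_output.strip().splitlines()[1:]:  # first line is the header
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue  # skip malformed lines
        pid, ppid, stat, comm = fields
        if stat.startswith("Z"):
            zombies.append((int(pid), int(ppid), comm.strip()))
    return zombies


def zombies_per_parent(zombies):
    """Count zombies per parent PID, to show which process is not reaping them."""
    return Counter(ppid for _pid, ppid, _comm in zombies)
```

Calling zombies_per_parent(find_zombies(list_processes())).most_common() on a poller should list the parent PIDs with the most defunct children first. If the top entry is the python3 dispatcher PID, that matches the process tree posted above.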
Yeah, it is the maintenance script.
It stops all pollers and relaunches the dispatcher service. It is supposed to wait for them to finish before relaunching the service, otherwise it will orphan those processes (and they will become zombies after they exit).
It can take a long time to wait for the poller processes to finish. The ultimate fix is probably to reparent the poller processes before restarting the dispatcher service, but this is beyond my Python knowledge. That way they can exit and be cleaned up normally even after the original parent (dispatcher service) has exited.
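To illustrate the point about waiting: a child that has exited stays in the process table as a zombie until its parent collects the exit status. The sketch below is a generic Python illustration, not the LibreNMS dispatcher code. spawn_short_lived_children and reap are hypothetical names, and the children are trivial Python one-liners standing in for poller jobs.

```python
import subprocess
import sys


def spawn_short_lived_children(count):
    """Start `count` child processes that exit right away (stand-ins for poller jobs)."""
    return [subprocess.Popen([sys.executable, "-c", "pass"]) for _ in range(count)]


def reap(children, timeout=30.0):
    """Wait for each child and collect its exit code.

    Until wait() (or poll()) collects the status, an exited child remains
    a zombie (<defunct>) for as long as the parent process is alive.
    """
    return [child.wait(timeout=timeout) for child in children]
```

If a parent stays alive but loses track of its children without waiting on them, nothing collects those exit codes. That would be consistent with the defunct php entries shown under python3 earlier in the thread, though whether this is exactly what happens during the dispatcher restart is an assumption here.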
Looks like Adam Bishop wrote some code related to restarting the dispatcher.
What does this return?
python3 -c "import psutil"; echo $?
Running python3 -c "import psutil"; echo $? returns 0.
Looks like you would call self._stop_managers() instead of self._stop_managers_and_wait() inside restart()
That is about as far as I can go.
Where can I edit this?