I have 16 pollers, 15 of which I have upgraded from CentOS Linux release 7.9.2009 (Core) to RHEL 8.9 (Ootpa). Since then we are seeing a lot of zombie processes being created, and I need to restart the services to get rid of the defunct/zombie processes.
The WebUI server has the configuration below and is still running on CentOS 7.9 for now, but we will upgrade its OS to RHEL 8.9 soon as well.
-bash-4.2$ ./validate.php
Component | Version
--------- | -------
LibreNMS  | 24.2.0 (2024-02-27T19:54:10+01:00)
DB Schema | 2024_02_07_151845_custom_map_additions (290)
PHP       | 8.1.27
Python    | 3.6.12
Database  | MariaDB 10.2.33-MariaDB-log
RRDTool   | 1.7.1
SNMP      | 5.7.2
===========================================
[OK] Composer Version: 2.7.1
[OK] Dependencies up-to-date.
[OK] Database connection successful
[OK] Database Schema is current
[OK] SQL Server meets minimum requirements
[OK] lower_case_table_names is enabled
[OK] MySQL engine is optimal
[OK] Database and column collations are correct
[OK] Database schema correct
[OK] MySQL and PHP time match
[OK] Distributed Polling setting is enabled globally
[OK] Connected to rrdcached
[OK] Active pollers found
[OK] Dispatcher Service is enabled
[OK] Locks are functional
[OK] Python wrapper cron entry is not present
[OK] Redis is functional
[OK] rrdtool version ok
[OK] Connected to rrdcached
bash-4.2$ ./daily.sh
Fetching new release information OK
Updating to latest release OK
Updating Composer packages OK
Updating SQL-Schema OK
Updating submodules OK
Cleaning up DB OK
Fetching notifications OK
Caching PeeringDB data OK
One modification that was required post OS upgrade: we changed the ExecStart line for the librenms service at /etc/systemd/system/librenms.service, since the rh-python36 SCL package is no longer available in RHEL 8.9, from
ExecStart=/usr/bin/scl enable rh-python36 -- /opt/librenms/librenms-service.py -v
to
ExecStart=/opt/librenms/librenms-service.py -v
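For completeness, after editing the unit file systemd needs to re-read it before the change takes effect (standard systemd steps, nothing LibreNMS-specific):

```shell
# Re-read unit files and restart the dispatcher with the new ExecStart
sudo systemctl daemon-reload
sudo systemctl restart librenms
```

Note that invoking the script directly like this assumes /opt/librenms/librenms-service.py has a python3 shebang and the executable bit set.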
And we also installed PHP 8.1 by following the steps in the Installing LibreNMS docs, CentOS 8 and Apache section.
php --version
PHP 8.1.27 (cli) (built: Dec 19 2023 20:35:55) (NTS gcc x86_64)
Copyright (c) The PHP Group
Zend Engine v4.1.27, Copyright (c) Zend Technologies
with Zend OPcache v8.1.27, Copyright (c), by Zend Technologies
I have gone through the librenms/daily logs on multiple pollers but haven't found any particular error.
Kindly help me resolve this issue; I have been trying to troubleshoot it for the last 20 days.
Can anyone assist here? I know of 4 environments, including mine, affected by defunct processes post OS upgrade, and it is not restricted to any particular OS: my servers run RHEL 8.9 and the others mostly run Ubuntu. @ItsmeTelemetry @mattias.agar do you use the dispatcher service or the cron job for distributed pollers? In my case I have the librenms service running on all 15 of my pollers and I am not using the cron job.
Also, you can go through the article below from another environment; maybe that helps.
Can you go through the case logs from March 5th? I have mentioned all the information there. I have finished upgrading the WebUI server as well, and now all 15 pollers and the WebUI are running on RHEL 8.9.
Earlier we were on CentOS 7.9, which used the rh-python36 package; since it's not available in RHEL 8.9, we edited the librenms service file from
ExecStart=/usr/bin/scl enable rh-python36 -- /opt/librenms/librenms-service.py -v
to
ExecStart=/opt/librenms/librenms-service.py -v
I am not using cron for the distributed pollers; instead the librenms service is running on all my pollers.
We also installed PHP 8.1 by following the steps in the Installing LibreNMS docs, CentOS 8 and Apache section.
php --version
PHP 8.1.27 (cli) (built: Dec 19 2023 20:35:55) (NTS gcc x86_64)
Copyright (c) The PHP Group
Zend Engine v4.1.27, Copyright (c) Zend Technologies
with Zend OPcache v8.1.27, Copyright (c), by Zend Technologies
Also attaching the latest validate.php and daily.sh output for you to go through.
It would be great if I could track exactly what is causing the zombies. They get cleared once we restart the services.
@murrant I completed the upgrade on March 21st, and just now it got upgraded from 24.2.0 to 24.3.0 as well; sharing the latest logs.
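To make the mechanics concrete: a process becomes defunct when it exits but its parent never calls wait() on it. A minimal, self-contained sketch (my own illustration for Linux, not LibreNMS code) that deliberately creates a zombie and then reaps it:

```python
import os
import time

# Fork a child that exits immediately. Until the parent wait()s on it,
# the kernel keeps it around as a zombie (state "Z" in /proc/<pid>/stat).
pid = os.fork()
if pid == 0:
    os._exit(0)            # child: exit without doing anything

time.sleep(0.5)            # give the child time to exit

# The third field of /proc/<pid>/stat is the process state
with open(f"/proc/{pid}/stat") as f:
    state = f.read().split()[2]
print(f"child {pid} is in state {state}")   # "Z" means defunct

os.waitpid(pid, 0)         # reaping the child removes the zombie entry
```

On a live poller, `ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'` lists the defunct processes with their parent PIDs; the PPID column tells you which process is failing to reap its children.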
[librenms@myserver ~]$ ./validate.php
Component | Version
--------- | -------
LibreNMS  | 24.3.0 (2024-04-01T17:18:44+02:00)
DB Schema | 2024_02_07_151845_custom_map_additions (290)
PHP       | 8.1.27
Python    | 3.6.8
Database  | MariaDB 10.3.39-MariaDB-log
RRDTool   | 1.7.2
SNMP      | 5.8
===========================================
[OK] Composer Version: 2.7.2
[OK] Dependencies up-to-date.
[OK] Database connection successful
[OK] Database Schema is current
[OK] SQL Server meets minimum requirements
[OK] lower_case_table_names is enabled
[OK] MySQL engine is optimal
[OK] Database and column collations are correct
[OK] Database schema correct
[OK] MySQL and PHP time match
[OK] Distributed Polling setting is enabled globally
[OK] Connected to rrdcached
[OK] Active pollers found
[OK] Dispatcher Service is enabled
[OK] Locks are functional
[OK] Python wrapper cron entry is not present
[OK] Redis is functional
[OK] rrdtool version ok
[OK] Connected to rrdcached
[librenms@myserver ~]$ ./daily.sh
Fetching new release information OK
Updating to latest release OK
Updating Composer packages OK
Updating SQL-Schema OK
Updating submodules OK
Cleaning up DB OK
Fetching notifications OK
Caching PeeringDB data OK
It happens around midnight; what can I look for? Let me know if you require any other logs.
Also, it's something related to the pollers in general: on all my pollers, something runs at midnight that increases the defunct process count, and it keeps piling up until I restart the librenms service. However, I last restarted my WebUI server on April 10th, and since then I have 107 defunct processes on that machine as of today.
But on all the pollers I need to restart the librenms service every 4th day.
Not sure about this either, as all 4 of us are running different Python versions.
I am using Python 3.6.8, @mattias.agar is using Python 3.8.10, and @ItsmeTelemetry is using Python 3.10.12, the same as yours, yet he is having the same issue.
Which logs/processes can I look at further to check?
And can I go ahead and try upgrading Python to 3.10 on one of the poller machines to see if that makes any difference? It should not create any issue in the environment, should it?
I am using Python 3.10 and still have the zombie processes. As @Vatanasha said, something runs at midnight that increases the zombie processes. I tried increasing `pm.max_children` in /etc/php/8.1/fpm/pool.d/www.conf and it didn't solve the issue. The zombie processes are related to python3:
|-python3(2148779)-+-php(1318711)
It seems the processes are related to polling, so changing the FPM child limit will not help.
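One way to confirm that the defuncts all hang off python3 (the dispatcher) rather than PHP-FPM is to scan /proc for them and report their parent PIDs. A small Linux-only sketch (my own, not LibreNMS code):

```python
import os

# Scan /proc for defunct (state "Z") processes; the PPID is the
# process that is failing to reap its children.
zombies = []
for entry in os.listdir("/proc"):
    if not entry.isdigit():
        continue
    try:
        with open(f"/proc/{entry}/stat") as f:
            # Split after the ")" that closes the comm field, so
            # command names containing spaces don't break parsing.
            fields = f.read().rsplit(")", 1)[1].split()
        state, ppid = fields[0], fields[1]
    except (FileNotFoundError, ProcessLookupError, IndexError):
        continue               # process vanished mid-scan
    if state == "Z":
        zombies.append((int(entry), int(ppid)))

print(f"defunct processes: {len(zombies)}")
for pid, ppid in zombies:
    print(f"  pid {pid} waiting to be reaped by ppid {ppid}")
```

If every PPID maps back to the librenms-service.py python3 process, that confirms the zombies come from the dispatcher and the FPM pool settings are irrelevant.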
Yeah, it is the maintenance script.
It stops all pollers and relaunches the dispatcher service. It is supposed to wait for them to finish before relaunching the service; otherwise it will orphan those processes (and they will become zombies after they exit).
It can take a long time to wait for the poller processes to finish. The ultimate fix is probably to reparent the poller processes before restarting the dispatcher service, but that is beyond my Python knowledge. That way they can exit and be cleaned up normally even after the original parent (the dispatcher service) has exited.
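For reference, the waiting half of that fix is straightforward; the sketch below (my own illustration, not the dispatcher's actual code) shows a parent reaping all of its children with waitpid(WNOHANG) before it would re-exec itself. The reparenting part, e.g. via the Linux PR_SET_CHILD_SUBREAPER prctl, is the harder piece and is not shown:

```python
import os
import time

def reap_all_children(timeout=30.0):
    """Block until every direct child has been reaped, or the timeout
    expires. A dispatcher would call this before re-exec'ing itself so
    finished pollers never linger as zombies."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return True        # no children left: everything reaped
        if pid == 0:
            time.sleep(0.1)    # children still running; poll again
    return False               # timed out with children outstanding

# Demo: start two short-lived "pollers", then reap them before the
# hypothetical restart.
for _ in range(2):
    if os.fork() == 0:
        time.sleep(0.2)
        os._exit(0)

ok = reap_all_children()
print("all pollers reaped:", ok)
```

The trade-off murrant describes is exactly the timeout here: waiting is simple but can stall the restart for as long as the slowest poller runs.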