Zombie/Defunct process issue generated post OS upgrade of distributed pollers from Centos 7.9 to RHEL 8.9

Vatansha · 5 March 2024 08:27

I have around 16 pollers, 15 of which I have upgraded from CentOS Linux release 7.9.2009 (Core) to RHEL 8.9 (Ootpa). Since then we have been seeing a lot of zombie processes being created, and I need to restart the services to get rid of the defunct/zombie processes.

The WebUI server has the configuration below. It is still running on CentOS 7.9 for now, but we will upgrade its OS to RHEL 8.9 soon as well.

-bash-4.2$ ./validate.php

Component	Version
LibreNMS	24.2.0 (2024-02-27T19:54:10+01:00)
DB Schema	2024_02_07_151845_custom_map_additions (290)
PHP	8.1.27
Python	3.6.12
Database	MariaDB 10.2.33-MariaDB-log
RRDTool	1.7.1
SNMP	5.7.2
===========================================

[OK] Composer Version: 2.7.1
[OK] Dependencies up-to-date.
[OK] Database connection successful
[OK] Database Schema is current
[OK] SQL Server meets minimum requirements
[OK] lower_case_table_names is enabled
[OK] MySQL engine is optimal
[OK] Database and column collations are correct
[OK] Database schema correct
[OK] MySQL and PHP time match
[OK] Distributed Polling setting is enabled globally
[OK] Connected to rrdcached
[OK] Active pollers found
[OK] Dispatcher Service is enabled
[OK] Locks are functional
[OK] Python wrapper cron entry is not present
[OK] Redis is functional
[OK] rrdtool version ok
[OK] Connected to rrdcached

bash-4.2$ ./daily.sh
Fetching new release information OK
Updating to latest release OK
Updating Composer packages OK
Updating SQL-Schema OK
Updating submodules OK
Cleaning up DB OK
Fetching notifications OK
Caching PeeringDB data OK

One modification was required after the OS upgrade: we changed the ExecStart line of the librenms service in /etc/systemd/system/librenms.service, since the rh-python36 package is no longer available in RHEL 8.9.

ExecStart=/usr/bin/scl enable rh-python36 -- /opt/librenms/librenms-service.py -v
To
ExecStart=/opt/librenms/librenms-service.py -v

We also installed PHP 8.1 by following the steps in the “CentOS 8 and Apache” section of Installing LibreNMS - LibreNMS Docs.

php --version
PHP 8.1.27 (cli) (built: Dec 19 2023 20:35:55) (NTS gcc x86_64)
Copyright (c) The PHP Group
Zend Engine v4.1.27, Copyright (c) Zend Technologies
with Zend OPcache v8.1.27, Copyright (c), by Zend Technologies

I have gone through the librenms and daily logs on multiple pollers but haven’t found any particular error.

Kindly help me resolve this, as I have been trying to troubleshoot it for the last 20 days.

Thank you in advance.

Regards
Vatansha

Vatansha · 5 March 2024 08:31

Vatansha · 13 March 2024 06:25

Can anyone assist here? Any thoughts?

mattias.agar · 16 April 2024 08:10

Experiencing the same issue after an OS upgrade, albeit on Ubuntu.

Vatansha · 18 April 2024 10:55

Hi Mattias,

Do you have the same environment, i.e. the same PHP and Python versions?

I have upgraded my servers from Centos 7.9 to RHEL 8.9. What’s your environment like?

mattias.agar · 18 April 2024 13:06

Component	Version
LibreNMS	24.3.0 (2024-04-01T17:18:44+02:00)
DB Schema	2024_02_07_151845_custom_map_additions (290)
PHP	8.1.10
Python	3.8.10
Database	MariaDB 10.5.22-MariaDB-1:10.5.22+maria~ubu1804
RRDTool	1.7.2
SNMP	5.8

Upgraded from Ubuntu 18.04 to 20.04

ItsmeTelemetry · 18 April 2024 19:46

Same problem here

===========================================

Component	Version
LibreNMS	24.3.0 (2024-04-01T15:18:44+00:00)
DB Schema	2024_02_07_151845_custom_map_additions (290)
PHP	8.1.2-1ubuntu2.14
Python	3.10.12
Database	MariaDB 10.6.16-MariaDB-0ubuntu0.22.04.1-log
RRDTool	1.7.2
SNMP	5.9.1
===========================================

Ubuntu 22.04.4 LTS. Installed from zero

Vatansha · 19 April 2024 07:26

Hi all,

Can anyone assist here? I know of 4 environments, including mine, affected by defunct processes after an OS upgrade, and it isn’t restricted to any particular OS: my servers run RHEL 8.9 and the others mostly run Ubuntu. @ItsmeTelemetry @mattias.agar, do you use the Dispatcher service or the cron job for distributed pollers? In my case I have the librenms service running on all 15 of my pollers and I am not using the cron job.

You can also go through the article below from another environment. Maybe that helps.

Regards
Vatansha

ItsmeTelemetry · 19 April 2024 13:36

I am using the service. We have 5-minute polling for over 250 devices. Same as yours, with smokeping integration, and only the monthly stable versions.

jaysen · 19 April 2024 13:38

I switched back to using the cron job and it resolved all of my issues. No more zombies.

murrant · 19 April 2024 14:23

No issues on my Ubuntu server. I wonder what change they made would cause that?

Vatansha · 19 April 2024 14:46

Hi Murrant,

Can you go through the details I posted on March 5th? I have mentioned all the information there. I have since completed upgrading the WebUI server as well, and now all 15 pollers and the WebUI are running on RHEL 8.9.
Earlier we were on CentOS 7.9, which uses the rh-python36 package. Since it’s not available in RHEL 8.9, I edited the librenms service file as below:
ExecStart=/usr/bin/scl enable rh-python36 -- /opt/librenms/librenms-service.py -v
To
ExecStart=/opt/librenms/librenms-service.py -v

I am not using cron for the distributed pollers; instead, the librenms service runs on all my pollers.

We also installed PHP 8.1 by following the steps in the “CentOS 8 and Apache” section of Installing LibreNMS - LibreNMS Docs, and ran the commands below as given in the document:

yum install python3-dotenv
dnf -y install epel-release
dnf -y install dnf-utils http://rpms.remirepo.net/enterprise/remi-release-8.rpm
dnf module reset php
dnf module enable php:remi-8.1
dnf install bash-completion cronie fping gcc git httpd ImageMagick mariadb-server mtr net-snmp net-snmp-utils nmap php-fpm php-cli php-common php-curl php-gd php-gmp php-json php-mbstring php-process php-snmp php-xml php-zip php-mysqlnd python3 python3-devel python3-PyMySQL python3-redis python3-memcached python3-pip python3-systemd rrdtool unzip php-fedora-autoloader-1.0.1-7.el8.noarch

php --version
PHP 8.1.27 (cli) (built: Dec 19 2023 20:35:55) (NTS gcc x86_64)
Copyright (c) The PHP Group
Zend Engine v4.1.27, Copyright (c) Zend Technologies
with Zend OPcache v8.1.27, Copyright (c), by Zend Technologies

I am also attaching the latest validate.php and daily.sh output for you to go through.
It would be great if I could track down what exactly is causing the zombies. They are cleared once we restart the services.

Vatansha · 19 April 2024 15:01

@murrant I completed the OS upgrade on March 21st, and LibreNMS has just now been upgraded as well, from 24.2.0 to 24.3.0. Sharing the latest logs:

[librenms@myserver ~]$ ./validate.php

Component	Version
LibreNMS	24.3.0 (2024-04-01T17:18:44+02:00)
DB Schema	2024_02_07_151845_custom_map_additions (290)
PHP	8.1.27
Python	3.6.8
Database	MariaDB 10.3.39-MariaDB-log
RRDTool	1.7.2
SNMP	5.8
===========================================

[OK] Composer Version: 2.7.2
[OK] Dependencies up-to-date.
[OK] Database connection successful
[OK] Database Schema is current
[OK] SQL Server meets minimum requirements
[OK] lower_case_table_names is enabled
[OK] MySQL engine is optimal
[OK] Database and column collations are correct
[OK] Database schema correct
[OK] MySQL and PHP time match
[OK] Distributed Polling setting is enabled globally
[OK] Connected to rrdcached
[OK] Active pollers found
[OK] Dispatcher Service is enabled
[OK] Locks are functional
[OK] Python wrapper cron entry is not present
[OK] Redis is functional
[OK] rrdtool version ok
[OK] Connected to rrdcached

[librenms@myserver ~]$ ./daily.sh
Fetching new release information OK
Updating to latest release OK
Updating Composer packages OK
Updating SQL-Schema OK
Updating submodules OK
Cleaning up DB OK
Fetching notifications OK
Caching PeeringDB data OK

It happens around midnight. What can I look for? Let me know if you require any other logs.

Vatansha · 19 April 2024 15:11

Also, it seems to be related to the pollers in general: on all my pollers, something runs at midnight that increases the defunct process count, and the count keeps piling up until I restart the librenms service. On the WebUI server, however, I last restarted on April 10th and 107 defunct processes have accumulated on that machine since then.

On all the pollers, though, I need to restart the librenms service every 4th day.

Regards
Vatansha

murrant · 19 April 2024 15:24

Hmm, I’m using Python 3.10. Was there a bug in older python versions perhaps?

Vatansha · 19 April 2024 15:42

Not sure about that either, as all 4 of us are running different Python versions.
I am using Python 3.6.8.
@mattias.agar is using Python 3.8.10.
@ItsmeTelemetry is using Python 3.10.12, the same as yours, but he has the same issue.

Which logs/processes can I check further?

Can I go ahead and try upgrading Python to 3.10 on one of the poller machines to see if that makes any difference? And would that create any issue in the environment?

ItsmeTelemetry · 19 April 2024 16:40

I am using Python 3.10 and still have the zombie processes. As @Vatansha said, something running at midnight increases the number of zombie processes. I tried increasing the ‘pm.’ child settings in /etc/php/8.1/fpm/pool.d/www.conf, but that didn’t solve the issue. The zombie processes are children of python3:
|-python3(2148779)-+-php(1318711)

librenms 1318711 0.0 0.0 0 0 ? Zs Apr18 0:02 [php]
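
For anyone who wants to see which parent is holding the zombies, here is a minimal Python sketch. It is not part of LibreNMS, and the function names `parse_stat_line` and `zombies_by_parent` are made up for illustration. It assumes Linux and reads `/proc/<pid>/stat`, grouping every process in state `Z` by its parent PID, which is the same relationship the pstree and ps output above shows.

```python
#!/usr/bin/env python3
"""List zombie processes grouped by parent PID (Linux only, reads /proc)."""
import os
from collections import defaultdict


def parse_stat_line(line):
    """Return (pid, comm, state, ppid) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split around the first '(' and the last ')'.
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    comm = line[open_paren + 1:close_paren]
    fields = line[close_paren + 1:].split()
    return pid, comm, fields[0], int(fields[1])


def zombies_by_parent(proc_root="/proc"):
    """Map parent PID -> list of (pid, comm) for every process in state Z."""
    result = defaultdict(list)
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state, ppid = parse_stat_line(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process vanished or the entry could not be parsed
        if state == "Z":
            result[ppid].append((pid, comm))
    return dict(result)


if __name__ == "__main__":
    for ppid, children in sorted(zombies_by_parent().items()):
        print(f"parent {ppid}: {len(children)} zombie(s)")
        for pid, comm in children:
            print(f"  {pid} {comm}")
```

Running it once before midnight and once after should show whether the count under the dispatcher's python3 PID is the one that grows.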

murrant · 19 April 2024 20:34

Seems like the processes are related to polling, so changing the fpm child limit will not help.

Yeah, it is the maintenance script.

It stops all pollers and relaunches the dispatcher service. It is supposed to wait for them to finish before relaunching the service, otherwise it will orphan those processes (and they will become zombies after they exit).

It can take a long time to wait for the poller processes to finish. The ultimate fix is probably to reparent the poller processes before restarting the dispatcher service. But this is beyond my python knowledge. That way they can exit and be cleaned up normally even after the original parent (dispatcher service) has exited.
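
To illustrate the "wait for them to finish" step: a child that has exited stays a zombie until its parent collects its exit status with wait. The following is only a sketch of that idea in Python, not the dispatcher's actual code. `spawn_worker` and `reap_children` are hypothetical names, and the forked sleepers merely stand in for poller processes.

```python
import os
import time


def spawn_worker(seconds):
    """Fork a stand-in worker that sleeps, then exits. Returns the child PID."""
    pid = os.fork()
    if pid == 0:
        # Child: pretend to do some polling work, then exit immediately.
        time.sleep(seconds)
        os._exit(0)
    return pid


def reap_children(timeout=30.0, poll_interval=0.1):
    """Collect exit statuses of children until none remain or timeout expires.

    Returns the list of PIDs that were reaped.
    """
    reaped = []
    deadline = time.monotonic() + timeout
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children left at all
        if pid:
            reaped.append(pid)  # an exited child was collected
        elif time.monotonic() >= deadline:
            break  # children are still running; stop waiting
        else:
            time.sleep(poll_interval)
    return reaped


if __name__ == "__main__":
    workers = [spawn_worker(0.2) for _ in range(3)]
    done = reap_children(timeout=5)
    print(f"reaped {len(done)} of {len(workers)} children")
    # Only after this point would a restart of the parent avoid leaving
    # already-exited children unreaped.
```

The timeout reflects the trade-off mentioned above: waiting for every poller can take a long time, so a real implementation has to decide what to do with children that are still running when the deadline passes.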

murrant · 19 April 2024 20:46

Looks like Adam Bishop wrote some code related to restarting the dispatcher.

What does this return?
python3 -c "import psutil"; echo $?
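
For reference, that one-liner prints 0 when the psutil module can be imported and a non-zero status otherwise. A rough Python equivalent of the same check, using a hypothetical helper name `can_import`, might look like this:

```python
import importlib


def can_import(name):
    """Return True if the named module imports cleanly, False otherwise."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    status = "available" if can_import("psutil") else "missing"
    print(f"psutil is {status}")
```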

ItsmeTelemetry · 19 April 2024 20:49

The command python3 -c "import psutil"; echo $? returns 0.