Zombie/defunct process issue appearing after OS upgrade of distributed pollers from CentOS 7.9 to RHEL 8.9

Looks like you would call self._stop_managers() instead of self._stop_managers_and_wait() inside restart()
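For anyone asking where this change would go: the suggestion would amount to something like the fragment below inside restart() in /opt/librenms/LibreNMS/service.py. The surrounding code is omitted and the method names are taken from the comment above; line numbers vary by version, so treat this as a hypothetical, untested sketch, not a patch.

```python
# Hypothetical fragment of restart() in LibreNMS/service.py -- untested sketch.
def restart(self):
    ...
    # suggested experiment: stop the managers without blocking on them,
    self._stop_managers()
    # instead of the original blocking call:
    # self._stop_managers_and_wait()
    ...
```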

That is about as far as I can go.

Where can I edit this?

Same for me as well:

[root@servername ~]# python3 -c "import psutil"; echo $?
0

@murrant, where exactly can we make the change you suggested, i.e. try using self._stop_managers() inside restart()?

I have no suggestions, I just provided as much analysis as I can provide.
I am not seeing this behavior, so I’m of limited help.

Very happy for anyone to try things to see if they can solve the issue.

hi @murrant ,

I am just asking: if you are suggesting we try this, in which file or script exactly? Or if you can share the link to Adam Bishop's dispatcher code, that would be great.

Regards
Vatansha

I have applied the changes to the file /opt/librenms/LibreNMS/service.py and still have zombie processes

I had the same issue on Rocky 8.9. I don’t fully recall how I got to the fix, but here is what I remember from my debugging session.

If you are running more than one poller, there is an incompatible update between 24.3.0 and 24.4.1 because of the DB changes to the device_perf table. When the first poller updates to 24.4.0, it deletes the device_perf table. All the other pollers (still on the old version) then start creating zombie processes, because they try to access this table during each device poll and the failed polling processes are not cleaned up properly. Because of this, the librenms service runs out of file handles, so the daily upgrade process cannot execute on the pollers which are still on 24.4.0.
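The "not cleaned up properly" part is the key mechanism here. A minimal, self-contained demonstration (not LibreNMS code) of how a defunct process appears, assuming Linux and the /proc filesystem:

```python
#!/usr/bin/env python3
# Minimal demonstration (not LibreNMS code) of how a "defunct" process
# appears: a child exits, but its parent never calls wait() to reap it.
import os
import time

def proc_state(pid):
    """Return the one-letter state field from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        # the state is the first field after the parenthesised command name
        return f.read().rpartition(")")[2].split()[0]

if __name__ == "__main__":
    pid = os.fork()
    if pid == 0:
        os._exit(1)            # child exits at once (like a failed poll)
    time.sleep(0.2)            # give the child time to terminate
    print("before wait:", proc_state(pid))   # prints 'Z' (zombie/defunct)
    os.waitpid(pid, 0)         # reaping the child removes the zombie
```

Until the parent calls waitpid(), the kernel keeps the exit status around and the child shows up as `[php] <defunct>` in ps.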

The fix is to run daily.sh on the pollers which are still on 24.4.0 and restart the librenms process.

Hope it helps.

I have been seeing this issue since February, not just since the latest update. Still, for troubleshooting purposes (or in case I have misunderstood), below is the ./validate.php output from one of the pollers and from my front-end server (which does not perform polling):

[librenms@Front-endServer ~]$ ./validate.php

Component Version
LibreNMS 24.4.1 (2024-04-20T16:26:51+02:00)
DB Schema 2024_04_10_093513_remove_device_perf (291)
PHP 8.1.27
Python 3.6.8
Database MariaDB 10.3.39-MariaDB-log
RRDTool 1.7.2
SNMP 5.8

Poller Server

[librenms@PollerServer ~]$ ./validate.php

Component Version
LibreNMS 24.4.1 (2024-04-20T16:26:51+02:00)
DB Schema 2024_04_10_093513_remove_device_perf (291)
PHP 8.1.28
Python 3.6.8
Database MariaDB 10.3.39-MariaDB-log
RRDTool 1.7.0
SNMP 5.8

Also we are on RHEL 8.9 (Ootpa)

@ItsmeTelemetry where exactly have you tried making the recommended changes? I can also try the same.

Hi @Vatansha I have edited this file

/opt/librenms/LibreNMS/service.py (line 724, I think)
then a

  • systemctl daemon-reload
  • systemctl restart librenms.service

Even after the changes, I was still seeing zombie processes. I am currently testing a small piece of code to close the child processes for the PID, but I need to confirm whether this creates gaps in the graphs.
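For reference, a hedged sketch of the kind of cleanup code meant here (this is not the actual LibreNMS fix): a non-blocking reap loop that a daemon can run periodically, or from a SIGCHLD handler, so finished workers never linger as defunct processes.

```python
#!/usr/bin/env python3
# Hedged sketch (not the actual LibreNMS fix): reap any already-exited
# children without blocking, so finished workers never stay defunct.
import os
import time

def reap_children():
    """Collect exit statuses of finished children; return the reaped PIDs."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break            # this process has no children at all
        if pid == 0:
            break            # children exist, but none have exited yet
        reaped.append(pid)
    return reaped

if __name__ == "__main__":
    kids = []
    for _ in range(3):
        pid = os.fork()
        if pid == 0:
            os._exit(0)      # each child exits immediately
        kids.append(pid)
    time.sleep(0.2)          # let the children terminate
    print("reaped all:", sorted(reap_children()) == sorted(kids))  # prints: reaped all: True
```

Because of os.WNOHANG the loop never blocks: it reaps whatever has already exited and returns immediately, which makes it safe to call from a main loop without delaying polling.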

How many of you having issues have mismatched versions in your cluster?
Having mismatched versions is explicitly not supported.

I don’t have this issue in my environment. This was the issue with @Erik-Lamers as he mentioned. All my pollers are on the same LibreNMS and DB Schema version.

@ItsmeTelemetry - I was going through the comments again. The service runs from librenms-service.py and not from service.py, so you could try making the changes there instead and test again.

librenms.service - LibreNMS SNMP Poller Service
Loaded: loaded (/etc/systemd/system/librenms.service; enabled; vendor preset>
Active: active (running) since Mon 2024-05-06 16:59:54 CEST; 16h ago
Main PID: 3910349 (python3)
Tasks: 68 (limit: 824719)
Memory: 42.5M
CGroup: /system.slice/librenms.service
└─3910349 /usr/bin/python3 /opt/librenms/librenms-service.py -v

I will await your response; I will also try to append this code and check whether it makes a difference.

Regards
Vatansha

Hi All,

Just wanted to add my 2 cents here: I also have the same issue post OS/PHP upgrade. Quite annoying. Single instance, not distributed.

Ubuntu 22.04.4 LTS (GNU/Linux 5.15.0-101-generic x86_64)

Version 24.4.1-33-g07afbe8b7 - Tue May 07 2024 12:56:24 GMT+1200
Database Schema 2024_04_22_161711_custom_maps_add_group (292)
Web Server nginx/1.18.0
PHP 8.1.2-1ubuntu2.17
Python 3.10.12
Database MariaDB 10.11.5-MariaDB-1:10.11.5+maria~ubu1804
Laravel 10.46.0
RRDtool 1.7.2

Not sure how to go about fixing this! It also seems to occur at midnight, leaving zombie pollers. I run 16 workers and it always seems to be around 16 zombies, sometimes slightly more or less, every night.


This code is not in my librenms-service.py. Is this where it should be?

I am also confused: when I run systemctl status librenms, it is loaded as /opt/librenms/librenms-service.py, but I found the restart code in service.py and not in librenms-service.py. The PIDs of the zombies are related to librenms, so can someone confirm?

]# systemctl status librenms
● librenms.service - LibreNMS SNMP Poller Service
Loaded: loaded (/etc/systemd/system/librenms.service; enabled; vendor preset>
Active: active (running) since Tue 2024-04-30 10:21:49 CEST; 1 weeks 1 days >
Main PID: 1398724 (python3)
Tasks: 1124 (limit: 102216)
Memory: 5.1G
CGroup: /system.slice/librenms.service
├─1398724 /usr/bin/python3 /opt/librenms/librenms-service.py -v
├─3558063 php /opt/librenms/discovery.php -h 6927
├─3558481 php /opt/librenms/discovery.php -h 6936

ps -ef | grep -i defunct
librenms 70708 1398724 0 03:58 ? 00:00:06 [php]
librenms 134964 1398724 0 May01 ? 00:00:02 [php]
librenms 135606 1398724 0 May01 ? 00:00:02 [php]
librenms 137319 1398724 0 May01 ? 00:00:02 [php]
librenms 137324 1398724 0 May01 ? 00:00:01 [php]
librenms 137363 1398724 0 May01 ? 00:00:01 [php]
librenms 140048 1398724 0 May01 ? 00:00:02 [php]
librenms 142527 1398724 0 May01 ? 00:00:01 [php]
librenms 144334 1398724 0 May01 ? 00:00:01 [php]
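As an alternative to grepping ps output, the following sketch lists defunct processes and their parent PIDs directly from /proc (Linux only). This is handy for confirming which parent, e.g. the librenms-service.py main PID 1398724 above, is failing to reap its children:

```python
#!/usr/bin/env python3
# List defunct (zombie) processes and their parent PIDs by scanning /proc.
import os

def find_zombies():
    """Return (pid, ppid) for every process currently in state 'Z'."""
    zombies = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as f:
                # fields after the parenthesised comm name: state, ppid, ...
                rest = f.read().rpartition(")")[2].split()
        except (FileNotFoundError, ProcessLookupError):
            continue             # process exited while we were scanning
        if rest[0] == "Z":
            zombies.append((int(pid), int(rest[1])))
    return zombies

if __name__ == "__main__":
    for pid, ppid in find_zombies():
        print(f"zombie pid={pid} ppid={ppid}")
```

If every reported ppid is the service's main PID, that points at the dispatcher (and not, say, cron) as the parent that is not waiting on its children.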

I don't have a service.py? What path is this?

Found it; for those that don't know: /opt/librenms/LibreNMS/…