Zombie Process Issues

Hello,

I need some help debugging a zombie process issue with my LibreNMS install. Everything was set up following the install docs. I run the dispatcher service and rrdcached, with 5-minute polling for approximately 60 devices and plans to add more. I am on the monthly update cycle for stability reasons, and the install is also integrated with smokeping.

Currently I am seeing 18 zombie processes under the librenms user. They go away when I restart the librenms service, only to reappear after a few hours, it seems. When I run the top command, it shows the following:

[Screenshot of top output, 2024-02-12 at 09.30.21]

Output of ps command displays the following: https://pastebin.com/sQ01jRr6

My setup is as follows:

Server: Dedicated VM instance on Linode with 4 CPU cores and 8GB RAM
OS: Ubuntu 22.04 LTS
Web Server: nginx 1.18.0
PHP-FPM: 8.1.2
MariaDB: 10.6.16
Python: 3.10.12
rrdtool: 1.7.2

validate.php and daily.sh scripts return no errors: https://pastebin.com/R9L0xshC

I checked all log files (librenms.log, daily.log, maintenance.log, php8.1-fpm.log, nginx error logs). All look good, with no reported errors.

Here are my conf files for php-fpm and nginx: https://pastebin.com/jMsKN4Y6

I hope I have provided enough info here for someone to assist. If there is something more that you require, please let me know and I’ll post it. I’ve spent almost 2 weeks trying to debug this with no success. A Google search on the issue doesn’t return much except one other article on defunct processes, which suggests it’s an issue with PHP that the developers won’t fix. Another user posted a possible solution, but I am not sure that I understand it or whether it pertains to my issue.

The server seems to run normally and I am not getting any gateway timeouts. Devices appear to poll normally and graphs are displaying.

Thanks in advance for the help.

Jaysen

Do they build up if you leave them or do they die off after a bit?

They build up over time and never seem to die off. I am now showing 36 zombies. They only go away when I issue the systemctl restart librenms command, but then they start building up again. Not sure why this keeps happening.

[Screenshot, 2024-02-13 at 15.28.33]
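
One way to narrow this down is to group the zombies by parent PID. A zombie only lingers while its parent is still alive and has not called wait() on it, so the parent is the process to look at. The fact that a service restart clears them is consistent with the parent living inside the librenms service, though that is an inference and not something confirmed in this thread. Below is a minimal sketch, assuming Linux with /proc mounted; the function names are made up for illustration and are not part of LibreNMS.

```python
#!/usr/bin/env python3
"""List zombie (state Z) processes grouped by parent PID, using Linux /proc."""
import glob
from collections import defaultdict


def parse_stat(text):
    """Return (pid, comm, state, ppid) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_idx = text.index("(")
    close_idx = text.rindex(")")
    pid = int(text[:open_idx])
    comm = text[open_idx + 1:close_idx]
    rest = text[close_idx + 1:].split()
    return pid, comm, rest[0], int(rest[1])


def zombies_by_parent(stat_texts):
    """Group (pid, comm) of every zombie by its parent PID."""
    grouped = defaultdict(list)
    for text in stat_texts:
        pid, comm, state, ppid = parse_stat(text)
        if state == "Z":
            grouped[ppid].append((pid, comm))
    return dict(grouped)


def read_all_stats():
    """Yield the stat contents of every process currently visible in /proc."""
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as handle:
                yield handle.read()
        except OSError:
            # The process exited between the glob and the read.
            continue


if __name__ == "__main__":
    for ppid, children in sorted(zombies_by_parent(read_all_stats()).items()):
        print(f"parent {ppid}: {len(children)} zombie(s)")
        for pid, comm in children:
            print(f"  {pid} {comm}")
```

Running it a few times between restarts should show whether the same parent keeps collecting zombies, which can then be looked up with ps.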

I also need some help with zombie processes.
My scenario is a little different: we upgraded the OS on the pollers from CentOS 7.9 to RHEL 8.9, and we have been seeing this issue since then. Poller status, librenms service status, librenms.log and daily.log all seem normal.
I ran the validate.php and daily.sh scripts as well, and those are OK too.

When I restart the librenms service, the number of defunct processes drops to zero and then starts building up again.

bash-4.2$ ./daily.sh
Fetching new release information OK
Updating to latest release OK
Updating Composer packages OK
Updating SQL-Schema OK
Updating submodules OK
Cleaning up DB OK
Fetching notifications OK
Caching PeeringDB data OK

bash-4.2$ ./validate.php

Component Version
LibreNMS 24.1.0 (2024-01-07T16:49:52+01:00)
DB Schema 2023_12_15_105529_access_points_nummonbssid_integer (276)
PHP 8.1.27
Python 3.6.12
Database MariaDB 10.2.33-MariaDB-log
RRDTool 1.7.1
SNMP 5.7.2

[OK] Composer Version: 2.7.1
[OK] Dependencies up-to-date.
[OK] Database connection successful
[OK] Database Schema is current
[OK] SQL Server meets minimum requirements
[OK] lower_case_table_names is enabled
[OK] MySQL engine is optimal
[OK] Database and column collations are correct
[OK] Database schema correct
[OK] MySQl and PHP time match
[OK] Distributed Polling setting is enabled globally
[OK] Connected to rrdcached
[OK] Active pollers found
[OK] Dispatcher Service is enabled
[OK] Locks are functional
[OK] Python wrapper cron entry is not present
[OK] Redis is functional
[OK] rrdtool version ok
[OK] Connected to rrdcached

Can someone help?

@jaysen
Check for PHP messages in /var/log/messages.

In my case, I think these are the processes that get created when discovery.php is running. Usually they get reaped as well, but since the OS migration they are getting stuck.

So you can check that once; it might help you find the root cause.

I don’t have a solution to this yet but still troubleshooting.
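
For anyone unsure what "reaped" means here, the sketch below reproduces the mechanism in isolation: a child that has exited stays in state Z until its parent collects the exit status with a wait call. This is only a generic Linux illustration (it reads /proc and uses os.fork); it is not taken from LibreNMS or the dispatcher service, and the function names are made up.

```python
import os


def process_state(pid):
    """Return the one-letter state from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as handle:
        text = handle.read()
    # The state field is the first token after the closing ')' of comm.
    return text[text.rindex(")") + 1:].split()[0]


def demo():
    """Create a zombie, observe it, then reap it.

    Returns (state_before_reaping, still_listed_after_reaping).
    """
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately

    # Block until the child has exited, but leave it unreaped (WNOWAIT).
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    state_before = process_state(pid)

    # Reaping: collecting the exit status removes the process table entry.
    os.waitpid(pid, 0)
    still_listed = os.path.exists(f"/proc/{pid}")
    return state_before, still_listed


if __name__ == "__main__":
    state_before, still_listed = demo()
    print("state before reaping:", state_before)
    print("still listed after reaping:", still_listed)
```

If a parent starts children and never reaches the equivalent of the waitpid call, the entries pile up in exactly the way described in this thread.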

I’ll double check those logs and see if I notice anything. Thanks

I double checked the logs, including the messages log. Nothing concerning in there. I also checked my php8.1-fpm.log file and found this entry:

[16-Feb-2024 09:07:01] NOTICE: [pool librenms] child 2029310 exited with code 0 after 41.127786 seconds from start

I searched Google for this and I am seeing posts that suggest a PHP bug, but they don’t mention anything about zombie processes, so I am not sure.
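
For reference, exit code 0 in that NOTICE line is a clean exit. To check whether any php-fpm child in the log exited with a non-zero code, something like the sketch below could be used. The regular expression is written only against the single line quoted above, so it is an assumption that other entries follow the same shape; other php-fpm message forms are simply ignored, and the function names are made up.

```python
import re

# Matches only the "exited with code" form quoted above; any other
# php-fpm log message is ignored rather than guessed at.
CHILD_EXIT = re.compile(
    r"^\[(?P<when>[^\]]+)\] (?P<level>\w+): \[pool (?P<pool>[^\]]+)\] "
    r"child (?P<pid>\d+) exited with code (?P<code>\d+) "
    r"after (?P<seconds>[\d.]+) seconds from start$"
)


def parse_child_exit(line):
    """Return a dict for a child-exit line, or None if the line differs."""
    match = CHILD_EXIT.match(line.strip())
    if match is None:
        return None
    return {
        "when": match.group("when"),
        "level": match.group("level"),
        "pool": match.group("pool"),
        "pid": int(match.group("pid")),
        "code": int(match.group("code")),
        "seconds": float(match.group("seconds")),
    }


def abnormal_exits(lines):
    """Return parsed entries whose exit code is non-zero."""
    parsed = (parse_child_exit(line) for line in lines)
    return [entry for entry in parsed if entry and entry["code"] != 0]


if __name__ == "__main__":
    import sys

    # Usage: python3 this_script.py /path/to/php8.1-fpm.log
    with open(sys.argv[1]) as handle:
        for entry in abnormal_exits(handle):
            print(entry)
```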

Still having the issue. How about you? Any logs or errors related to Composer dependencies?

Regards
Vatansha

Yes, still having the issue. Nothing in logs regarding composer. I was thinking of trying to disable the dispatcher and going back to using the cron scripts to see if that makes a difference. Not sure if it will but worth a try. I don’t recall ever having this issue when using the cron scripts on previous LibreNMS installations. This is my first time running it with a dispatcher service so something tells me it could be related but not too sure. I’ll see what happens and report my findings here shortly.

Thanks,

Jaysen

I have a fairly similar setup to you, but I am still using cron and not the “dispatcher” service… no zombie processes over here.

After switching back to cron and removing the dispatcher service, the issue appears to be resolved. I haven’t seen any zombies since I made the changes.

I guess the question now should be why is the dispatcher service causing so many zombies? But that’s likely an issue for another thread.

Jaysen