I need some help with debugging a zombie process issue with my LibreNMS install. Everything was setup using the install docs. I run the dispatcher service and rrdcached. Using 5 minute polling for approximately 60 devices with plans to add more. On Monthly update cycle for stability reasons. Also integrated with smokeping.
Currently I am showing 18 zombie processes under the librenms user. They go away when I restart the librenms service only to reappear after a few hours it seems. When I run the top command, it shows the following:
Server: Dedicated VM instance on Linode with 4 CPU cores and 8GB RAM
OS: Ubuntu 22.04 LTS
Web Server: nginx 1.18.0
PHP-FPM: 8.1.2
Mariadb: 10.6.16
Python 3.10.12
rrdtool: 1.7.2
I hope I have provided enough info here for someone to assist. If there is something more that you require, please let me know and I’ll post it. I’ve spent almost 2 weeks trying to debug this with no success. A google search on the issue doesn’t return much except one other article on defunct processes that suggests it’s an issue with PHP that the developers won’t fix. Another user posted a possible solution but not quite sure that I understand it or if it pertains to my issue.
The server seems to run normally and I am not getting any gateway timeouts. Devices appear to poll normally and graphs are displaying.
They build up over time but never seem to die off. Now showing 36 zombies. They only go away when I issue the systemctl restart librenms command but then they will start building up again. Not sure why this keeps happening.
Even i need some help with the zombie processes.
Scenario is little different here , we have upgraded the OS version of the pollers from Centos 7.9 to RHEL 8.9 and post this we are seeing this issue. Poller status , librenms service status , librenms.log, daily.log everything seems normal.
I ran validate.php and daily.sh script as well, even that is ok.
When i restart the librenms service , defunct process goes down to zero and then again starts building up.
bash-4.2$ ./daily.sh
Fetching new release information OK
Updating to latest release OK
Updating Composer packages OK
Updating SQL-Schema OK
Updating submodules OK
Cleaning up DB OK
Fetching notifications OK
Caching PeeringDB data OK
[OK] Composer Version: 2.7.1
[OK] Dependencies up-to-date.
[OK] Database connection successful
[OK] Database Schema is current
[OK] SQL Server meets minimum requirements
[OK] lower_case_table_names is enabled
[OK] MySQL engine is optimal
[OK] Database and column collations are correct
[OK] Database schema correct
[OK] MySQl and PHP time match
[OK] Distributed Polling setting is enabled globally
[OK] Connected to rrdcached
[OK] Active pollers found
[OK] Dispatcher Service is enabled
[OK] Locks are functional
[OK] Python wrapper cron entry is not present
[OK] Redis is functional
[OK] rrdtool version ok
[OK] Connected to rrdcached
@jaysen
check for php messages in /var/log/messages
As in my case i think these are the process getting created when discovery.php is running. Usually they get reaped as well but now getting stuck post OS migration in my case.
So you can check once, maybe that might help you in finding the rootcause.
I don’t have a solution to this yet but still troubleshooting.
Yes, still having the issue. Nothing in logs regarding composer. I was thinking of trying to disable the dispatcher and going back to using the cron scripts to see if that makes a difference. Not sure if it will but worth a try. I don’t recall ever having this issue when using the cron scripts on previous LibreNMS installations. This is my first time running it with a dispatcher service so something tells me it could be related but not too sure. I’ll see what happens and report my findings here shortly.