TL;DR - rrdcached is reporting “too many open files” and I’m not sure what else to look at.
We are running a distributed poller system with 3 pollers for ~2400 devices. Our database server has MySQL, rrdcached, memcached, and Redis running on it. I plan to move rrdcached and the RRD files to their own server soon.
I am starting to see some devices fail to finish polling within the 60-second poll period, so I have added more polling workers to the librenms dispatcher job. Now we are getting gaps in graphing and our rrdcached service is spitting out “too many open files” errors.
Jan 28 08:52:59 lnms-db rrdcached[975]: queue_thread_main: rrd_update_r (/opt/librenms/rrd/x.x.x.x/port-id12990.rrd) failed with status -1. (opening '/opt/librenms/rrd/x.x.x.x/port-id12990.rrd': Too many open files)
Jan 28 08:52:59 lnms-db rrdcached[975]: queue_thread_main: rrd_update_r (/opt/librenms/rrd/x.x.x.x/port-id12987.rrd) failed with status -1. (opening '/opt/librenms/rrd/x.x.x.x/port-id12987.rrd': Too many open files)
Jan 28 08:52:59 lnms-db rrdcached[975]: queue_thread_main: rrd_update_r (/opt/librenms/rrd/x.x.x.x/port-id12991.rrd) failed with status -1. (opening '/opt/librenms/rrd/x.x.x.x/port-id12991.rrd': Too many open files)
Jan 28 08:52:59 lnms-db rrdcached[975]: queue_thread_main: rrd_update_r (/opt/librenms/rrd/x.x.x.x/port-id12993.rrd) failed with status -1. (opening '/opt/librenms/rrd/x.x.x.x/port-id12993.rrd': Too many open files)
Jan 28 08:52:59 lnms-db rrdcached[975]: queue_thread_main: rrd_update_r (/opt/librenms/rrd/x.x.x.x/port-id12995.rrd) failed with status -1. (opening '/opt/librenms/rrd/x.x.x.x/port-id12995.rrd': Too many open files)
I have not dealt with open file errors on Linux systems before. A quick Google search led me to check the system-wide limit (sysctl fs.file-max) and the per-user hard limit (ulimit -Hn). I have raised the user limit from 1024 to 1048576, but when I check open files while the errors are being logged, I’m only seeing ~1024 open files from the user.
librenms@lnms-db:~$ ulimit -Sn
1048576
librenms@lnms-db:~$ ulimit -Hn
1048576
$ sudo lsof -u librenms | wc -l
1073
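One further check, sketched here as an assumption rather than something already tried: the ulimit values above describe the librenms login shell, not necessarily the running rrdcached daemon, so it may be worth reading the limit the kernel reports for the process itself. The helper names nofile_limits and open_fd_count below are made up for illustration; the /proc paths are standard on Linux, and PID 975 is the one shown in the syslog lines above.

```shell
#!/bin/sh
# Print the soft and hard "Max open files" limits from a /proc/<pid>/limits file.
# With no argument it reads /proc/self/limits (the limits this shell passes on).
nofile_limits() {
    limits_file="${1:-/proc/self/limits}"
    # The matching line looks like: "Max open files  <soft>  <hard>  files"
    awk '/^Max open files/ { print "soft=" $4, "hard=" $5 }' "$limits_file"
}

# Count the file descriptors a process currently holds open.
# Unlike `lsof -u USER | wc -l`, this counts only real descriptors
# (no header line, no cwd/txt/mem entries) and only for one process.
open_fd_count() {
    ls "/proc/$1/fd" | wc -l
}

# Example usage (reading another user's /proc entries usually needs root):
#   sudo sh -c '. ./this_script.sh; nofile_limits /proc/975/limits; open_fd_count 975'
```

If the soft value reported for the daemon is still 1024, the raised ulimit never reached rrdcached. That can happen when the service is started by the init system instead of from a librenms shell; under systemd, and assuming rrdcached runs as a systemd unit here, the per-service limit is set with the LimitNOFILE directive rather than with ulimit.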
Does anyone have any ideas on what else to change? Is this something more specific with rrdcached and not the OS?