RRDcached - Too many open files

TL;DR - rrdcached is reporting “too many open files” and I’m not sure what else to look at.


We are running a distributed poller system with 3 pollers for ~2400 devices. Our database server runs MySQL, rrdcached, memcached, and redis. I plan to move rrdcached and the RRD files to their own server soon.

Some devices have started failing to finish polling within the 60-second poll period, so I added more polling workers to the librenms dispatcher job. Now we are getting gaps in graphing and our rrdcached service is spitting out “too many open files” errors.

Jan 28 08:52:59 lnms-db rrdcached[975]: queue_thread_main: rrd_update_r (/opt/librenms/rrd/x.x.x.x/port-id12990.rrd) failed with status -1. (opening '/opt/librenms/rrd/x.x.x.x/port-id12990.rrd': Too many open files)
Jan 28 08:52:59 lnms-db rrdcached[975]: queue_thread_main: rrd_update_r (/opt/librenms/rrd/x.x.x.x/port-id12987.rrd) failed with status -1. (opening '/opt/librenms/rrd/x.x.x.x/port-id12987.rrd': Too many open files)
Jan 28 08:52:59 lnms-db rrdcached[975]: queue_thread_main: rrd_update_r (/opt/librenms/rrd/x.x.x.x/port-id12991.rrd) failed with status -1. (opening '/opt/librenms/rrd/x.x.x.x/port-id12991.rrd': Too many open files)
Jan 28 08:52:59 lnms-db rrdcached[975]: queue_thread_main: rrd_update_r (/opt/librenms/rrd/x.x.x.x/port-id12993.rrd) failed with status -1. (opening '/opt/librenms/rrd/x.x.x.x/port-id12993.rrd': Too many open files)
Jan 28 08:52:59 lnms-db rrdcached[975]: queue_thread_main: rrd_update_r (/opt/librenms/rrd/x.x.x.x/port-id12995.rrd) failed with status -1. (opening '/opt/librenms/rrd/x.x.x.x/port-id12995.rrd': Too many open files)

I have not dealt with open file errors on Linux systems before. A quick Google search led me to check the system-wide limit (sysctl fs.file-max) and the per-user hard limit (ulimit -Hn). I raised the user limit from 1024 to 1048576, but when I check open files while the errors are being logged, I only see ~1024 open files for the user.

librenms@lnms-db:~$ ulimit -Sn
1048576
librenms@lnms-db:~$ ulimit -Hn
1048576

$ sudo lsof -u librenms | wc -l
1073

Does anyone have any ideas on what else to change? Is this something more specific with rrdcached and not the OS?

It turns out the limit that matters here is per process. The ulimit values above are the limits of my librenms login shell, not those of the running rrdcached daemon.

Using prlimit to see how many files the rrdcached process can have open, I found this:

$ sudo prlimit -p $(pgrep rrdcached) -n
RESOURCE DESCRIPTION              SOFT HARD UNITS
NOFILE   max number of open files 1024 4096 files
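
For anyone who wants to confirm that the daemon is really sitting at its soft limit, the same numbers can be read straight from /proc. This is only a minimal sketch, assuming a Linux /proc filesystem; the function names fd_count and fd_soft_limit are made up for illustration, and the demo line inspects the current shell rather than rrdcached.

```shell
#!/bin/sh
# Count the file descriptors a process currently has open.
fd_count() {
    ls "/proc/$1/fd" | wc -l
}

# Print the soft "Max open files" limit from /proc/<pid>/limits.
# Fields on that line: Max open files <soft> <hard> files
fd_soft_limit() {
    awk '/^Max open files/ {print $4}' "/proc/$1/limits"
}

# Demo against the current shell ($$).
echo "open: $(fd_count $$)  soft limit: $(fd_soft_limit $$)"
```

To look at rrdcached instead, pass its PID (for example from pgrep rrdcached) and run the functions as root, since /proc/<pid>/fd of another user's process is not readable otherwise. A count that sits right at the soft limit while the errors are being logged would match the 1024 shown by prlimit.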

I could use prlimit to change the nofile limit for that process, but if I restarted rrdcached, the new process would start with the default limit of 1024 again. I tried finding a way to set the global limits for all processes, but didn’t find much on that. Instead, I altered the init file at /etc/init.d/rrdcached (Ubuntu 18) to run prlimit after the service is started.

Before:

do_start () {
    start_daemon -p ${PIDFILE} ${DAEMON} ${RRDCACHED_OPTIONS}
    return $?
}

After:

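do_start () {
    start_daemon -p ${PIDFILE} ${DAEMON} ${RRDCACHED_OPTIONS}
    rv=$?
    # Look up the PID of the freshly started daemon.
    PID=$( pidofproc -p ${PIDFILE} ${DAEMON} )
    # Raise its open-file limit (soft and hard) to 4096; skip if no PID was found.
    [ -n "$PID" ] && prlimit --pid "$PID" --nofile=4096
    return $rv
}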

I’m not a fan of this change, as I think an update to rrdcached from apt/dpkg would undo it. Hopefully this helps anyone else with a similar problem, and if anyone has any other suggestions, please let me know!
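
One alternative that should survive package updates, assuming the service is started through systemd under the unit name rrdcached.service (I have not verified this on this box), is a drop-in file that sets LimitNOFILE for the unit. The sketch below is untested against a live rrdcached; the helper name write_nofile_dropin and the file name limits.conf are my own choices, and the example writes into a scratch directory so the result can be inspected before touching /etc.

```shell
#!/bin/sh
# Write a systemd drop-in that raises the open-file limit for one unit.
# Usage: write_nofile_dropin <unit> <limit> [base_dir]
# base_dir defaults to /etc/systemd/system (needs root).
write_nofile_dropin() {
    unit=$1
    limit=$2
    base=${3:-/etc/systemd/system}
    dir="$base/$unit.d"
    mkdir -p "$dir" || return 1
    printf '[Service]\nLimitNOFILE=%s\n' "$limit" > "$dir/limits.conf"
}

# Dry run into a scratch directory and show the generated file.
scratch=$(mktemp -d)
write_nofile_dropin rrdcached.service 4096 "$scratch"
cat "$scratch/rrdcached.service.d/limits.conf"
```

To apply it for real, the function would be run as root without the third argument, followed by systemctl daemon-reload and systemctl restart rrdcached. The prlimit command from earlier can then be used to check whether the new limit took effect.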