Team,
To start, our rrdcached config is pretty basic. I looked at the write_timeout and threads settings, but I am not sure whether changing those will help our issue (or maybe make it worse).
I am currently polling 3000 devices (this will grow to 20,000). Polling is going fine, taking about 150 seconds. The issue is that since last week rrdcached has started crashing, and the Unix team and I have not been able to find the cause, even after making the logging more verbose. We changed the log level from the default ERR to WARNING (no help) and just now to INFO.
We see the following in /var/log/messages. It looks like rrdcached has a problem reading or updating RRD files and then crashes:
rrdcached[90536] : handle_request_update: Could not read RRD file.
kernel: [3120158.346744] rrdcached[90730]: segfault at 0 ip 00007f11b0c1042b sp 00007f11a3ffec10 error 4 in libc-2.27.so[7f11b0ade000+1e7000]
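In case it helps anyone reading the kernel line above, here is a minimal sketch (not part of our setup) that decodes it: it subtracts the libc load address from the instruction pointer to get the faulting offset inside libc, and splits the x86 page-fault error code into its low three bits. The function name decode_segfault and the dictionary keys are made up for illustration; only the log line itself comes from our system.

```python
import re

# Kernel segfault line copied verbatim from /var/log/messages above.
LINE = ("kernel: [3120158.346744] rrdcached[90730]: segfault at 0 "
        "ip 00007f11b0c1042b sp 00007f11a3ffec10 error 4 "
        "in libc-2.27.so[7f11b0ade000+1e7000]")

PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

def decode_segfault(line):
    """Hypothetical helper: pull the useful fields out of one segfault line."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a kernel segfault line")
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    err = int(m["err"])
    return {
        "object": m["obj"],                # mapped file containing the faulting code
        "fault_addr": int(m["addr"], 16),  # address the process tried to access
        "offset": ip - base,               # instruction offset inside that mapping
        "page_present": bool(err & 1),     # bit 0: 0 = page not mapped
        "write": bool(err & 2),            # bit 1: 0 = read, 1 = write
        "user_mode": bool(err & 4),        # bit 2: 1 = fault raised in user mode
    }

info = decode_segfault(LINE)
print(info["object"], hex(info["offset"]), info)
```

For our line this gives a user-mode read of address 0 (a NULL pointer dereference) at offset 0x13242b inside libc-2.27.so. If debug symbols for that exact libc build are available, that offset could probably be handed to a tool such as addr2line to name the function, but we have not tried that yet.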
We have looked at the file limits and increased them, but the issue persists.
I did some research but couldn't come up with much. I found some bug reports, but they were several years old and appeared to be about something else.
Any ideas or help would be greatly appreciated. If any additional information is needed, please let me know and I will be happy to provide it.
Ben