I have set up LibreNMS with the recommended rrdcached because the instance was slow; the alert list in particular is very slow.
But with rrdcached activated it was a lot slower than before.
Maybe there is less IO, but it seems that the CPU is used massively more.
It was unusable with rrdcached activated, so I had to deactivate it again.
Has anyone else noticed similar behavior?
(I’ve asked in IRC, but lost the log on restart, so I’m sorry for asking again.)
It would be nice if I could get help, or hints on where to start searching.
The server is a virtual machine with the following specs:
16 vCores
16 GB RAM
Anything else relevant?
I’m monitoring 75 devices.
The overall utilization is not really high.
I’m using the dashboard with some widgets:
Availability-map
Device summary vert
Top 5 devices with traffic
Top 5 devices with load
Top 5 interfaces
Unacknowledged alerts
External image with link
On dashboard reload I can see 2 rrdcached processes in htop, each with about 75% CPU utilization, for about 20 seconds. When they are done, the dashboard loads the widgets and the graphs and shows "loading" on the alert widget, and a few seconds later I get the alerts list.
I think this is not really normal for so few devices.
If I disable rrdcached, the dashboard loading time is about 10 seconds instead of 20-30.
I think both are too slow, but I don’t know where to start. And it is very strange that it gets slower with rrdcached.
I have checked the other suggestions and applied the ones that match my setup.
There is no load on the machine.
So there should be room for more performance.
While I’m waiting on graphs I can see 2 rrdcached processes with about 75% CPU usage each,
but not more. Is there a limit?
The machine has 16 cores available.
It is stuck while rrdcached is processing data.
iotop shows me that rrdcached uses nearly 100% of the IOPS.
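Besides htop and iotop, rrdcached can report its own counters through the STATS command on its control socket, which may help show whether updates are piling up in the queue. The following is only a rough sketch: the socket path `/run/rrdcached.sock` is an assumption (use whatever your daemon was started with via `-l`), and the helper names are made up for illustration.

```python
import socket


def parse_stats(reply: str) -> dict:
    """Parse a STATS reply: a status line 'N Statistics follow', then N 'Key: value' lines."""
    lines = reply.strip().splitlines()
    if not lines:
        raise ValueError("empty reply from rrdcached")
    status, _, message = lines[0].partition(" ")
    count = int(status)
    if count < 0:
        # A negative status code means the daemon reported an error.
        raise RuntimeError(f"rrdcached returned an error: {message}")
    stats = {}
    for line in lines[1:1 + count]:
        key, _, value = line.partition(":")
        stats[key.strip()] = int(value.strip())
    return stats


def query_stats(socket_path: str = "/run/rrdcached.sock") -> dict:
    """Send STATS to a running rrdcached. The default path is an assumption."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(b"STATS\n")
        with sock.makefile("r") as reader:
            first = reader.readline()
            count = int(first.split(" ", 1)[0])
            rest = [reader.readline() for _ in range(max(count, 0))]
    return parse_stats(first + "".join(rest))
```

Calling `query_stats()` a few times while the dashboard loads and comparing `QueueLength` and `UpdatesWritten` between calls should give a hint whether the daemon is busy writing queued updates or serving flushes for the graphs.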
Other things I found are from mysqltuner:
[!!] Joins performed without indexes: 173486
[!!] Temporary tables created on disk: 54% (21K on disk / 38K total)
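As far as I understand it, the temporary-table percentage comes from the MySQL status counters `Created_tmp_disk_tables` and `Created_tmp_tables` (visible with `SHOW GLOBAL STATUS`). A minimal sketch for tracking the ratio yourself after a tuning change; the function name is made up, and the numbers in the usage line are placeholders rather than values from this server:

```python
def disk_tmp_table_percent(created_tmp_disk_tables: int, created_tmp_tables: int) -> float:
    """Share of temporary tables that were created on disk, in percent."""
    if created_tmp_tables == 0:
        return 0.0  # no temporary tables created yet, so nothing went to disk
    return 100.0 * created_tmp_disk_tables / created_tmp_tables


# Placeholder counters for illustration only.
print(disk_tmp_table_percent(50, 200))
```

Because both counters accumulate since server start, comparing the difference between two readings is more telling than the absolute values.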
My setup is:
LibreNMS
memcached
rrdcached
mysql with innodb_flush_log_at_trx_commit = 0
Any hints on where to go next to get more performance?
Can you inspect which files are being accessed?
There’s a general idea (not of LibreNMS origin) of running the rrdcached journal directory in a ramdisk and flushing it to disk once an hour.
If your process takes too long for files in /var/tmp then move that to a ramdisk.
However, if your process takes too long for the real rrd files, then adjust the buffers.
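One rough way to see which files are being accessed is to look at the open file descriptors of the rrdcached process under `/proc` and count them per directory, which separates files in /var/tmp from real rrd files. This is only a sketch for Linux; the function names are made up, and since rrdcached may open an rrd file only briefly, a single snapshot can miss activity, so sample it several times.

```python
import os
from collections import Counter


def open_files(pid: int) -> list:
    """Return the file paths a process currently has open (Linux /proc only)."""
    fd_dir = f"/proc/{pid}/fd"
    paths = []
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
        if target.startswith("/"):  # skip sockets, pipes and anonymous inodes
            paths.append(target)
    return paths


def count_by_directory(paths) -> Counter:
    """Count open files per parent directory."""
    return Counter(os.path.dirname(p) for p in paths)
```

Usage would be something like `count_by_directory(open_files(pid)).most_common(10)` with the PID of rrdcached taken from htop; reading another user's process usually requires root.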
Being a SAN, is it FC, FCoE, or iSCSI?
What about the filesystem?
I also see behavior similar to @seti’s, running this on a VM platform connected to a SAN. RRDcached creates severe IO, so much so that the box is nearly unusable at times. System load averages were high as a result of the system waiting. In fact, I think this caused the gaps in the graphs I’ve been experiencing previously.
After RRDcached is turned off, atop and iotop show my disk going down from 110% to 2% and everything is back to being responsive.
I think that to run RRDcached it really needs to be on a standalone server with a local disk. Though according to the RRDcached website there is a strange mention of IO:
The daemon was written with big setups in mind. Those setups usually run into IO related problems sooner or later for reasons that are beyond the scope of this document.
I can provide some screenshots if anyone’s interested.
laf, I will look into your suggestion, and thanks for fixing my last problem!