Need help improving Web UI performance

Hello,

I’m having some problems with Web UI performance when loading some graphs.

Sometimes I get red graphs in the interface, like the screenshot below:

[screenshot: error-libre]

I’m trying to identify the main cause of this problem, which for now I think is related to rrdcached performance.

My rrdcached server is located on one of my pollers. The machine has 8 CPU cores and 28 GB of RAM and polls nearly 1000 devices in around 400 seconds (my polling interval is 600 seconds). When the polling process starts, the load jumps but CPU usage stays near 50%.

I’m running rrdcached on this poller so I can speed things up a little by using a Unix socket; the other pollers and the web server connect to the rrdcached server over a TCP port (I’m using multiple -l parameters in the service).
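For reference, a minimal sketch of what that kind of rrdcached invocation usually looks like; the socket path, base/journal directories and timers below are assumptions, and only the TCP port matches the one visible in the debug output later in this thread:

# -l: listen sockets (Unix socket for the local poller, TCP for the remote pollers and the web server)
# -b/-B: base directory containing the .rrd files, with paths restricted to that tree
# -j: journal directory so queued updates survive a crash
# -w/-z/-f: write timeout, update jitter and forced-flush interval, in seconds
rrdcached -l unix:/run/rrdcached/rrdcached.sock \
          -l 0.0.0.0:42218 \
          -b /opt/librenms/rrd -B \
          -j /var/lib/rrdcached/journal \
          -w 1800 -z 900 -f 3600 \
          -p /run/rrdcached/rrdcached.pid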

From what I understand so far, the pollers open a connection to the rrdcached server for each device being polled, and the web server opens a connection for each graph it needs to show, so the rrdcached server writes the updates to disk and I can see the most recently polled data.

I tried to configure the web server to read the RRD files directly from the shared folder, but that way I would not get the most recently polled data, since it could still be sitting in the memory of the rrdcached service.
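For context, that choice is driven by the rrdcached setting in LibreNMS’s config.php: when it is set, rrdtool is given --daemon and the daemon flushes a file before it is graphed; when it is unset, the web UI reads the .rrd files straight from disk and can miss values still cached in memory. A rough sketch, with placeholder paths and hostname:

$config['rrd_dir']   = '/opt/librenms/rrd';
// on the poller that runs the daemon locally:
$config['rrdcached'] = 'unix:/run/rrdcached/rrdcached.sock';
// on the web server and the remaining pollers:
// $config['rrdcached'] = 'rrdcached.example.com:42218';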

What would be the best approach to avoid those red graphs? Is it possible to run multiple instances of rrdcached? Since I’m using a cloud service I really need to squeeze the most out of the resources I already have to avoid increasing costs, but if there is no other way I can spin up another machine.

You should click on one of those graphs and then select “show rrd command” to see why the graph is failing to generate.

Your setup is definitely outside the scope of a normal setup, so you might not find anyone who can help with this specific configuration.

For polling, we open an rrdtool process ready to accept data, not an actual connection. I’ve played around with code that would allow a single connection to rrdcached and perform bulk updates, but I never finished it off.

Hello,

Well, if I click on one of those graphs it will probably load normally; if not, refreshing the page will force the graph to load. There is a bottleneck somewhere that I’m still trying to find and fix.

It looks like my setup is really big, and it will keep growing, since we are planning to double the number of monitored devices in the next few months.

I will try to scale up the web server/database machine and move rrdcached to it, to see if there is any improvement in performance.

We run 4000+ devices on ours with dedicated web, rrdcached and MySQL servers and 4 pollers, most running 12-core CPUs with 32 GB of RAM.

If you can get a page to show the RRD command output when it goes red, that will help you diagnose this.

Otherwise, make sure all boxes that run LibreNMS are themselves being monitored so you can see any issues, and go through the performance docs and tune as much as you can.
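Purely as an illustration of the kind of knobs involved (these values are not from the docs and would need to be sized for your hardware):

# MariaDB (my.cnf): give InnoDB a large buffer pool and relax log flushing
innodb_buffer_pool_size = 8G
innodb_flush_log_at_trx_commit = 0

# php-fpm pool: enough workers to render many graph requests in parallel
pm = dynamic
pm.max_children = 50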

What is the configuration of your rrdcached server? Is it also 12 cores and 32 GB?

E5-2620 0 @ 2.00GHz x24. Can’t see ram right now

When I click a red graph it opens another page and shows the graph without problems. Sometimes it shows the large graph but not the thumbnail of the same graph (the one-week and two-week thumbnails, for example).

The RRD output always appears as OK or blank; it never shows any error.

I’ve just created another machine only for rrdcached, with 16 cores and 32 GB, but I didn’t see any significant improvement.

The web server/database machine is running fine; I have no errors from nginx, MariaDB or php-fpm. rrdcached is also running fine, and for each graph the web UI needs to show, it opens a connection to the rrdcached server.

Looking at the Inspect panel in Google Chrome, I can see that the graphs that show up red take too long to load, more than 16 seconds, but I still haven’t been able to find where the bottleneck is. I have already applied the performance optimizations from the documentation.

What else should I look at?

You can right-click the red graph, copy the image URL and paste it in with &debug=true on the end.

Refresh that until it breaks and you will have the debug info back.
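For example, a copied thumbnail URL with the flag appended would look something like this (hostname and parameters are made up):

https://librenms.example.com/graph.php?device=123&type=device_processor&from=1516208400&to=1516294800&width=150&height=45&debug=true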

And what do I need to look for to identify the error?

I always get something like the two outputs below:

This one returns fast, and I get a line with the file permissions before the runtime.

graph /tmp/1KZYgqfdfUpiUVmU -g -l 0 -u 100 -E --start 1516208400 --end 1516294800 --width 150 --height 45 -c BACK#EEEEEE00 -c SHADEA#EEEEEE00 -c SHADEB#EEEEEE00 -c FONT#000000 -c CANVAS#FFFFFF00 -c GRID#a5a5a5 -c MGRID#FF9999 -c FRAME#5e5e5e -c ARROW#5e5e5e -R normal -c CANVAS#FFFFFF00 --only-graph --font LEGEND:7:DejaVuSansMono --font AXIS:6:DejaVuSansMono --font-render-mode normal COMMENT:' Size Used %age\l' DEF:2217used=s054vmhomelk/storage-hrstorage-_.rrd:used:AVERAGE DEF:2217free=s054vmhomelk/storage-hrstorage-_.rrd:free:AVERAGE CDEF:2217size=2217used,2217free,+ CDEF:2217perc=2217used,2217size,/,100,* LINE1.25:2217perc#CC0000:'/ ' GPRINT:2217size:LAST:%6.2lf%sB GPRINT:2217used:LAST:%6.2lf%sB GPRINT:2217perc:LAST:%5.2lf%%\l DEF:2218used=s054vmhomelk/storage-hrstorage-_dev_shm.rrd:used:AVERAGE DEF:2218free=s054vmhomelk/storage-hrstorage-_dev_shm.rrd:free:AVERAGE CDEF:2218size=2218used,2218free,+ CDEF:2218perc=2218used,2218size,/,100,* LINE1.25:2218perc#008C00:'/dev/shm ' GPRINT:2218size:LAST:%6.2lf%sB GPRINT:2218used:LAST:%6.2lf%sB GPRINT:2218perc:LAST:%5.2lf%%\l DEF:2219used=s054vmhomelk/storage-hrstorage-_run.rrd:used:AVERAGE DEF:2219free=s054vmhomelk/storage-hrstorage-_run.rrd:free:AVERAGE CDEF:2219size=2219used,2219free,+ CDEF:2219perc=2219used,2219size,/,100,* LINE1.25:2219perc#4096EE:'/run ' GPRINT:2219size:LAST:%6.2lf%sB GPRINT:2219used:LAST:%6.2lf%sB GPRINT:2219perc:LAST:%5.2lf%%\l DEF:2220used=s054vmhomelk/storage-hrstorage-_sys_fs_cgroup.rrd:used:AVERAGE DEF:2220free=s054vmhomelk/storage-hrstorage-_sys_fs_cgroup.rrd:free:AVERAGE CDEF:2220size=2220used,2220free,+ CDEF:2220perc=2220used,2220size,/,100,* LINE1.25:2220perc#73880A:'/sys/fs/cgro ' GPRINT:2220size:LAST:%6.2lf%sB GPRINT:2220used:LAST:%6.2lf%sB GPRINT:2220perc:LAST:%5.2lf%%\l DEF:2221used=s054vmhomelk/storage-hrstorage-_boot.rrd:used:AVERAGE DEF:2221free=s054vmhomelk/storage-hrstorage-_boot.rrd:free:AVERAGE CDEF:2221size=2221used,2221free,+ CDEF:2221perc=2221used,2221size,/,100,* LINE1.25:2221perc#D01F3C:'/boot ' GPRINT:2221size:LAST:%6.2lf%sB GPRINT:2221used:LAST:%6.2lf%sB GPRINT:2221perc:LAST:%5.2lf%%\l DEF:2222used=s054vmhomelk/storage-hrstorage-_var.rrd:used:AVERAGE DEF:2222free=s054vmhomelk/storage-hrstorage-_var.rrd:free:AVERAGE CDEF:2222size=2222used,2222free,+ CDEF:2222perc=2222used,2222size,/,100,* LINE1.25:2222perc#36393D:'/var ' GPRINT:2222size:LAST:%6.2lf%sB GPRINT:2222used:LAST:%6.2lf%sB GPRINT:2222perc:LAST:%5.2lf%%\l DEF:2223used=s054vmhomelk/storage-hrstorage-_opt.rrd:used:AVERAGE DEF:2223free=s054vmhomelk/storage-hrstorage-_opt.rrd:free:AVERAGE CDEF:2223size=2223used,2223free,+ CDEF:2223perc=2223used,2223size,/,100,* LINE1.25:2223perc#FF0084:'/opt ' GPRINT:2223size:LAST:%6.2lf%sB GPRINT:2223used:LAST:%6.2lf%sB GPRINT:2223perc:LAST:%5.2lf%%\l --daemon server.ip:42218

command returned (150x45 OK u:0.01 s:0.01 r:1.29 )

-rw-rw-r-- 1 nginx nginx 263 Jan 18 15:08 /tmp/1KZYgqfdfUpiUVmU graph
Runtime 1.343s
SNMP [0/0.00s]: Get[0/0.00s] Getnext[0/0.00s] Walk[0/0.00s] MySQL [12/0.00s]: Cell[0/0.00s] Row[2/0.00s] Rows[8/0.00s] Column[1/0.00s] Update[0/0.00s] Insert[0/0.00s] Delete[1/0.00s] RRD [0/0.00s]: Update[0/0.00s] Create [0/0.00s] Other[0/0.00s]

This one takes longer to return and I don’t get a line showing the file permissions, but I also have no idea what could have gone wrong.

graph /tmp/rYuMpVAvnX8TYVz7 -g -l 0 -u 100 -E --start 1516208400 --end 1516294800 --width 150 --height 45 -c BACK#EEEEEE00 -c SHADEA#EEEEEE00 -c SHADEB#EEEEEE00 -c FONT#000000 -c CANVAS#FFFFFF00 -c GRID#a5a5a5 -c MGRID#FF9999 -c FRAME#5e5e5e -c ARROW#5e5e5e -R normal -c CANVAS#FFFFFF00 --only-graph --font LEGEND:7:DejaVuSansMono --font AXIS:6:DejaVuSansMono --font-render-mode normal COMMENT:'Load % Now Min Max Avg\l' DEF:usage0=s054vmhomelk/processor-hr-196608.rrd:usage:AVERAGE DEF:usage0min=s054vmhomelk/processor-hr-196608.rrd:usage:MIN DEF:usage0max=s054vmhomelk/processor-hr-196608.rrd:usage:MAX CDEF:usage_cdef0=usage0,16,/ CDEF:usage_cdef0min=usage0min,16,/ CDEF:usage_cdef0max=usage0max,16,/ AREA:usage_cdef0#E43C00:'Intel Xeon E5-26 ' GPRINT:usage0:LAST:%5.2lf%s GPRINT:usage0min:MIN:%5.2lf%s GPRINT:usage0max:MAX:%5.2lf%s GPRINT:usage0:AVERAGE:'%5.2lf%s\n' COMMENT:'\n' DEF:usage1=s054vmhomelk/processor-hr-196609.rrd:usage:AVERAGE DEF:usage1min=s054vmhomelk/processor-hr-196609.rrd:usage:MIN DEF:usage1max=s054vmhomelk/processor-hr-196609.rrd:usage:MAX CDEF:usage_cdef1=usage1,16,/ CDEF:usage_cdef1min=usage1min,16,/ CDEF:usage_cdef1max=usage1max,16,/ AREA:usage_cdef1#E74B00:'Intel Xeon E5-26 ':STACK GPRINT:usage1:LAST:%5.2lf%s GPRINT:usage1min:MIN:%5.2lf%s GPRINT:usage1max:MAX:%5.2lf%s GPRINT:usage1:AVERAGE:'%5.2lf%s\n' COMMENT:'\n' DEF:usage2=s054vmhomelk/processor-hr-196610.rrd:usage:AVERAGE DEF:usage2min=s054vmhomelk/processor-hr-196610.rrd:usage:MIN DEF:usage2max=s054vmhomelk/processor-hr-196610.rrd:usage:MAX CDEF:usage_cdef2=usage2,16,/ CDEF:usage_cdef2min=usage2min,16,/ CDEF:usage_cdef2max=usage2max,16,/ AREA:usage_cdef2#EB5B00:'Intel Xeon E5-26 ':STACK GPRINT:usage2:LAST:%5.2lf%s GPRINT:usage2min:MIN:%5.2lf%s GPRINT:usage2max:MAX:%5.2lf%s GPRINT:usage2:AVERAGE:'%5.2lf%s\n' COMMENT:'\n' DEF:usage3=s054vmhomelk/processor-hr-196611.rrd:usage:AVERAGE DEF:usage3min=s054vmhomelk/processor-hr-196611.rrd:usage:MIN DEF:usage3max=s054vmhomelk/processor-hr-196611.rrd:usage:MAX CDEF:usage_cdef3=usage3,16,/ CDEF:usage_cdef3min=usage3min,16,/ CDEF:usage_cdef3max=usage3max,16,/ AREA:usage_cdef3#EF6A00:'Intel Xeon E5-26 ':STACK GPRINT:usage3:LAST:%5.2lf%s GPRINT:usage3min:MIN:%5.2lf%s GPRINT:usage3max:MAX:%5.2lf%s GPRINT:usage3:AVERAGE:'%5.2lf%s\n' COMMENT:'\n' DEF:usage4=s054vmhomelk/processor-hr-196612.rrd:usage:AVERAGE DEF:usage4min=s054vmhomelk/processor-hr-196612.rrd:usage:MIN DEF:usage4max=s054vmhomelk/processor-hr-196612.rrd:usage:MAX CDEF:usage_cdef4=usage4,16,/ CDEF:usage_cdef4min=usage4min,16,/ CDEF:usage_cdef4max=usage4max,16,/ AREA:usage_cdef4#F37900:'Intel Xeon E5-26 ':STACK GPRINT:usage4:LAST:%5.2lf%s GPRINT:usage4min:MIN:%5.2lf%s GPRINT:usage4max:MAX:%5.2lf%s GPRINT:usage4:AVERAGE:'%5.2lf%s\n' COMMENT:'\n' DEF:usage5=s054vmhomelk/processor-hr-196613.rrd:usage:AVERAGE DEF:usage5min=s054vmhomelk/processor-hr-196613.rrd:usage:MIN DEF:usage5max=s054vmhomelk/processor-hr-196613.rrd:usage:MAX CDEF:usage_cdef5=usage5,16,/ CDEF:usage_cdef5min=usage5min,16,/ CDEF:usage_cdef5max=usage5max,16,/ AREA:usage_cdef5#F78800:'Intel Xeon E5-26 ':STACK GPRINT:usage5:LAST:%5.2lf%s GPRINT:usage5min:MIN:%5.2lf%s GPRINT:usage5max:MAX:%5.2lf%s GPRINT:usage5:AVERAGE:'%5.2lf%s\n' COMMENT:'\n' DEF:usage6=s054vmhomelk/processor-hr-196614.rrd:usage:AVERAGE DEF:usage6min=s054vmhomelk/processor-hr-196614.rrd:usage:MIN DEF:usage6max=s054vmhomelk/processor-hr-196614.rrd:usage:MAX CDEF:usage_cdef6=usage6,16,/ CDEF:usage_cdef6min=usage6min,16,/ CDEF:usage_cdef6max=usage6max,16,/ AREA:usage_cdef6#FB9700:'Intel Xeon E5-26 
':STACK GPRINT:usage6:LAST:%5.2lf%s GPRINT:usage6min:MIN:%5.2lf%s GPRINT:usage6max:MAX:%5.2lf%s GPRINT:usage6:AVERAGE:'%5.2lf%s\n' COMMENT:'\n' DEF:usage7=s054vmhomelk/processor-hr-196615.rrd:usage:AVERAGE DEF:usage7min=s054vmhomelk/processor-hr-196615.rrd:usage:MIN DEF:usage7max=s054vmhomelk/processor-hr-196615.rrd:usage:MAX CDEF:usage_cdef7=usage7,16,/ CDEF:usage_cdef7min=usage7min,16,/ CDEF:usage_cdef7max=usage7max,16,/ AREA:usage_cdef7#FFA700:'Intel Xeon E5-26 ':STACK GPRINT:usage7:LAST:%5.2lf%s GPRINT:usage7min:MIN:%5.2lf%s GPRINT:usage7max:MAX:%5.2lf%s GPRINT:usage7:AVERAGE:'%5.2lf%s\n' COMMENT:'\n' DEF:usage8=s054vmhomelk/processor-hr-196616.rrd:usage:AVERAGE DEF:usage8min=s054vmhomelk/processor-hr-196616.rrd:usage:MIN DEF:usage8max=s054vmhomelk/processor-hr-196616.rrd:usage:MAX CDEF:usage_cdef8=usage8,16,/ CDEF:usage_cdef8min=usage8min,16,/ CDEF:usage_cdef8max=usage8max,16,/ AREA:usage_cdef8#E43C00:'Intel Xeon E5-26 ':STACK GPRINT:usage8:LAST:%5.2lf%s GPRINT:usage8min:MIN:%5.2lf%s GPRINT:usage8max:MAX:%5.2lf%s GPRINT:usage8:AVERAGE:'%5.2lf%s\n' COMMENT:'\n' DEF:usage9=s054vmhomelk/processor-hr-196617.rrd:usage:AVERAGE DEF:usage9min=s054vmhomelk/processor-hr-196617.rrd:usage:MIN DEF:usage9max=s054vmhomelk/processor-hr-196617.rrd:usage:MAX CDEF:usage_cdef9=usage9,16,/ CDEF:usage_cdef9min=usage9min,16,/ CDEF:usage_cdef9max=usage9max,16,/ AREA:usage_cdef9#E74B00:'Intel Xeon E5-26 ':STACK GPRINT:usage9:LAST:%5.2lf%s GPRINT:usage9min:MIN:%5.2lf%s GPRINT:usage9max:MAX:%5.2lf%s GPRINT:usage9:AVERAGE:'%5.2lf%s\n' COMMENT:'\n' DEF:usage10=s054vmhomelk/processor-hr-196618.rrd:usage:AVERAGE DEF:usage10min=s054vmhomelk/processor-hr-196618.rrd:usage:MIN DEF:usage10max=s054vmhomelk/processor-hr-196618.rrd:usage:MAX CDEF:usage_cdef10=usage10,16,/ CDEF:usage_cdef10min=usage10min,16,/ CDEF:usage_cdef10max=usage10max,16,/ AREA:usage_cdef10#EB5B00:'Intel Xeon E5-26 ':STACK GPRINT:usage10:LAST:%5.2lf%s GPRINT:usage10min:MIN:%5.2lf%s GPRINT:usage10max:MAX:%5.2lf%s GPRINT:usage10:AVERAGE:'%5.2lf%s\n' COMMENT:'\n' DEF:usage11=s054vmhomelk/processor-hr-196619.rrd:usage:AVERAGE DEF:usage11min=s054vmhomelk/processor-hr-196619.rrd:usage:MIN DEF:usage11max=s054vmhomelk/processor-hr-196619.rrd:usage:MAX CDEF:usage_cdef11=usage11,16,/ CDEF:usage_cdef11min=usage11min,16,/ CDEF:usage_cdef11max=usage11max,16,/ AREA:usage_cdef11#EF6A00:'Intel Xeon E5-26 ':STACK GPRINT:usage11:LAST:%5.2lf%s GPRINT:usage11min:MIN:%5.2lf%s GPRINT:usage11max:MAX:%5.2lf%s GPRINT:usage11:AVERAGE:'%5.2lf%s\n' COMMENT:'\n' DEF:usage12=s054vmhomelk/processor-hr-196620.rrd:usage:AVERAGE DEF:usage12min=s054vmhomelk/processor-hr-196620.rrd:usage:MIN DEF:usage12max=s054vmhomelk/processor-hr-196620.rrd:usage:MAX CDEF:usage_cdef12=usage12,16,/ CDEF:usage_cdef12min=usage12min,16,/ CDEF:usage_cdef12max=usage12max,16,/ AREA:usage_cdef12#F37900:'Intel Xeon E5-26 ':STACK GPRINT:usage12:LAST:%5.2lf%s GPRINT:usage12min:MIN:%5.2lf%s GPRINT:usage12max:MAX:%5.2lf%s GPRINT:usage12:AVERAGE:'%5.2lf%s\n' COMMENT:'\n' DEF:usage13=s054vmhomelk/processor-hr-196621.rrd:usage:AVERAGE DEF:usage13min=s054vmhomelk/processor-hr-196621.rrd:usage:MIN DEF:usage13max=s054vmhomelk/processor-hr-196621.rrd:usage:MAX CDEF:usage_cdef13=usage13,16,/ CDEF:usage_cdef13min=usage13min,16,/ CDEF:usage_cdef13max=usage13max,16,/ AREA:usage_cdef13#F78800:'Intel Xeon E5-26 ':STACK GPRINT:usage13:LAST:%5.2lf%s GPRINT:usage13min:MIN:%5.2lf%s GPRINT:usage13max:MAX:%5.2lf%s GPRINT:usage13:AVERAGE:'%5.2lf%s\n' COMMENT:'\n' DEF:usage14=s054vmhomelk/processor-hr-196622.rrd:usage:AVERAGE 
DEF:usage14min=s054vmhomelk/processor-hr-196622.rrd:usage:MIN DEF:usage14max=s054vmhomelk/processor-hr-196622.rrd:usage:MAX CDEF:usage_cdef14=usage14,16,/ CDEF:usage_cdef14min=usage14min,16,/ CDEF:usage_cdef14max=usage14max,16,/ AREA:usage_cdef14#FB9700:'Intel Xeon E5-26 ':STACK GPRINT:usage14:LAST:%5.2lf%s GPRINT:usage14min:MIN:%5.2lf%s GPRINT:usage14max:MAX:%5.2lf%s GPRINT:usage14:AVERAGE:'%5.2lf%s\n' COMMENT:'\n' DEF:usage15=s054vmhomelk/processor-hr-196623.rrd:usage:AVERAGE DEF:usage15min=s054vmhomelk/processor-hr-196623.rrd:usage:MIN DEF:usage15max=s054vmhomelk/processor-hr-196623.rrd:usage:MAX CDEF:usage_cdef15=usage15,16,/ CDEF:usage_cdef15min=usage15min,16,/ CDEF:usage_cdef15max=usage15max,16,/ AREA:usage_cdef15#FFA700:'Intel Xeon E5-26 ':STACK GPRINT:usage15:LAST:%5.2lf%s GPRINT:usage15min:MIN:%5.2lf%s GPRINT:usage15max:MAX:%5.2lf%s GPRINT:usage15:AVERAGE:'%5.2lf%s\n' COMMENT:'\n' --daemon server.ip:42218

command returned ()

graph /tmp/rYuMpVAvnX8TYVz7 -g --alt-autoscale-max --rigid -E --start 1516208400 --end 1516294800 --width 150 --height 45 -c BACK#EEEEEE00 -c SHADEA#EEEEEE00 -c SHADEB#EEEEEE00 -c FONT#000000 -c CANVAS#FFFFFF00 -c GRID#a5a5a5 -c MGRID#FF9999 -c FRAME#5e5e5e -c ARROW#5e5e5e -R normal -c CANVAS#FFFFFF00 --only-graph --font LEGEND:7:DejaVuSansMono --font AXIS:6:DejaVuSansMono --font-render-mode normal HRULE:0#555555 --title='Draw Error' --daemon server.ip:42218

command returned ()

[binary PNG data: the small “Draw Error” image returned by rrdtool]
Runtime 39.937s
SNMP [0/0.00s]: Get[0/0.00s] Getnext[0/0.00s] Walk[0/0.00s] MySQL [12/0.00s]: Cell[0/0.00s] Row[2/0.00s] Rows[8/0.00s] Column[1/0.00s] Update[0/0.00s] Insert[0/0.00s] Delete[1/0.00s] RRD [0/0.00s]: Update[0/0.00s] Create [0/0.00s] Other[0/0.00s]

You need to look into what’s taking so long to generate that graph; unfortunately I don’t have any suggestions on where to start with that.

What I think is that the problem is rrdcached performance, since the web UI makes a connection to rrdcached to generate each graph; that’s what I saw when monitoring the connections on the rrdcached server.

I’m already running rrdcached on a dedicated machine with 16 cores, and rrdcached only scales up; I haven’t found any way to run another rrdcached instance to split the load, and I can’t scale up infinitely.

The solution I’m considering right now is to make rrdcached write to disk every 5 minutes and configure the web server not to connect to rrdcached at all, just read the RRDs from the shared folder, but that is not ideal.
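If you do go that way, the flush behaviour is controlled by the same rrdcached timers sketched above: -w is the write timeout, -z spreads the writes out, and -f forces a flush of anything older. A 5-minute flush would look roughly like this (same placeholder paths as before):

rrdcached -l unix:/run/rrdcached/rrdcached.sock -l 0.0.0.0:42218 \
          -b /opt/librenms/rrd -B -j /var/lib/rrdcached/journal \
          -w 300 -z 150 -f 600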

Can you think of anything else I could do to improve rrdcached performance?

We run our web server through rrdcached and don’t see these issues (using SVG graph output, by the way). 4k devices, quite a few users logging in each day, and we also serve graphs via the API for clients.

As an example, the number of graph hits for us is about 100k today alone, all through one web server which accesses the RRD files via rrdcached.

Maybe run strace on the rrdcached process to see what it’s doing.
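Something along these lines, for instance (the syscall filter and output file are just suggestions):

# follow threads, timestamp each call and show the time spent in it,
# limited to the I/O syscalls rrdcached spends most of its time in
strace -f -tt -T -e trace=read,write,fsync,fdatasync,openat \
       -p "$(pidof rrdcached)" -o /tmp/rrdcached.strace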

Yeah, I’m going to try that to see if I can find anything else.

Do you use an NFS export for the RRD files, or are they on a local disk on the rrd machine? I’m using an NFS export, which I also mount on each node.

Yup using nfs here
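For comparison, a typical fstab entry for that kind of export would be along these lines (host, paths and options are illustrative, not our exact settings):

rrdhost.example.com:/opt/librenms/rrd  /opt/librenms/rrd  nfs  rw,hard,noatime,rsize=65536,wsize=65536,vers=4.1  0  0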

Well, just an update: I solved the problem. The cause was apparently the slow connection between the server holding the RRD files and the rrdcached server.

I changed things a little by adding an SSD-backed disk to the rrdcached server and moving the RRD files to it. Now everything has improved: the Web UI loads very fast, the polling time went down, and the load on the rrdcached server also improved.
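For anyone hitting the same issue, a rough sketch of that kind of migration (placeholder paths; not the exact steps taken here):

# stop the daemon so no updates land mid-copy (-F in its options flushes pending values on shutdown)
systemctl stop rrdcached
# copy the RRD tree from the NFS export to the local SSD-backed disk
rsync -a /mnt/nfs/rrd/ /data/ssd/rrd/
# point rrdcached's -b option and LibreNMS's rrd_dir at the new path, then bring it back up
systemctl start rrdcached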
