Docker app CPU graph has gaps during high usage

Jonathan_Schram · 10 July 2025 02:34

Hi, I filed a bug on GitHub about an issue I’ve been having with LibreNMS, but it was closed as a support issue, so I’m hoping someone here can point me in the right direction.

I am monitoring Docker containers on TrueNAS SCALE using the official Docker agent.

The problem I am having is that all of my CPU-heavy containers render with gaps when the CPU usage is too high. Here is an example CPU graph:
[Image: yacy docker cpu graph]

The graph is visible in places, but in others it is a series of short lines or even completely blank for large stretches of time. The host has 40 cores (80 with hyperthreading), so I don’t believe this is a lack of CPU power on the host. I don’t see any signs of a timeout in the logs or the poller timing graphs. Polling always completes before the start of the next poll cycle.

As evidence that polling is working fine, here is a memory graph over the same time period:
[Image: yacy docker memory graph]
There are no issues here. All graphs other than CPU similarly render with no gaps. To me, this is strong evidence that SNMP is not timing out and the docker agent script is running on the host successfully.

I did some troubleshooting on my own and it looks like the docker agent can report CPU usage above 100%. For the docker container above, I allocated 8 CPUs and docker-stats.py reports "cpu": "767.23%". That lines up pretty well with how Linux’s top utility works; I’m not sure if this is a side effect of the platform (Linux) of the host or if the Docker command that docker-stats.py invokes deliberately does this.
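For reference, a value above 100% is consistent with how the `docker stats` CPU column is usually described: the container's CPU time delta divided by the system CPU time delta, multiplied by the number of online CPUs. Below is a minimal sketch of that arithmetic. The function name and the sample numbers are made up for illustration, and this is not the code from docker-stats.py.

```python
def cpu_percent(cpu_delta: float, system_delta: float, online_cpus: int) -> float:
    """Approximate a docker-stats-style CPU % for one sampling interval.

    cpu_delta:    CPU time the container used during the interval
    system_delta: total CPU time the host accumulated in the same interval
                  (summed over all CPUs)
    online_cpus:  number of CPUs visible to the host
    """
    if system_delta <= 0 or cpu_delta < 0:
        # No usable interval, report idle rather than dividing by zero.
        return 0.0
    # One fully busy core equals 100%, so N busy cores give N * 100%.
    return (cpu_delta / system_delta) * online_cpus * 100.0


# Made-up sample: an 80-CPU host accumulates 80 units of CPU time,
# of which the container used 7.5 units (about 7.5 busy cores).
print(cpu_percent(7.5, 80.0, 80))  # 750.0
```

With this formula the upper bound is 100% times the number of cores the container can use, so a value near 800% on an 8-CPU container means it is close to saturated.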

My hypothesis here is that LibreNMS expects the CPU graph to always be between 0 and 100%, but doesn’t account for this behavior and so it discards everything out of that range. It matches what I tend to see, where the graph renders fine when I know the container was mostly idle, but has gaps and disappears when I know it was being used heavily. My guess here is that removing the 100% limit should remove the gaps.
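To illustrate the hypothesis: rrdtool stores any sample that falls outside a data source's min/max range as unknown, and unknown samples render as gaps. The sketch below mimics that rule in plain Python. The 0 to 100 limits are an assumption about how the Docker CPU data source is defined, not something confirmed from the LibreNMS source, and the sample values are invented.

```python
from typing import List, Optional


def stored_value(
    sample: float,
    ds_min: Optional[float] = 0.0,
    ds_max: Optional[float] = 100.0,
) -> Optional[float]:
    """Mimic rrdtool's range check: out-of-range samples become unknown (None).

    A limit of None stands for rrdtool's "U" (no limit on that side).
    """
    if ds_min is not None and sample < ds_min:
        return None
    if ds_max is not None and sample > ds_max:
        return None
    return sample


# Invented polling samples: idle, busy on one core, then busy on many cores.
samples: List[float] = [12.5, 98.0, 767.23, 350.0, 40.1]

print([stored_value(s) for s in samples])
# [12.5, 98.0, None, None, 40.1]  -> the None entries would show up as gaps

print([stored_value(s, ds_max=None) for s in samples])
# [12.5, 98.0, 767.23, 350.0, 40.1]  -> no gaps once the upper limit is removed
```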

On the GitHub bug, it was recommended that I modify the RRD file, but the syntax is very complicated. Would someone be able to help me with this issue? I don’t like the idea of having to recreate the RRD file every time I add a new Docker container, but it would at least allow me to see the graphs in full.
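One possible way to lift the limit on an existing file without recreating it is `rrdtool tune` with `--maximum <ds-name>:U`, where `U` means no upper bound. The sketch below only builds the command and runs it if asked to. The file path and the data source name `cpu_usage` are placeholders I made up, so check the real names with `rrdtool info <file>` first. Since this install uses rrdcached, pending updates may also need to be flushed before tuning.

```python
import subprocess
from typing import List


def build_tune_command(rrd_path: str, ds_name: str) -> List[str]:
    """Build an `rrdtool tune` call that sets the data source maximum to unknown (U)."""
    return ["rrdtool", "tune", rrd_path, "--maximum", f"{ds_name}:U"]


def remove_max(rrd_path: str, ds_name: str, dry_run: bool = True) -> List[str]:
    """Return the command, and execute it only when dry_run is False."""
    cmd = build_tune_command(rrd_path, ds_name)
    if not dry_run:
        # Raises CalledProcessError if rrdtool exits with a non-zero status.
        subprocess.run(cmd, check=True)
    return cmd


# Hypothetical path and data source name, shown as a dry run only.
print(" ".join(remove_max("/path/to/docker-container.rrd", "cpu_usage")))
# rrdtool tune /path/to/docker-container.rrd --maximum cpu_usage:U
```

This only changes files that already exist. Any RRD created later for a new container would still get whatever limit the application defines.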

Steps to reproduce:

1. Set up the Docker agent on the host and enable Docker polling in LibreNMS.
2. Install a Docker container that uses a lot of CPU, and assign it at least 2 cores. The more CPU cores, the easier it is to see the effect. (whisper-asr is great, and so is YaCy, which I’m using here.)
3. Run the container, ensure LibreNMS is polling Docker, and ensure the container is using a lot of CPU power.
4. Go to the Docker app in LibreNMS and observe the CPU graph.

I’m using Chrome Version 138.0.7204.97 (Official Build) (64-bit)

./validate.php output:

===========================================
Component | Version
--------- | -------
LibreNMS  | 25.6.0-142-g4ae643f0c (2025-07-09T14:09:29-04:00)
DB Schema | 2025_07_08_111910_change_stp_bridge_max_age_size (351)
PHP       | 8.3.21
Python    | 3.12.9
Database  | MariaDB 11.8.2-MariaDB-ubu2404
RRDTool   | 1.8.0
SNMP      | 5.9.4.pre2
===========================================

[OK]    Composer Version: 2.8.9
[OK]    Dependencies up-to-date.
[OK]    Database Connected
[OK]    Database Schema is current
[OK]    SQL Server meets minimum requirements
[OK]    lower_case_table_names is enabled
[OK]    MySQL engine is optimal
[OK]    Database and column collations are correct
[OK]    Database schema correct
[OK]    MySQL and PHP time match
[OK]    Active pollers found
[OK]    Dispatcher Service not detected
[OK]    Locks are functional
[OK]    Python poller wrapper is polling
[OK]    Redis is unavailable
[OK]    rrdtool version ok
[OK]    Connected to rrdcached

Thanks in advance!

murrant · 10 July 2025 05:27

The max CPU % is 100%.

Jonathan_Schram · 13 July 2025 01:46

I get that the max CPU is 100%. That’s the problem here. docker-stats.py returns 767.23% and it is breaking the graph.

It seems like it should be simple to remove this restriction (for someone familiar with the system, at least), but I don’t want to assume. Maybe you mean there are large design requirements in LibreNMS that require max CPU to always be 100%, so this is impossible to change. Or maybe there is a strong incentive not to force everyone to recreate their RRD files. If this is the case, could docker-stats.py be changed to always return a value under 100%? Maybe divide by the number of allocated CPU cores?
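The divide-by-cores idea could look roughly like this. It is only a sketch: the `"767.23%"` string format is taken from the agent output quoted in this thread, while the function name and the way the allocated core count is obtained are assumptions, not how docker-stats.py works today.

```python
def normalize_cpu(raw: str, allocated_cpus: int) -> float:
    """Convert a docker-style CPU string such as "767.23%" to a 0-100 scale.

    allocated_cpus is the number of cores the container may use; how the
    agent would discover that number is left open here.
    """
    if allocated_cpus < 1:
        raise ValueError("allocated_cpus must be at least 1")
    value = float(raw.strip().rstrip("%"))
    # Cap at 100 so small measurement overshoots still fit the graph's range.
    return min(value / allocated_cpus, 100.0)


print(round(normalize_cpu("767.23%", 8), 2))  # 95.9
```

The trade-off is that the graph would then show utilization relative to the container's allocation rather than the number of busy cores.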

The different behavior between the graphs (0-100%) and docker-stats.py (0 to hundreds of percent) seems like something that should be fixed.

ChrisK928 · 15 July 2025 14:26

I think that’s the problem here. It seems the fact that something could consume more than one CPU core was not accounted for. So either the script should return the load divided by the number of cores, or the SNMP backend should deal with values larger than 100%.