LibreNMS Graphs Stopped Populating - why? How to restore?

Hi @jihaddaouk, this server has many devices, something like 200. I imagined that after restarting the job it might take a while to catch up and then settle, but that doesn’t seem to be the case.

I can add devices, but would appreciate any thoughts on why the graphs are now choppy. They were not choppy before I resized the HDD.

I also wonder why the rrdcached docker container isn’t working; from what I’m seeing, that should replace the need for a cron job.

This server is only used for LibreNMS.

Hi @liamnap,
I have never worked with rrdcached, but are you able to see these 200 devices on the devices page?

You can run the command docker service ls and check the container that is related to rrdcached.

Best regards,

hey @jihaddaouk, thank you again, here’s what I am looking at for rrdcached:

$ docker service ls
Error response from daemon: This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.
$ docker ps
CONTAINER ID   IMAGE                      COMMAND                  CREATED         STATUS                  PORTS                                                          NAMES
cfd55543a8e1   adolfintel/speedtest       "docker-php-entrypoi…"   15 months ago   Up 2 days               0.0.0.0:80->80/tcp, :::80->80/tcp                              librespeed
db210a218757   librenms/librenms:latest   "/init"                  16 months ago   Up 21 hours             514/tcp, 514/udp, 0.0.0.0:8000->8000/tcp, :::8000->8000/tcp    nms_librenms
0b11c6a03748   balabit/syslog-ng          "/usr/sbin/syslog-ng…"   2 years ago     Up 4 weeks (healthy)    601/tcp, 6514/tcp, 0.0.0.0:50514->514/udp, :::50514->514/udp   nms_syslog
455bff71219e   grafana/grafana:7.3.3      "/run.sh"                2 years ago     Up 4 weeks              0.0.0.0:3000->3000/tcp, :::3000->3000/tcp                      nms_grafana
6452321d9a41   telegraf                   "/entrypoint.sh tele…"   2 years ago     Up 4 weeks              8092/udp, 8125/udp, 8094/tcp                                   nms_telegraf
9316ec8448be   influxdb                   "/entrypoint.sh infl…"   2 years ago     Up 4 weeks              8086/tcp                                                       nms_influxdb
36a8765bd68d   oxidized/oxidized:latest   "/sbin/my_init"          2 years ago     Up 13 days              0.0.0.0:8888->8888/tcp, :::8888->8888/tcp                      nms_oxidized
59f7d62c9ab9   librenms/librenms:latest   "/init"                  2 years ago     Up 4 weeks              514/tcp, 8000/tcp, 514/udp                                     nms_dispatcher
68b78f335430   andyshinn/dnsmasq:2.75     "dnsmasq -k"             2 years ago     Up 7 days               53/tcp, 53/udp                                                 nms_dns
b956a4d03546   mariadb:10.2               "docker-entrypoint.s…"   2 years ago     Up 4 weeks              3306/tcp                                                       nms_db
3e4915e76a01   redis:5.0-alpine           "docker-entrypoint.s…"   2 years ago     Up 4 weeks              6379/tcp                                                       nms_redis
c6cf6f102294   memcached:alpine           "docker-entrypoint.s…"   2 years ago     Up 4 weeks              11211/tcp                                                      nms_memcached
3e1bf3250c15   crazymax/rrdcached         "/init"                  2 years ago     Up 39 hours (healthy)   42217/tcp                                                      nms_rrdcached
$ systemctl status rrdcached.service
Unit rrdcached.service could not be found.
$ docker exec -it nms_rrdcached bash
bash-5.0# rrdcached start
rrdcached: can't create pid file '/usr/var/run/rrdcached.pid' (File exists)
FATAL: Another rrdcached daemon is running?? (pid 89176)
rrdcached: daemonize failed, exiting.
bash-5.0# systemctl status rrdcached
bash: systemctl: command not found
bash-5.0# exit
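
For later readers: docker service ls only works on a swarm manager, so on a standalone host like this one the plain docker CLI is the way to check the container. A minimal sketch, assuming the container name nms_rrdcached from the docker ps output above; the function name rrdcached_health is made up for illustration.

```shell
# Hypothetical helper: report the run state and health of the rrdcached container.
# The default name nms_rrdcached is taken from the docker ps output in this thread.
rrdcached_health() {
    container="${1:-nms_rrdcached}"
    # .State.Status is e.g. "running"; .State.Health.Status is e.g. "healthy".
    # The Health field only exists when the image defines a HEALTHCHECK,
    # which this one appears to, since docker ps shows "(healthy)".
    docker inspect --format '{{.State.Status}} {{.State.Health.Status}}' "$container"
}
```

Calling rrdcached_health on the host should print something like "running healthy". If it does not, docker logs --tail 50 nms_rrdcached is the next place to look.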

Hi,

Try one of the following commands:

service rrdcached status
or
ps -ef | grep rrdcached
Since you don’t have systemd installed in your container image, you should be able to use service rrdcached restart|stop|start.

Give it a try and let’s see.

Best regards,

You’re an absolute font of knowledge, thank you for this :slight_smile:

What’re we thinking here?

$ docker exec -it nms_rrdcached bash
bash-5.0# service rrdcached status
bash: service: command not found
bash-5.0# ps -ef | grep rrdcached
 1348 root      0:00 s6-supervise rrdcached
 1352 rrdcache  0:18 /usr/sbin/rrdcached -g -L -F -B -R -l /var/run/rrdcached/rrdcached.sock -p /var/run/rrdcached/rrdcached.pid -b /data/db -j /data/journal -U rrdcached -G rrdcached -w 1800 -z 1800 -f 3600 -t 4 -V LOG_INFO
89176 root      0:03 rrdcached info
150272 root      0:00 grep rrdcached
bash-5.0# service rrdcached restart
bash: service: command not found
bash-5.0# service rrdcached --help
bash: service: command not found
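
A note on reading that ps output: the s6-supervise rrdcached line (pid 1348) suggests the image runs the daemon under the s6 supervisor, which would explain why neither systemctl nor service exists in the container. Pid 1352 looks like the supervised daemon, and pid 89176 (owned by root, and the pid named in the earlier FATAL message) looks like a stray left over from a manual rrdcached invocation. The sketch below, with a made-up function name, picks the actual daemon processes out of BusyBox-style ps output.

```shell
# Hypothetical helper: print "PID USER" for rrdcached daemon processes.
# Reads BusyBox-style ps output on stdin (columns: PID USER TIME COMMAND)
# and skips the s6-supervise and grep lines, whose command is not rrdcached.
rrdcached_pids() {
    awk '$4 == "rrdcached" || $4 ~ /\/rrdcached$/ { print $1, $2 }'
}
```

Inside the container this would be used as ps -ef | rrdcached_pids.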

Hi,

Sounds like it is not run as a service. You can do the following:

kill -9 1352 (this is the process used by rrdcached)
delete the file /usr/var/run/rrdcached.pid
rrdcached start
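
An alternative that avoids kill -9, sketched under the assumption that the s6-supervised daemon (pid 1352) is the healthy one and the stale /usr/var/run/rrdcached.pid belongs to a manually started copy: restart the whole container from the host, so the image's own init brings rrdcached back up and any stray process goes away with it. The container name comes from the docker ps output earlier in the thread; the function name is hypothetical.

```shell
# Hypothetical helper: restart the rrdcached container from the host and
# print its most recent log lines so any startup errors are visible.
restart_rrdcached() {
    container="${1:-nms_rrdcached}"
    docker restart "$container" && docker logs --tail 20 "$container"
}
```

The supervised daemon is started with its own pid file (the -p argument in the ps output), so a leftover /usr/var/run/rrdcached.pid should not get in its way, though that is an inference from the output above rather than something tested here.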

BR,

Thanks to @snmpd and @jihaddaouk

I’m somewhat there now, and here’s what I did to recover. I still cannot understand the relationship between rrdcached and LibreNMS when using containers, why rrdcached didn’t kick in, or why the graphs started to look choppy when nothing in the environment changed other than the HDD size. But I’ve learnt a lot.

Running the cron job as described by jihaddaouk is useful, although it should not be necessary with the containers, as I have observed on my other working servers. Use crontab -l inside the container to check what is installed. It does seem to help get the processes running. The command, once in the container, is crontab librenms.cron (with the root user removed from the file, per jihaddaouk’s advice earlier in the thread). When I ran librenms.nonroot.cron I did not see the same results, so I have chosen not to use that cron file.

Something else: I was clicking around the GUI and checked the pollers, and compared to my working servers I noticed a lot of 0s in the workers seconds column. To fix this I increased the number of workers. To do this from your LibreNMS GUI:

  1. Hover the cog in the top right
  2. Select Poller > Poller (observe your consumed workers in seconds)
  3. Select Settings in the top headings of the page
  4. Expand to Advanced (top right)
  5. Grow your workers for Pollers, Discovery and Workers.
  6. I turned off Billing as not required for my deployment (although this still seems to run…)
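
A rough way to size the poller workers, as a sketch only: LibreNMS polls every 300 seconds by default, and if each worker handles one device at a time, the workers together need to get through every device within that window or the graphs end up with gaps. The 10-second average poll time below is an assumed figure; the real one can be read off the Poller page. The function name is made up.

```shell
# Hypothetical estimate: minimum workers = ceil(devices * avg_secs / interval).
# interval defaults to 300 s, the default LibreNMS polling interval.
min_workers() {
    devices="$1"; avg_secs="$2"; interval="${3:-300}"
    # Integer ceiling division: (a + b - 1) / b
    echo $(( (devices * avg_secs + interval - 1) / interval ))
}

min_workers 200 10   # 200 devices at an assumed 10 s each -> prints 7
```

This is a floor rather than a target. Slow or unreachable devices hold a worker until they time out, so some headroom above the estimate is sensible.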

With the cron job running and the increased workers, my graphs did seem to populate.

I still seem to have an issue where graphs stop, but mostly it is better. When it stops I can restart the container and, if needed, start the cron job, although starting the cron job is the last resort.

I might have to move away from Libre as this was too tough to troubleshoot, but it was a good exercise nonetheless.

Hi,

I believe rrdcached is used to improve performance when you have a large number of devices. I have read somewhere in this community that LibreNMS can monitor 20k devices.
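
For context, rrdcached sits between LibreNMS and the RRD files: updates go to the daemon, which batches them and writes to disk later (the -w 1800 in the ps output above is its write timeout in seconds). One way to see whether updates are arriving at all is the daemon's STATS command. A sketch, assuming nc is available wherever this is run and that the daemon is reachable as rrdcached on port 42217 (the port is from the docker ps output; the hostname is a guess, and both function names are made up):

```shell
# Hypothetical helper: ask rrdcached for its statistics over TCP.
# Host "rrdcached" is an assumption; port 42217 is from docker ps above.
rrdcached_stats() {
    host="${1:-rrdcached}"; port="${2:-42217}"
    printf 'STATS\nQUIT\n' | nc "$host" "$port"
}

# Hypothetical helper: pull one counter (e.g. UpdatesReceived) out of
# STATS output on stdin, where each line has the form "Name: value".
stat_value() {
    awk -F': ' -v key="$1" '$1 == key { print $2 }'
}
```

Running rrdcached_stats | stat_value UpdatesReceived twice, a few minutes apart, shows whether the counter is growing. If it is not, LibreNMS is probably not sending updates to the daemon.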

Best regards,