Daily gaps in graphs - Around 12-1am

Version 1.54 - Mon Jul 29 2019 03:47:34 GMT+0100
DB Schema 2019_07_03_132417_create_mpls_saps_table (139)
Web Server Apache/2.4.29 (Ubuntu)
PHP 7.3.7-2+ubuntu18.04.1+deb.sury.org+1
MySQL 5.7.27-0ubuntu0.18.04.1
RRDtool 1.7.0

Is anyone able to help, please? I have a 3-poller cluster for around 750 hosts. I have some gaps on Juniper hosts which appear to be an issue within their SNMP stack. However, every day I get around an hour's worth of gap between 12am and 1am. I cannot seem to find what is causing it. My cron jobs are default and, if run manually, do not appear to take long.
Has anyone seen this, please?
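
For reference, this is how I confirm the default schedule (paths assume a standard install with the cron file at /etc/cron.d/librenms):

# Show the full LibreNMS cron schedule (location assumed for a standard install)
cat /etc/cron.d/librenms

# The nightly maintenance job is the usual suspect for a just-after-midnight
# window; confirm when daily.sh is actually scheduled on this box
grep -n "daily.sh" /etc/cron.d/librenms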


What does your LibreNMS poller CPU utilisation get to during this time? I think I have a similar issue.

You may want to check what you have running in cron at that time; it sounds like something is kicking off then that could cause this.
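
Something along these lines will list everything that could be firing around midnight (the usual Debian/Ubuntu cron locations; adjust for your distro):

# System-wide and per-user cron entries
sudo cat /etc/crontab /etc/cron.d/*
sudo ls /etc/cron.daily/
sudo crontab -l -u librenms

# What cron actually ran during the gap window, from syslog
sudo grep CRON /var/log/syslog | grep " 00:"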

Are you using the Dispatcher Service? I am seeing similar behavior since switching from the Cron jobs to the service. I am in GMT-0400, and have gaps every evening from 20:00 - 20:20.

Anyone else running the service care to comment on whether you’ve seen this behavior?
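
For what it's worth, this is how I am looking at the service's own log during the window (the unit name librenms.service is an assumption based on the dispatcher service docs, and the timestamps are just an example window):

# Dispatcher service log covering one evening's gap
# (adjust the unit name and times to your setup)
sudo journalctl -u librenms.service --since "2019-08-20 19:55" --until "2019-08-20 20:25"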

Yes, I am using this service.

I ended up moving back to the cron jobs.

I have been having a similar issue since switching to the dispatcher service.

No other cron jobs run at this time. It seems to be related to daily.sh:

[ Fri, 22 Nov 2019 00:03:14 +0000 ] LibreNMS Notifications (44)
[ Fri, 22 Nov 2019 00:03:14 +0000 ] misc/notifications.rss (55)
[ Fri, 22 Nov 2019 00:03:14 +0000 ] Updating DB Done
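
Timing a manual run is a quick way to confirm; a sketch, assuming a standard /opt/librenms install where daily.sh logs to logs/daily.log:

# Run the maintenance script by hand as the librenms user and see how long it takes;
# if it blocks here for many minutes, it will also block the nightly run
time sudo -u librenms /opt/librenms/daily.sh

# The same output lands in the log quoted above
tail -n 50 /opt/librenms/logs/daily.log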


I actually think this is correct. Looking at the information from the application, it showed 24 million lines in my event log, and daily.sh was locking when running. I manually dealt with the 24 million lines, so I will see if the gaps disappear tonight. Despite the 3 polling machines and off-board SQL, the processors on all of them are running terribly high, and I do wish Juniper would sort their SNMP engine out so it polled suitably within the polling time.
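
In case it is useful to anyone else, a sketch of the kind of clean-up I mean (the eventlog table and its datetime column are assumptions about the schema; verify with SHOW COLUMNS FROM eventlog before deleting anything):

# How big is the backlog daily.sh has to clean?
mysql -u librenms -p librenms -e "SELECT COUNT(*) FROM eventlog"

# Trim it in batches so one huge, lock-holding DELETE is avoided
# (column name 'datetime' is an assumption; adjust the retention to taste)
mysql -u librenms -p librenms -e "DELETE FROM eventlog WHERE datetime < NOW() - INTERVAL 30 DAY LIMIT 100000"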

Things are getting worse for me; today the poller stopped for 12 minutes:

/opt/librenms/poller.php 137 2020-01-03 00:07:33 - 1 devices polled in 77.23 secs
/opt/librenms/poller.php 483 2020-01-03 00:07:34 - 1 devices polled in 70.50 secs
/opt/librenms/discovery.php 70 2020-01-03 00:08:05 - 1 devices discovered in 137.4 secs
/opt/librenms/discovery.php 77 2020-01-03 00:08:24 - 1 devices discovered in 131.0 secs
/opt/librenms/discovery.php 76 2020-01-03 00:08:24 - 1 devices discovered in 132.6 secs
/opt/librenms/discovery.php 71 2020-01-03 00:08:54 - 1 devices discovered in 186.1 secs
/opt/librenms/discovery.php 72 2020-01-03 00:08:59 - 1 devices discovered in 187.0 secs
/opt/librenms/discovery.php 65 2020-01-03 00:18:20 - 1 devices discovered in 787.8 secs
/opt/librenms/poller.php 633 2020-01-03 00:19:00 - 1 devices polled in 36.61 secs
/opt/librenms/poller.php 206 2020-01-03 00:19:00 - 1 devices polled in 37.28 secs
/opt/librenms/poller.php 61 2020-01-03 00:19:00 - 1 devices polled in 37.54 secs

In daily.log I see it was dealing with around 1M rows, so what I've tried is to purge a bit more: I set syslog messages to be cleaned each day and re-ran ./daily.sh:

Syslog cleared for entries over 1 days (about 6158000 rows)

And no gaps.

So it seems that it is something else, not the purging of unused data, that locks the poller.
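
For reference, the retention settings I am talking about live in config.php; a rough sketch (option names as in the cleanup documentation for this version, values are only examples, and the append assumes config.php has no closing ?> tag):

# Tighter retention so the nightly cleanup never has millions of rows to purge in one go
cat <<'EOF' | sudo tee -a /opt/librenms/config.php
$config['syslog_purge']      = 1;   // days of syslog to keep
$config['eventlog_purge']    = 30;  // days of eventlog to keep
$config['perf_times_purge']  = 30;
$config['device_perf_purge'] = 7;
EOF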

Hi, could you please let me know if you managed to fix this issue permanently? I am facing the exact same problem, currently with a 26-minute gap:

/opt/librenms/discovery.php 267 2020-03-05 01:08:03 - 1 devices discovered in 48.29 secs
/opt/librenms/discovery.php 268 2020-03-05 01:08:06 - 1 devices discovered in 48.32 secs
/opt/librenms/discovery.php 257 2020-03-05 01:08:10 - 1 devices discovered in 81.83 secs
/opt/librenms/discovery.php 64 2020-03-05 01:34:11 - 1 devices discovered in 1994. secs
/opt/librenms/discovery.php 193 2020-03-05 01:36:01 - 1 devices discovered in 1915. secs
/opt/librenms/poller.php 38 2020-03-05 01:36:11 - 1 devices polled in 9.689 secs
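
My next step is probably to re-run discovery for that slow device by hand and see where the time goes; a sketch (device id 64 from the log above, flags as listed in the script's own usage output):

# Re-run discovery for the slow device with debug output to see which module eats the time
sudo -u librenms /opt/librenms/discovery.php -h 64 -d

# Or restrict it to a single suspected module, e.g. ports
sudo -u librenms /opt/librenms/discovery.php -h 64 -d -m ports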

Daily.log:

Updating to latest codebase
fatal: unable to access 'https://github.com/librenms/librenms.git/': Operation timed out after 300018 milliseconds with 0 out of 0 bytes received
Returned: 1
Updating Composer packages


  [Composer\Downloader\TransportException]
  The "https://getcomposer.org/versions" file could not be downloaded: SSL: Handshake timed out
  Failed to enable crypto
  failed to open stream: operation failed


self-update [-r|--rollback] [--clean-backups] [--no-progress] [--update-keys] [--stable] [--preview] [--snapshot] [--set-channel-only] [--] [<version>]

> LibreNMS\ComposerHelper::preInstall
Loading composer repositories with package information
Installing dependencies from lock file
Nothing to install or update
Generating optimized autoload files
> Illuminate\Foundation\ComposerScripts::postAutoloadDump
> @php artisan package:discover --ansi
Discovered Package: darkghosthunter/larapoke
Discovered Package: fideloper/proxy
Discovered Package: laravel/laravel
Discovered Package: laravel/tinker
Discovered Package: martinlindhe/laravel-vue-i18n-generator
Discovered Package: nesbot/carbon
Discovered Package: oriceon/toastr-5-laravel
Discovered Package: spatie/laravel-cors
Discovered Package: tightenco/ziggy
Package manifest generated successfully.
> LibreNMS\ComposerHelper::postInstall
> Illuminate\Foundation\ComposerScripts::postInstall
> @php artisan vue-i18n:generate --multi-locales --format=umd
Returned: 0
Updating SQL-Schema
Nothing to migrate.
Returned: 0
Updating submodules

Returned: 0
Cleaning up DB
Refreshing alert rules queries
Clearing OS cache
Refreshing device group table relationships
Syslog cleared for entries over 30 days (about 0 rows)
Authlog cleared for entries over 30 days
Perf times cleared for entries over 30 days
Device perf cleared for entries over 7 days
Deleting alert_logs more than 365 days that are not active
Deleting history of active alert_logs more than 365 days
All deleted ports now purged
Returned: 0
Fetching notifications
[ Thu, 05 Mar 2020 01:06:24 +0100 ] http://www.librenms.org/notifications.rss (1)
[ Thu, 05 Mar 2020 01:07:24 +0100 ] misc/notifications.rss (13)
[ Thu, 05 Mar 2020 01:07:24 +0100 ] Updating DB  Done
Returned: 0
Caching PeeringDB data
Peering DB integration disabled
Returned: 0

Thanks!

Could this issue be related to this error while accessing github? It is a proxy variable issue, as daily.sh works nicely when I run it manually:
librenms@no1-librenms-1:~/logs$ …/./daily.sh
Updating to latest codebase OK
Updating Composer packages OK
Updated from baee55a to 2c91724 OK
Updating SQL-Schema OK
Updating submodules OK
Cleaning up DB OK
Fetching notifications OK
Caching PeeringDB data OK

I have now set the environment variables in the service definition file, defining the proxy URL. Tomorrow, when the daily upgrade runs again, I will validate the change.
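
For the record, the change has roughly this shape (the unit name librenms.service and the proxy URL are placeholders for my environment):

# systemd drop-in handing the proxy to the dispatcher service
sudo mkdir -p /etc/systemd/system/librenms.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/librenms.service.d/proxy.conf
[Service]
Environment="http_proxy=http://proxy.example.local:3128"
Environment="https_proxy=http://proxy.example.local:3128"
EOF
sudo systemctl daemon-reload
sudo systemctl restart librenms.service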

Updated:

The problem didn't happen last night with discovery running with only 2 workers.
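
For reference, a sketch of how the worker count can be capped in config.php (option name taken from the dispatcher service docs; it can also be adjusted from the web UI poller settings):

# Cap parallel discovery jobs for the dispatcher service
# (value matches the "2 workers" mentioned above)
cat <<'EOF' | sudo tee -a /opt/librenms/config.php
$config['service_discovery_workers'] = 2;
EOF
sudo systemctl restart librenms.service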

Have you made any progress on this?
I am still having gaps, not every day, but maybe 2-3 times a week.

I am still facing the issue. Today I increased the snmpbulkwalk repeaters to speed up the polls; overall poller time was reduced by 20%, so let's see what happens. The problem is that these gaps usually appear in the early hours, so I have to wait until tomorrow.
In the meantime, I have asked the server team to scale up the VM. Today I have an all-in-one 16-CPU box monitoring 410 devices and over 20k ports, so it is very likely that a performance issue is also present.
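
For reference, the repeaters knob is the global SNMP setting in config.php (it can also be overridden per device in the web UI); a sketch with an example value:

# Raise the snmpbulkwalk max-repeaters globally (example value)
cat <<'EOF' | sudo tee -a /opt/librenms/config.php
$config['snmp']['max_repeaters'] = 30;
EOF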

I have around 1K devices, slowly moving to the distributed setup.

How many (v)CPUs do you have in place?
This is what my poller config looks like at the moment:

[screenshot: poller configuration]

Hello,
Confirming this issue for the dispatcher service without a distributed poller setup.
No gaps with crontab; a daily ~30-minute gap with the systemd service, during which the service just stops polling without any errors in the log.

It is an Intel® Xeon® CPU E5-2670.

I also found out that LibreNMS waits for all processes to complete, and that causes the gaps:

Mar 18 00:04:31 librenms librenms-service.py[50328]: maintenance(INFO):Restarting service...
...
Mar 18 00:10:51 librenms librenms-service.py[50328]: Discovery_0-4(INFO):Completed discovery run for 18 in 583.28s
Mar 18 00:10:51 librenms librenms-service.py[50328]: MainThread(INFO):Configuring LibreNMS service

I don’t know if it is expected behavior.

I think it's related to the daily maintenance, mainly due to the cleanup.

Might be, but as you can see in the logs above, there is a strict correlation between the completion of the last discovery process and the restart of librenms-service.