Polling stops every day during daily.sh run

Recently, we have been seeing gaps in graphs at the same time most nights. After reviewing daily.log, I found the gaps always start a few minutes after daily.sh kicks off (4am for my install). When running daily.sh manually, polling stops during the “Cleaning up DB” section.

I noticed that traffic to our mariadb cluster (galera) drops to near nothing during the ~10 minute window when polling stops, so it seems like either the pollers are not reaching out to librenms, or the mariadb cluster stops responding (locked tables?).

I have been reviewing this thread: Daily gaps in graphs - Around 12-1am - #9 by kzaykov
and the related PR that was merged: Fix midnight poller data loss by TheMysteriousX · Pull Request #11582 · librenms/librenms · GitHub
I am running the librenms-docker image for distributed polling, which does include the psutil Python package, so that PR fix should already be in place.
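To double-check that, a quick test from inside the poller container (the container name here is just a placeholder for whatever your compose setup uses):

    # Placeholder container name; substitute your dispatcher/poller container.
    docker exec -it librenms_dispatcher python3 -c "import psutil; print(psutil.__version__)"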

Has anyone experienced this?

  • Steps to reproduce an issue:

    • run daily.sh as librenms user
    • wait for “Cleaning up DB” section
    • polling stops/gaps in graph start
    • daily.sh completes
    • polling starts again/gaps end
  • The output of ./validate.php:

        $ ./validate.php
        ====================================
        Component | Version
        --------- | -------
        LibreNMS  | 21.4.0
        DB Schema | 2021_04_08_151101_add_foreign_keys_to_port_group_port_table (208)
        PHP       | 7.4.3
        Python    | 3.8.5
        MySQL     | 10.5.9-MariaDB-1:10.5.9+maria~focal
        RRDTool   | 1.7.2
        SNMP      | NET-SNMP 5.8
        ====================================

        [OK]    Composer Version: 2.0.13
        [OK]    Dependencies up-to-date.
        [OK]    Database connection successful
        [FAIL]  Database: incorrect column (notifications/datetime)
        [FAIL]  Database: incorrect column (users/created_at)
        [FAIL]  We have detected that your database schema may be wrong, please report the following to us on Discord (https://t.libren.ms/discord) or the community site (https://t.libren.ms/5gscd):
                [FIX]:
                Run the following SQL statements to fix.
                SQL Statements:
                 ALTER TABLE `notifications` CHANGE `datetime` `datetime` timestamp NOT NULL DEFAULT '1970-01-02 00:00:00' ;
                 ALTER TABLE `users` CHANGE `created_at` `created_at` timestamp NOT NULL DEFAULT '1970-01-02 00:00:01' ;
        [INFO]  Detected Dispatcher Service
        [WARN]  Using database for distributed locking, you should set CACHE_DRIVER=redis
        [FAIL]  Dispatcher service is enabled on your cluster, but not in use on this node
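The redis warning at the bottom is separate from the gaps, but for reference this is roughly what it is asking for in .env (host and port here are placeholders for your own redis instance):

    CACHE_DRIVER=redis
    REDIS_HOST=redis.example.local
    REDIS_PORT=6379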

“Cleaning up DB” logs:

Cleaning up DB
Refreshing alert rules queries
Clearing OS cache
Refreshing device group table relationships
Syslog cleared for entries over 30 days (about 51000 rows)
Eventlog cleared for entries over 30 days
Authlog cleared for entries over 365 days
Device perf cleared for entries over 7 days
Deleting alert_logs more than 365 days that are not active
Deleting history of active alert_logs more than 365 days
Ports fdb cleared for entries over 10 days
Returned: 0

Aggregate bandwidth of all three mariadb servers during 4am poller gap (from netdata):

I’ve added some debug logs to the daily.sh script to see which command in the “Cleaning up DB” section is causing the gaps. I’ll follow up when I get more info.

daily.sh

# Wrapper from daily.sh, with DEBUG timestamps added before and after each
# daily.php function so the slow cleanup step shows up in the log.
call_daily_php() {
    local args

    args=("$@")

    for arg in "${args[@]}"; do
        echo "  DEBUG $(date) | Starting daily.php -f ${arg}"
        php "${LIBRENMS_DIR}/daily.php" -f "${arg}"
        echo "  DEBUG $(date) | Completing daily.php -f ${arg}"
    done
}
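The cleanup section then runs through the functions one at a time, roughly like this (function names taken from the log output further down):

    call_daily_php refresh_alert_rules refresh_os_cache refresh_device_groups syslog eventlog authlog device_perf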

Monitor the queries in MySQL to see what is taking so long. Likely something is refusing to get cleaned up…
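One way to do that, as a minimal sketch, is to watch the process list on one of the MariaDB nodes while daily.sh runs (assumes credentials in ~/.my.cnf; add -h/-u/-p as needed):

    # Refresh the list of running queries every 2 seconds on one galera node.
    watch -n 2 'mysql -e "SHOW FULL PROCESSLIST;"'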

Here’s the output from the daily.sh log:

  DEBUG Wed 12 May 2021 04:00:32 AM PDT | Starting daily.php -f refresh_alert_rules
Refreshing alert rules queries
  DEBUG Wed 12 May 2021 04:01:23 AM PDT | Completing daily.php -f refresh_alert_rules
  DEBUG Wed 12 May 2021 04:01:23 AM PDT | Starting daily.php -f refresh_os_cache
Clearing OS cache
  DEBUG Wed 12 May 2021 04:01:24 AM PDT | Completing daily.php -f refresh_os_cache
  DEBUG Wed 12 May 2021 04:01:24 AM PDT | Starting daily.php -f refresh_device_groups
Refreshing device group table relationships
  DEBUG Wed 12 May 2021 04:01:39 AM PDT | Completing daily.php -f refresh_device_groups
  DEBUG Wed 12 May 2021 04:01:39 AM PDT | Starting daily.php -f recalculate_device_dependencies
  DEBUG Wed 12 May 2021 04:02:32 AM PDT | Completing daily.php -f recalculate_device_dependencies
  DEBUG Wed 12 May 2021 04:02:32 AM PDT | Starting daily.php -f syslog
Syslog cleared for entries over 30 days (about 0 rows)
  DEBUG Wed 12 May 2021 04:06:55 AM PDT | Completing daily.php -f syslog
  DEBUG Wed 12 May 2021 04:06:55 AM PDT | Starting daily.php -f eventlog
  DEBUG Wed 12 May 2021 04:06:55 AM PDT | Completing daily.php -f eventlog
  DEBUG Wed 12 May 2021 04:06:55 AM PDT | Starting daily.php -f authlog
Authlog cleared for entries over 365 days
  DEBUG Wed 12 May 2021 04:08:53 AM PDT | Completing daily.php -f authlog
  DEBUG Wed 12 May 2021 04:08:53 AM PDT | Starting daily.php -f callback
  DEBUG Wed 12 May 2021 04:08:54 AM PDT | Completing daily.php -f callback
  DEBUG Wed 12 May 2021 04:08:54 AM PDT | Starting daily.php -f device_perf
Device perf cleared for entries over 7 days
  DEBUG Wed 12 May 2021 04:14:08 AM PDT | Completing daily.php -f device_perf
  DEBUG Wed 12 May 2021 04:14:08 AM PDT | Starting daily.php -f purgeusers
Could not connect to database, check logs/librenms.log.
  DEBUG Wed 12 May 2021 04:14:08 AM PDT | Completing daily.php -f purgeusers
  DEBUG Wed 12 May 2021 04:14:08 AM PDT | Starting daily.php -f bill_data
Could not connect to database, check logs/librenms.log.
  DEBUG Wed 12 May 2021 04:14:09 AM PDT | Completing daily.php -f bill_data
  DEBUG Wed 12 May 2021 04:14:09 AM PDT | Starting daily.php -f alert_log
Could not connect to database, check logs/librenms.log.
  DEBUG Wed 12 May 2021 04:14:10 AM PDT | Completing daily.php -f alert_log
  DEBUG Wed 12 May 2021 04:14:10 AM PDT | Starting daily.php -f rrd_purge
Could not connect to database, check logs/librenms.log.
  DEBUG Wed 12 May 2021 04:14:11 AM PDT | Completing daily.php -f rrd_purge
  DEBUG Wed 12 May 2021 04:14:11 AM PDT | Starting daily.php -f ports_fdb
Could not connect to database, check logs/librenms.log.
  DEBUG Wed 12 May 2021 04:14:11 AM PDT | Completing daily.php -f ports_fdb
  DEBUG Wed 12 May 2021 04:14:11 AM PDT | Starting daily.php -f route
Could not connect to database, check logs/librenms.log.
  DEBUG Wed 12 May 2021 04:14:12 AM PDT | Completing daily.php -f route
  DEBUG Wed 12 May 2021 04:14:12 AM PDT | Starting daily.php -f ports_purge
  DEBUG Wed 12 May 2021 04:14:12 AM PDT | Completing daily.php -f ports_purge
Returned: 0

Looks like the database drops out during the auth_log cleanup, and the device_perf cleanup. Those mostly line up with the bandwidth drops from the mariadb cluster:

@murrant I’m not sure what you mean here.

No, it is the stupid device_perf table… I suggest you truncate it.
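For anyone following along, a rough sketch of sizing up the cleanup targets before truncating (the librenms schema name is an assumption, and TRUNCATE discards every row immediately, with no retention window):

    # Show the ten largest tables in the librenms schema.
    mysql librenms -e "SELECT table_name, table_rows,
                              ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
                       FROM information_schema.tables
                       WHERE table_schema = 'librenms'
                       ORDER BY (data_length + index_length) DESC
                       LIMIT 10;"

    # The suggestion above: wipe device_perf in one shot.
    mysql librenms -e "TRUNCATE TABLE device_perf;"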

I have cut down device_perf retention to just 1 day, and the gaps are smaller but not gone.

I found in the daily.php code that the syslog purge splits the deletes into blocks of 1000 → here. I think device_perf and the other tables that get large with many devices would benefit from the same approach. I’m thinking of updating the lock_and_purge function to do block deletes, but I’m not sure how to select the last 1000 entries in each table. Any suggestions? I’m submitting a PR: WIP: Improve cleanup scripts by haydenseitz · Pull Request #12906 · librenms/librenms · GitHub
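As a standalone illustration of the block-delete idea (the timestamp column, 7-day interval, and librenms schema name are assumptions based on the log output above; the actual PR changes lock_and_purge in PHP rather than doing this from the shell):

    # Purge old device_perf rows 1000 at a time so no single DELETE holds
    # locks long enough to starve the pollers. ROW_COUNT() returns the rows
    # removed by the DELETE in the same session; stop once it reaches zero.
    while :; do
        deleted=$(mysql -N librenms -e \
            "DELETE FROM device_perf
             WHERE timestamp < DATE_SUB(NOW(), INTERVAL 7 DAY)
             LIMIT 1000;
             SELECT ROW_COUNT();")
        [ "$deleted" -eq 0 ] && break
        sleep 1   # brief pause so the pollers can get their own queries in
    done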

FYI I also reduced syslog retention from 30 days to 14 days, and that has helped close the gap for polling and SQL outages. If anyone else has issues with polling/alerts/etc. during the daily.sh run, bringing down data retention should help. My SQL traffic now only stops for about 1 minute, spread across two different windows (device_perf and syslog), and the RRD graphs have no gaps.
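If it helps anyone else, a sketch of lowering the retention windows (the option names and the lnms config:set command should be confirmed against the cleanup docs for your version; the values are just the ones I ended up using):

    # Keep less history so the nightly purge has less to delete in one go.
    lnms config:set syslog_purge 14
    lnms config:set device_perf_purge 1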

