MariaDB throwing OOM errors

  • Steps to reproduce an issue: issue occurs every day
    • The output of ./validate.php:
Component | Version
--------- | -------
LibreNMS  | 24.4.0 (2024-04-27T00:15:14-04:00)
DB Schema | 2024_04_22_161711_custom_maps_add_group (292)
PHP       | 8.1.2-1ubuntu2.17
Python    | 3.10.12
Database  | MariaDB 10.6.16-MariaDB-0ubuntu0.22.04.1
RRDTool   | 1.7.2
SNMP      | 5.9.1

[OK]    Composer Version: 2.7.6
[OK]    Dependencies up-to-date.
[OK]    Database connection successful
[OK]    Database Schema is current
[OK]    SQL Server meets minimum requirements
[OK]    lower_case_table_names is enabled
[OK]    MySQL engine is optimal
[OK]    Database and column collations are correct
[OK]    Database schema correct
[OK]    MySQL and PHP time match
[FAIL]  No active polling method detected
[OK]    Dispatcher Service not detected
[OK]    Locks are functional
[FAIL]  No active python wrapper pollers found
[OK]    Redis is unavailable
[WARN]  Could not check Python dependencies because this script is not running as librenms
        The install docs show how this is done on a new install:
[OK]    rrd_dir is writable
[OK]    rrdtool version ok
warning: Not a git repository. Use --no-index to compare two paths outside a working tree
usage: git diff --no-index [<options>] <path> <path>

Diff output format options
    -p, --patch           generate patch
    -s, --no-patch        suppress diff output
    -u                    generate patch
    -U, --unified[=<n>]   generate diffs with <n> lines context
    -W, --function-context
                          generate diffs with <n> lines context
    --raw                 generate the diff in raw format
    --patch-with-raw      synonym for '-p --raw'
    --patch-with-stat     synonym for '-p --stat'
    --numstat             machine friendly --stat
    --shortstat           output only the last line of --stat
    -X, --dirstat[=<param1,param2>...]
                          output the distribution of relative amount of changes for each sub-directory
    --cumulative          synonym for --dirstat=cumulative
                          synonym for --dirstat=files,param1,param2...
    --check               warn if changes introduce conflict markers or whitespace errors
    --summary             condensed summary such as creations, renames and mode changes
    --name-only           show only names of changed files
    --name-status         show only names and status of changed files
                          generate diffstat
    --stat-width <width>  generate diffstat with a given width
    --stat-name-width <width>
                          generate diffstat with a given name width
    --stat-graph-width <width>
                          generate diffstat with a given graph width
    --stat-count <count>  generate diffstat with limited lines
    --compact-summary     generate compact summary in diffstat
    --binary              output a binary diff that can be applied
    --full-index          show full pre- and post-image object names on the "index" lines
    --color[=<when>]      show colored diff
    --ws-error-highlight <kind>
                          highlight whitespace errors in the 'context', 'old' or 'new' lines in the diff
    -z                    do not munge pathnames and use NULs as output field terminators in --raw or --numstat
    --abbrev[=<n>]        use <n> digits to display object names
    --src-prefix <prefix>
                          show the given source prefix instead of "a/"
    --dst-prefix <prefix>
                          show the given destination prefix instead of "b/"
    --line-prefix <prefix>
                          prepend an additional prefix to every line of output
    --no-prefix           do not show any source or destination prefix
    --inter-hunk-context <n>
                          show context between diff hunks up to the specified number of lines
    --output-indicator-new <char>
                          specify the character to indicate a new line instead of '+'
    --output-indicator-old <char>
                          specify the character to indicate an old line instead of '-'
    --output-indicator-context <char>
                          specify the character to indicate a context instead of ' '

Diff rename options
    -B, --break-rewrites[=<n>[/<m>]]
                          break complete rewrite changes into pairs of delete and create
    -M, --find-renames[=<n>]
                          detect renames
    -D, --irreversible-delete
                          omit the preimage for deletes
    -C, --find-copies[=<n>]
                          detect copies
    --find-copies-harder  use unmodified files as source to find copies
    --no-renames          disable rename detection
    --rename-empty        use empty blobs as rename source
    --follow              continue listing the history of a file beyond renames
    -l <n>                prevent rename/copy detection if the number of rename/copy targets exceeds given limit

Diff algorithm options
    --minimal             produce the smallest possible diff
    -w, --ignore-all-space
                          ignore whitespace when comparing lines
    -b, --ignore-space-change
                          ignore changes in amount of whitespace
                          ignore changes in whitespace at EOL
    --ignore-cr-at-eol    ignore carrier-return at the end of line
    --ignore-blank-lines  ignore changes whose lines are all blank
    -I, --ignore-matching-lines <regex>
                          ignore changes whose all lines match <regex>
    --indent-heuristic    heuristic to shift diff hunk boundaries for easy reading
    --patience            generate diff using the "patience diff" algorithm
    --histogram           generate diff using the "histogram diff" algorithm
    --diff-algorithm <algorithm>
                          choose a diff algorithm
    --anchored <text>     generate diff using the "anchored diff" algorithm
    --word-diff[=<mode>]  show word diff, using <mode> to delimit changed words
    --word-diff-regex <regex>
                          use <regex> to decide what a word is
                          equivalent to --word-diff=color --word-diff-regex=<regex>
                          moved lines of code are colored differently
    --color-moved-ws <mode>
                          how white spaces are ignored in --color-moved

Other diff options
                          when run from subdir, exclude changes outside and show relative paths
    -a, --text            treat all files as text
    -R                    swap two inputs, reverse the diff
    --exit-code           exit with 1 if there were differences, 0 otherwise
    --quiet               disable all output of the program
    --ext-diff            allow an external diff helper to be executed
    --textconv            run external text conversion filters when comparing binary files
                          ignore changes to submodules in the diff generation
                          specify how differences in submodules are shown
                          hide 'git add -N' entries from the index
                          treat 'git add -N' entries as real in the index
    -S <string>           look for differences that change the number of occurrences of the specified string
    -G <regex>            look for differences that change the number of occurrences of the specified regex
    --pickaxe-all         show all changes in the changeset with -S or -G
    --pickaxe-regex       treat <string> in -S as extended POSIX regular expression
    -O <file>             control the order in which files appear in the output
    --rotate-to <path>    show the change in the specified path first
    --skip-to <path>      skip the output to the specified path
    --find-object <object-id>
                          look for differences that change the number of occurrences of the specified object
    --diff-filter [(A|C|D|M|R|T|U|X|B)...[*]]
                          select files by diff type
    --output <file>       Output to a specific file

[FAIL]  Failed to fetch version from local git: fatal: detected dubious ownership in repository at '/opt/librenms'
To add an exception for this directory, call:

        git config --global --add safe.directory /opt/librenms
[WARN]  Your local git branch is not master, this will prevent automatic updates.
        You can switch back to master with git checkout master
[FAIL]  You need to run this script as 'librenms' or root
XXXX@librenms:/opt/librenms$ sudo ./validate.php
[sudo] password for XXXX: 
Do not run validate.php as root

We are getting the following in the log file: the system runs out of memory and the MariaDB process gets killed. We have to reboot to get things working again.

```
May  9 09:00:28 librenms kernel: [161562.179900] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=php8.1-fpm.service,mems_allowed=0,global_oom,task_memcg=/system.slice/mariadb.service,task=mariadbd,pid=2866,uid=112
May  9 09:00:28 librenms kernel: [161562.183195] Out of memory: Killed process 2866 (mariadbd) total-vm:2754808kB, anon-rss:122352kB, file-rss:0kB, shmem-rss:0kB, UID:112 pgtables:1032kB oom_score_adj:0
May  9 09:00:34 librenms systemd[1]: mariadb.service: A process of this unit has been killed by the OOM killer.
May  9 09:00:39 librenms systemd[1]: mariadb.service: Main process exited, code=killed, status=9/KILL
May  9 09:00:39 librenms systemd[1]: mariadb.service: Failed with result 'oom-kill'.
May  9 09:00:39 librenms systemd[1]: mariadb.service: Consumed 12min 52.269s CPU time.
```

From that output I don't think we have enough info to say MariaDB itself threw an OOM error here. All it shows is that the system's oom-killer chose the MariaDB process to kill, probably because it had one of the highest OOM scores at the time. That score is driven mostly by memory usage, and with oom_score_adj at 0 nothing was protecting it.
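
If you want to see which processes the oom-killer would currently favor, here is a rough Python sketch that ranks them by the kernel's per-process score. It assumes a Linux box with the standard `/proc/<pid>/oom_score`, `oom_score_adj` and `comm` files; the function name `rank_by_oom_score` and the `proc_root` parameter are made up for illustration.

```python
import os


def rank_by_oom_score(proc_root="/proc", top=10):
    """Return (oom_score, oom_score_adj, pid, comm) tuples, highest score first."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "oom_score")) as f:
                score = int(f.read().strip())
            with open(os.path.join(base, "oom_score_adj")) as f:
                adj = int(f.read().strip())
            with open(os.path.join(base, "comm")) as f:
                comm = f.read().strip()
        except (OSError, ValueError):
            continue  # process exited mid-scan or a file was unreadable
        rows.append((score, adj, int(entry), comm))
    rows.sort(reverse=True)  # highest oom_score first
    return rows[:top]


if __name__ == "__main__" and os.path.isdir("/proc"):
    for score, adj, pid, comm in rank_by_oom_score():
        print(f"{score:>6} {adj:>6} {pid:>7} {comm}")
```

Run it a few times while the box is under load. Whatever sits at the top is the most likely victim, which is not necessarily the process that caused the memory pressure.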

You might be able to figure out what actually caused the OOM condition with more of the kernel log from around that timestamp. Usually somewhere in there you'll see a line like `<something> invoked oom-killer: ...`, and that something is more likely to be the app that requested more memory than was available, which is what triggered oom-killer. But regardless, the system ran out of memory and MariaDB got killed. If it's constantly running near max memory utilization and any random usage spike triggers this, then you probably just need to throw more memory or swap at it.
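
To make that log search less manual, here is a small Python sketch that scans a log file for both the "invoked oom-killer" line and the "Killed process" line. The regexes are based on the line format in the log above and on the usual kernel wording, so treat them as assumptions. The script name `oom_scan.py`, the function name `find_oom_events`, and the `/var/log/syslog` path in the usage comment are also illustrative guesses about your setup.

```python
import re
import sys

# e.g. "... kernel: [123.456] some-proc invoked oom-killer: gfp_mask=..."
# Note: a process name containing spaces would only be captured partially.
INVOKED_RE = re.compile(r"(?P<comm>\S+) invoked oom-killer")

# e.g. "Out of memory: Killed process 2866 (mariadbd) total-vm:...kB, anon-rss:122352kB"
KILLED_RE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\)"
    r".*?anon-rss:(?P<rss_kb>\d+)kB"
)


def find_oom_events(lines):
    """Return a list of (kind, process_name, anon_rss_kb_or_None) in log order."""
    events = []
    for line in lines:
        m = INVOKED_RE.search(line)
        if m:
            events.append(("invoked", m.group("comm"), None))
            continue
        m = KILLED_RE.search(line)
        if m:
            events.append(("killed", m.group("comm"), int(m.group("rss_kb"))))
    return events


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (path is an assumption): python3 oom_scan.py /var/log/syslog
    with open(sys.argv[1], errors="replace") as log:
        for kind, comm, rss_kb in find_oom_events(log):
            extra = f" (anon-rss {rss_kb} kB)" if rss_kb is not None else ""
            print(f"{kind:8} {comm}{extra}")
```

If the "invoked" process keeps being the same one across several days, that is the one to look at first. If it is different every time, the box is probably just short on memory overall.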