Urgent Help - LibreNMS mysql won't start automatically

kalamchi75 · 13 January 2020 15:08

Hi Guys,

Today we started having an issue with our master LibreNMS server: MySQL won’t start automatically and needs to be started manually.
Once it is started, the GUI comes up, but it is very slow and unresponsive.

Running ./validate.php takes forever.

However, running ./validate.php from one of the slave LibreNMS servers shows the following:

====================================
Component | Version
--------- | -------
LibreNMS  | 1.58.1-52-g5015a49b6
DB Schema | 2020_01_09_1300_migrate_devices_attribs_table (153)
PHP       | 7.4.1
MySQL     | 10.1.43-MariaDB-0ubuntu0.18.04.1
RRDTool   | 1.7.0
SNMP      | NET-SNMP 5.7.3
====================================

[OK]    Composer Version: 1.9.1
[OK]    Dependencies up-to-date.
[OK]    Database connection successful
[WARN]  Your database schema has extra migrations (2019_10_21_105350_devices_group_perms,   
2019_11_30_191013_create_mpls_tunnel_ar_hops_table,   
2019_11_30_191013_create_mpls_tunnel_c_hops_table,
2019_12_01_165514_add_indexes_to_mpls_lsp_paths_table,  
2020_01_09_1300_migrate_devices_attribs_table). If you just switched to the stable release from the  
daily release, your database is in between releases and this will be resolved with the next release.
[FAIL]  Database: extra column (devices/disable_notify)
[FAIL]  Database: extra column (mpls_lsp_paths/mplsLspPathTunnelARHopListIndex)
[FAIL]  Database: extra column (mpls_lsp_paths/mplsLspPathTunnelCHopListIndex)
[FAIL]  Database: extra table (devices_group_perms)
[FAIL]  Database: extra table (mpls_tunnel_ar_hops)
[FAIL]  Database: extra table (mpls_tunnel_c_hops)
[FAIL]  We have detected that your database schema may be wrong, please report the following to us on Discord (https://t.libren.ms/discord) or the community site (https://t.libren.ms/5gscd):
        [FIX]:
        Run the following SQL statements to fix.
        SQL Statements:
     ALTER TABLE `devices` DROP `disable_notify`;
     ALTER TABLE `mpls_lsp_paths` DROP `mplsLspPathTunnelARHopListIndex`;
     ALTER TABLE `mpls_lsp_paths` DROP `mplsLspPathTunnelCHopListIndex`;
     DROP TABLE `devices_group_perms`;
     DROP TABLE `mpls_tunnel_ar_hops`;
     DROP TABLE `mpls_tunnel_c_hops`;

Not sure if I have to actually drop those tables.

Was there any update today (or in the last few days) that might have caused the issue?
Is anybody else seeing the same issue (starting today)?

Below is the ./validate.php output from the master server (it hasn’t finished yet):

====================================
Component | Version
--------- | -------
LibreNMS  | 1.59-21-g944f38b7f
DB Schema | 2020_01_09_1300_migrate_devices_attribs_table (153)
PHP       | 7.2.24-0ubuntu0.18.04.1
MySQL     | 10.1.43-MariaDB-0ubuntu0.18.04.1
RRDTool   | 1.7.0
SNMP      | NET-SNMP 5.7.3
====================================

[OK]    Composer Version: 1.9.1
[OK]    Dependencies up-to-date.
[OK]    Database connection successful
[OK]    Database schema correct
[FAIL]  The poller (librenms-master) has not completed within the last 5 minutes, check the cron job.
[FAIL]  The poller (librenms-poller01) has not completed within the last 5 minutes, check the cron job.
[FAIL]  The poller (librenms-poller02) has not completed within the last 5 minutes, check the cron job.
[WARN]  Some devices have not been polled in the last 5 minutes. You may have performance issues.

Kindly help, this is urgent.

Elias · 13 January 2020 15:13

If you are running systemd, you can enable a service with: systemctl enable servicename
To check the status: systemctl status servicename
The status output should show whether autostart is enabled or not.

kalamchi75 · 13 January 2020 15:15

Hi Elias,

I don’t think it’s a service enable issue.
We have had this server running for a long time, and this issue only started today.
Once I start MySQL on the master server, the slave servers seem to be able to connect to it, but they show the table warnings listed above.

thanks

TheGreatDoc · 13 January 2020 15:15

Check your mysql server host for CPU/Mem usage and for disk space

But first make sure you are running the same LibreNMS version on your master and your pollers.

From your validate, it looks like you are running different LibreNMS versions.
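One quick way to confirm the mismatch is to compare the LibreNMS row from each host's ./validate.php output. Below is a minimal Python sketch of that check; the helper names are made up for illustration, and the sample rows are the ones posted earlier in this thread.

```python
import re
from typing import Dict, Optional


def librenms_version(validate_output: str) -> Optional[str]:
    """Return the value of the 'LibreNMS' row in ./validate.php output, if present."""
    match = re.search(r"^LibreNMS\s*\|\s*(\S+)", validate_output, re.MULTILINE)
    return match.group(1) if match else None


def versions_match(outputs: Dict[str, str]) -> bool:
    """True only if every host reports a version and all versions are identical."""
    versions = [librenms_version(text) for text in outputs.values()]
    return None not in versions and len(set(versions)) == 1


# Rows copied from the validate output posted in this thread.
outputs = {
    "librenms-master": "LibreNMS  | 1.59-21-g944f38b7f\nPHP       | 7.2.24-0ubuntu0.18.04.1",
    "librenms-poller01": "LibreNMS  | 1.58.1-52-g5015a49b6\nPHP       | 7.4.1",
}

for host, text in outputs.items():
    print(host, librenms_version(text))
print("all hosts on the same version:", versions_match(outputs))
```

If the check reports a mismatch, bring the hosts to the same version before trusting the schema warnings from the older one.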

Elias · 13 January 2020 15:16

If the service crashed and didn’t autostart, you should check the MySQL logs for more info.

kalamchi75 · 13 January 2020 15:22

Hi,

The master runs LibreNMS | 1.59-21-g944f38b7f (updated today)
while the remote pollers run LibreNMS | 1.58.1-52-g5015a49b6 (I don’t want to update them until the DB issue is fixed on the master poller).

Memory, CPU, and disk usage don’t show any issues.

top - 15:20:57 up 44 min, 2 users, load average: 2.43, 3.44, 4.28
Tasks: 586 total, 1 running, 488 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.3 us, 0.2 sy, 0.0 ni, 79.0 id, 20.5 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 8167400 total, 4170676 free, 2382852 used, 1613872 buff/cache
KiB Swap: 2097148 total, 2097148 free, 0 used. 5481684 avail Mem

The machine runs in a VM with 8 CPU cores and 8GB RAM.

Disk space is also good

Filesystem      Size  Used Avail Use% Mounted on
udev            3.9G     0  3.9G   0% /dev
tmpfs           798M  808K  797M   1% /run
/dev/sda1       126G   35G   86G  29% /
tmpfs           3.9G     0  3.9G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           3.9G     0  3.9G   0% /sys/fs/cgroup
tmpfs           798M     0  798M   0% /run/user/1000

kalamchi75 · 13 January 2020 15:23

After a few minutes, the GUI shows an issue:

Elias · 13 January 2020 15:26

You should never run different versions on your pollers; all sorts of things might have changed in the database schema, and also in Redis/memcached etc.

If you run validate on an old poller it’s logical that it thinks the db schema isn’t correct since that schema was built for a newer version.

kalamchi75 · 13 January 2020 15:33

Hi Elias,

Thanks for the clarification.
However, I’m more concerned with fixing the master server, since it polls the majority of our devices (the remote pollers are only for a branch office and poll no more than 20 machines).

The database log shows this line which grabbed my attention:

 2020-01-13 14:46:37 139937723059328 [Note] InnoDB: innodb_empty_free_list_algorithm has been
changed to legacy because of small buffer pool size. In order to use backoff, increase buffer pool at 
least up to 20MB.

Would that be an issue?

Elias · 13 January 2020 15:37

An InnoDB buffer pool that is too small will cause more disk I/O, and if you have a lot of iowait on the system it can cause all sorts of issues. But if you don’t see iowait, then your I/O should be able to cope with the undersized buffer pool. Ideally you want the whole database dataset to fit in your buffer pool, so it should be sized accordingly.

I’m not sure how MySQL handles underruns, so it could be that the InnoDB storage engine didn’t handle it well and the service was shut down as a result.

See here for sizing info: https://scalegrid.io/blog/calculating-innodb-buffer-pool-size-for-your-mysql-server/
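To make the sizing idea concrete, here is a rough Python sketch. The 1.2 headroom factor and the 50% RAM cap are illustrative assumptions for a host that also runs the poller and web UI, not values taken from the linked article, and the 2 GiB dataset size is a placeholder. The real dataset size can be read with the query shown in the comment.

```python
# Dataset size can be read from the server with, for example:
#   SELECT SUM(data_length + index_length)
#   FROM information_schema.tables WHERE engine = 'InnoDB';

GIB = 1024 ** 3


def suggested_buffer_pool_bytes(dataset_bytes: int,
                                total_ram_bytes: int,
                                headroom: float = 1.2,
                                max_ram_fraction: float = 0.5) -> int:
    """Aim to hold the whole dataset plus some headroom, capped at a share of RAM.

    headroom and max_ram_fraction are assumptions for this sketch; tune them
    for your own host (a dedicated database server can afford a larger share).
    """
    wanted = int(dataset_bytes * headroom)
    cap = int(total_ram_bytes * max_ram_fraction)
    return min(wanted, cap)


total_ram = 8167400 * 1024   # "KiB Mem : 8167400 total" from the top output above
dataset = 2 * GIB            # hypothetical dataset size, replace with the query result

size = suggested_buffer_pool_bytes(dataset, total_ram)
print(f"innodb_buffer_pool_size = {size // (1024 ** 2)}M")
```

The resulting value would go into the innodb_buffer_pool_size setting in the server configuration, followed by a service restart.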

TheGreatDoc · 13 January 2020 15:48

Stop remote pollers for a few minutes and check if you still have issues with your main server.

Also, as the LibreNMS GUI says, check your librenms.log for errors that you think could be an issue.

About the CPU info, is that with MySQL up or down? And as @Elias correctly pointed out, check your iowait, but with that low CPU usage I don’t think it could be an issue.
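For reference, iowait is the "wa" field on the %Cpu(s) line of top, and the header posted earlier reports 20.5 there. A small Python sketch for pulling that value and the load averages out of a pasted top header follows; the function names are made up for illustration, and it assumes an English locale with dot decimals.

```python
import re
from typing import Optional, Tuple


def iowait_percent(top_output: str) -> Optional[float]:
    """Return the 'wa' (iowait) percentage from top's %Cpu(s) line, if found."""
    match = re.search(r"%Cpu\(s\):.*?([\d.]+)\s*wa", top_output)
    return float(match.group(1)) if match else None


def load_averages(top_output: str) -> Optional[Tuple[float, float, float]]:
    """Return the 1, 5 and 15 minute load averages from top's first line, if found."""
    match = re.search(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", top_output)
    if not match:
        return None
    return float(match.group(1)), float(match.group(2)), float(match.group(3))


# Header lines copied from the top output posted earlier in this thread.
TOP_HEADER = (
    "top - 15:20:57 up 44 min, 2 users, load average: 2.43, 3.44, 4.28\n"
    "%Cpu(s): 0.3 us, 0.2 sy, 0.0 ni, 79.0 id, 20.5 wa, 0.0 hi, 0.0 si, 0.0 st"
)

CORES = 8  # the VM is stated to have 8 CPU cores

print("iowait %:", iowait_percent(TOP_HEADER))
loads = load_averages(TOP_HEADER)
if loads is not None:
    print("1-minute load per core:", round(loads[0] / CORES, 2))
```

A low us/sy figure alongside a noticeable wa figure usually means the CPUs are waiting on disk rather than computing, which is why iotop is the natural next step.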

kalamchi75 · 13 January 2020 15:53

I have increased the buffer pool size and restarted the MySQL service. The GUI is back but very slow, and I can see that CPU usage is now very high, mainly from rrdcached processes:

top - 15:51:58 up  1:15,  3 users,  load average: 35.52, 30.64, 19.66
Tasks: 552 total,   1 running, 454 sleeping,   0 stopped,   0 zombie

I will wait a few minutes and see if this drops slowly.
At this point I’m not sure whether the issue has been resolved yet.
Will update shortly.

TheGreatDoc · 13 January 2020 16:51

Check your I/O with iotop.

TheGreatDoc · 14 January 2020 13:54

Did you find/fix your issue?

kalamchi75 · 14 January 2020 14:43

Hi,

Still troubleshooting.
I have reduced the number of concurrent poller-wrapper threads to 16 on each remote poller and 32 on the main poller, as I noticed that MySQL was hitting max connections. I have also disabled the Graylog integration for the time being, disabled some modules that I enabled a few weeks ago, and am monitoring the server’s load.
I haven’t rebooted the server yet to check whether MySQL will start properly or will time out again and have to be started manually.
First I want to keep it running for a while and see if the aforementioned changes would keep the server load at bay.
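As a back-of-the-envelope check on the connection limit, the thread counts above can be added up and compared with the server's max_connections. The sketch below assumes each poller-wrapper thread holds at most one database connection, which the thread does not confirm, and uses placeholder values for the limit and for other clients; the real limit comes from SHOW VARIABLES LIKE 'max_connections'.

```python
from typing import Dict


def peak_poller_connections(threads_per_host: Dict[str, int]) -> int:
    """Worst case if every poller thread on every host holds one connection."""
    return sum(threads_per_host.values())


def connection_headroom(max_connections: int,
                        threads_per_host: Dict[str, int],
                        other_clients: int = 0) -> int:
    """Connections left over after pollers and other clients (web UI, alerting, etc.)."""
    return max_connections - peak_poller_connections(threads_per_host) - other_clients


# Thread counts from the post above; host names from the validate output.
threads = {
    "librenms-master": 32,
    "librenms-poller01": 16,
    "librenms-poller02": 16,
}

MAX_CONNECTIONS = 100   # placeholder, read the real value from the server
OTHER_CLIENTS = 10      # placeholder allowance for the web UI and other jobs

print("peak poller connections:", peak_poller_connections(threads))
print("headroom:", connection_headroom(MAX_CONNECTIONS, threads, OTHER_CLIENTS))
```

A negative or near-zero headroom would suggest either lowering the thread counts further or raising max_connections.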

Thanks for the follow up man

kalamchi75 · 14 January 2020 14:53

Here is the CPU load of the main poller over the last 6 hours (monitored by another test LibreNMS server).

kalamchi75 · 16 January 2020 12:15

Hi Guys,

After updating all our LibreNMS servers (master and distributed pollers) to this version:

Component | Version
--------- | -------
LibreNMS  | 1.59-29-g10b42137e
DB Schema | 2019_12_17_151314_add_invert_map_to_alert_rules (154)
PHP       | 7.4.1
MySQL     | 10.1.43-MariaDB-0ubuntu0.18.04.1
RRDTool   | 1.7.0
SNMP      | NET-SNMP 5.7.3

All of them show the following failure in validation:

[WARN]  Your database schema has extra migrations    
(2019_12_17_151314_add_invert_map_to_alert_rules). If you just switched to the stable release from    
the daily release, your database is in between releases and this will be resolved with the next release.
[FAIL]  Database: extra column (alert_rules/invert_map)
[FAIL]  We have detected that your database schema may be wrong, please report the following to us
 on Discord (https://t.libren.ms/discord) or the community site (https://t.libren.ms/5gscd):
        [FIX]:
        Run the following SQL statements to fix.
        SQL Statements:
         ALTER TABLE `alert_rules` DROP `invert_map`;

Should I go ahead and run the command above in MySQL?

Kindly advise.

PipoCanaja · 17 January 2020 14:04

Please wait a week or two; this issue will resolve itself when the invert_map code is merged again.

kalamchi75 · 17 January 2020 15:30

Hi Pipo,

Thanks for the update. I will not make any changes in MySQL from my side then.

Regards

kalamchi75 · 20 January 2020 09:53

Hi,

The validation warning regarding ‘ALTER TABLE `alert_rules` DROP `invert_map`’ no longer appears after upgrading to version 1.59-39.
The load has also been back at a reasonable level since Thursday. See below:

I have rebooted the master poller and MySQL started with no issues.

I think we can close this case.

Thanks to everybody for their support.

Regards