Today we started having an issue with our master LibreNMS server: MySQL won’t start on its own and has to be started manually.
Once it is started, the GUI comes up, but it is very slow and unresponsive.
Running ./validate.php takes forever.
However, running ./validate.php from one of the slave LibreNMS servers shows the following:
====================================
Component | Version
--------- | -------
LibreNMS | 1.58.1-52-g5015a49b6
DB Schema | 2020_01_09_1300_migrate_devices_attribs_table (153)
PHP | 7.4.1
MySQL | 10.1.43-MariaDB-0ubuntu0.18.04.1
RRDTool | 1.7.0
SNMP | NET-SNMP 5.7.3
====================================
[OK] Composer Version: 1.9.1
[OK] Dependencies up-to-date.
[OK] Database connection successful
[WARN] Your database schema has extra migrations (2019_10_21_105350_devices_group_perms,
2019_11_30_191013_create_mpls_tunnel_ar_hops_table,
2019_11_30_191013_create_mpls_tunnel_c_hops_table,
2019_12_01_165514_add_indexes_to_mpls_lsp_paths_table,
2020_01_09_1300_migrate_devices_attribs_table). If you just switched to the stable release from the
daily release, your database is in between releases and this will be resolved with the next release.
[FAIL] Database: extra column (devices/disable_notify)
[FAIL] Database: extra column (mpls_lsp_paths/mplsLspPathTunnelARHopListIndex)
[FAIL] Database: extra column (mpls_lsp_paths/mplsLspPathTunnelCHopListIndex)
[FAIL] Database: extra table (devices_group_perms)
[FAIL] Database: extra table (mpls_tunnel_ar_hops)
[FAIL] Database: extra table (mpls_tunnel_c_hops)
[FAIL] We have detected that your database schema may be wrong, please report the following to us on Discord (https://t.libren.ms/discord) or the community site (https://t.libren.ms/5gscd):
[FIX]:
Run the following SQL statements to fix.
SQL Statements:
ALTER TABLE `devices` DROP `disable_notify`;
ALTER TABLE `mpls_lsp_paths` DROP `mplsLspPathTunnelARHopListIndex`;
ALTER TABLE `mpls_lsp_paths` DROP `mplsLspPathTunnelCHopListIndex`;
DROP TABLE `devices_group_perms`;
DROP TABLE `mpls_tunnel_ar_hops`;
DROP TABLE `mpls_tunnel_c_hops`;
I’m not sure whether I actually have to drop those tables.
Was there any update today (or in the last few days) that might have caused the issue?
Has anybody else seen the same issue (starting today)?
Below is ./validate.php from the master server (it hasn’t finished yet):
====================================
Component | Version
--------- | -------
LibreNMS | 1.59-21-g944f38b7f
DB Schema | 2020_01_09_1300_migrate_devices_attribs_table (153)
PHP | 7.2.24-0ubuntu0.18.04.1
MySQL | 10.1.43-MariaDB-0ubuntu0.18.04.1
RRDTool | 1.7.0
SNMP | NET-SNMP 5.7.3
====================================
[OK] Composer Version: 1.9.1
[OK] Dependencies up-to-date.
[OK] Database connection successful
[OK] Database schema correct
[FAIL] The poller (librenms-master) has not completed within the last 5 minutes, check the cron job.
[FAIL] The poller (librenms-poller01) has not completed within the last 5 minutes, check the cron job.
[FAIL] The poller (librenms-poller02) has not completed within the last 5 minutes, check the cron job.
[WARN] Some devices have not been polled in the last 5 minutes. You may have performance issues.
If you are running systemd, you can enable a service with: systemctl enable servicename
To check the status: systemctl status servicename
The status output should show whether autostart is enabled or not.
I don’t think it’s a service enable issue.
We have had this server running for a long time now, and this issue only started today.
Once I start MySQL on the master server, the slave servers seem to be able to connect to it, but they show the table warnings shown above.
The master runs LibreNMS | 1.59-21-g944f38b7f (updated today),
while the remote pollers run LibreNMS | 1.58.1-52-g5015a49b6 (I don’t want to update them until the DB issue is fixed on the master poller).
Memory, CPU, and disk usage don’t show any issues.
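For anyone comparing a master and a remote poller like this, here is a rough Python sketch that pulls the "Component | Version" rows out of two ./validate.php outputs and lists the components that differ. The function names are made up for illustration, and the parsing only assumes the table layout pasted in this thread; it is not part of LibreNMS.

```python
import re


def component_versions(validate_output):
    """Parse 'Component | Version' rows from pasted ./validate.php output."""
    versions = {}
    for line in validate_output.splitlines():
        # A row looks like "LibreNMS | 1.59-21-g944f38b7f"; the dashed
        # separator row is skipped because it does not start with a letter.
        match = re.match(r"^\s*([A-Za-z][\w ]*?)\s*\|\s*(\S.*?)\s*$", line)
        if match and match.group(1) != "Component":
            versions[match.group(1)] = match.group(2)
    return versions


def version_mismatches(first, second):
    """Return {component: (first_version, second_version)} where they differ."""
    shared = first.keys() & second.keys()
    return {name: (first[name], second[name])
            for name in shared if first[name] != second[name]}
```

Feeding it the two outputs from this thread should flag the LibreNMS and PHP rows, which is a quick reminder that the pollers and the master are not running the same code against the shared database.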
Thanks for the clarification.
However, I’m more concerned with fixing the master server, since it polls the majority of our devices (the remote pollers are only for a branch office and poll no more than 20 machines).
The database log shows this line which grabbed my attention:
2020-01-13 14:46:37 139937723059328 [Note] InnoDB: innodb_empty_free_list_algorithm has been
changed to legacy because of small buffer pool size. In order to use backoff, increase buffer pool at
least up to 20MB.
An InnoDB buffer pool that is too small will cause more disk I/O, and if you have a lot of iowait on the system it can cause all sorts of issues. But if you don’t see iowait, then your I/O should be able to cope with the undersized buffer pool. Optimally you would want to have the whole database dataset in your buffer pool, so it should be sized accordingly.
I’m not sure how MySQL handles underruns, so it could be that the InnoDB storage engine didn’t handle it well and hence the service was shut down.
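To put a number on "sized accordingly", here is a small Python sketch of the arithmetic. It takes the dataset size in bytes (for example the sum of data_length + index_length from information_schema.tables for your LibreNMS schema) and the machine's RAM, and suggests a value in MiB. The 20% headroom and the 50% RAM cap are my own illustrative assumptions, not MariaDB or LibreNMS recommendations, and the function name is hypothetical.

```python
MIB = 1024 * 1024


def suggest_buffer_pool_mib(dataset_bytes, total_ram_bytes,
                            headroom_pct=20, max_ram_pct=50):
    """Suggest an innodb_buffer_pool_size in MiB.

    headroom_pct and max_ram_pct are assumptions for illustration only;
    tune them for a host that also runs the poller and rrdcached.
    """
    # Dataset plus headroom, rounded up to a whole MiB.
    wanted = dataset_bytes * (100 + headroom_pct) // 100
    wanted_mib = -(-wanted // MIB)
    # Never suggest more than the chosen share of RAM (rounded down).
    ceiling_mib = total_ram_bytes * max_ram_pct // 100 // MIB
    return max(1, min(wanted_mib, ceiling_mib))
```

With a 1 GiB dataset on a 16 GiB host this suggests 1229, which you would write as innodb_buffer_pool_size = 1229M in the server config and then confirm with SHOW VARIABLES LIKE 'innodb_buffer_pool_size' after a restart.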
Stop remote pollers for a few minutes and check if you still have issues with your main server.
Also, as the LibreNMS GUI says, check your librenms.log for errors that you think could be an issue.
About the CPU info, is that with MySQL up or down? And as @Elias correctly pointed out, check your iowait, but with that low CPU usage I don’t think it could be an issue.
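If you want to check iowait without installing anything, here is a rough Python sketch that works from two readings of the aggregate "cpu" line in /proc/stat on Linux. The field order (user, nice, system, idle, iowait, irq, softirq, steal) is the one documented for /proc/stat; the function names and the sampling interval are just illustrative choices.

```python
import time


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two 'cpu' lines."""
    # Use the first eight counters only; the guest fields that follow are
    # already included in user/nice and would be counted twice.
    first = [int(value) for value in before.split()[1:9]]
    second = [int(value) for value in after.split()[1:9]]
    deltas = [new - old for old, new in zip(first, second)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is iowait


def sample_iowait(interval_seconds=5, path="/proc/stat"):
    """Read /proc/stat twice, interval_seconds apart, and return iowait %."""
    with open(path) as stat_file:
        before = stat_file.readline()
    time.sleep(interval_seconds)
    with open(path) as stat_file:
        after = stat_file.readline()
    return iowait_percent(before, after)
```

Calling sample_iowait() a few times while the GUI is slow should show whether the disks are the bottleneck; a value that stays near zero points away from I/O.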
I have increased the buffer pool size and restarted the MySQL service. The GUI is back but very slow, and I can see that CPU usage is now very high, mainly from rrdcached processes.
I will wait for a few minutes and see if this drops slowly.
At this point I’m not sure yet whether the issue has been resolved.
Will update shortly.
Still troubleshooting.
I have reduced the number of concurrent poller-wrapper threads to 16 on each remote poller and 32 on the main poller, as I noticed that MySQL was hitting max connections. I have also disabled the Graylog integration for the time being, disabled some modules that I had enabled a few weeks ago, and am monitoring the server’s load.
I haven’t rebooted the server yet to check whether MySQL will start properly or will time out again and have to be started manually.
First I want to keep it running for a while and see if the aforementioned changes would keep the server load at bay.
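The connection arithmetic behind that thread reduction can be sketched in Python as below. It assumes the three pollers listed in the validate output earlier in the thread, one database connection per wrapper thread (an assumption; a thread may hold more), and MariaDB's default max_connections of 151 (check yours with SHOW VARIABLES LIKE 'max_connections'). The function name is hypothetical.

```python
def worst_case_connections(threads_per_poller, conns_per_thread=1, other=0):
    """Upper bound on simultaneous DB connections from poller threads.

    conns_per_thread and other (web UI, alerting, etc.) are assumptions
    to adjust for your own setup.
    """
    return sum(threads_per_poller.values()) * conns_per_thread + other


# Thread counts as described in the post above.
pollers = {
    "librenms-master": 32,
    "librenms-poller01": 16,
    "librenms-poller02": 16,
}

MAX_CONNECTIONS = 151  # MariaDB default; verify on the actual server

needed = worst_case_connections(pollers)
headroom = MAX_CONNECTIONS - needed
```

With these numbers the pollers need at most 64 connections, leaving 87 spare under the default limit for the web UI and anything else that talks to the database.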
After updating all our LibreNMS servers (master and distributed pollers) to this version:
Component | Version
--------- | -------
LibreNMS | 1.59-29-g10b42137e
DB Schema | 2019_12_17_151314_add_invert_map_to_alert_rules (154)
PHP | 7.4.1
MySQL | 10.1.43-MariaDB-0ubuntu0.18.04.1
RRDTool | 1.7.0
SNMP | NET-SNMP 5.7.3
all of them show the following failure in validation:
[WARN] Your database schema has extra migrations
(2019_12_17_151314_add_invert_map_to_alert_rules). If you just switched to the stable release from
the daily release, your database is in between releases and this will be resolved with the next release.
[FAIL] Database: extra column (alert_rules/invert_map)
[FAIL] We have detected that your database schema may be wrong, please report the following to us
on Discord (https://t.libren.ms/discord) or the community site (https://t.libren.ms/5gscd):
[FIX]:
Run the following SQL statements to fix.
SQL Statements:
ALTER TABLE `alert_rules` DROP `invert_map`;
Should I go ahead and run the command above in MySQL?
The validation warning regarding ALTER TABLE `alert_rules` DROP `invert_map` no longer appears after upgrading to version 1.59-39.
The load has also been back to a reasonable level since Thursday.