Discovery not completing [Solved]

Hi all,

I’m getting this output from validate.php:
[FAIL] Discovery has not completed in the last 24 hours, check the cron job

This started happening suddenly last week, I think. No changes had been made.

Here is the output of the /etc/cron.d/librenms file:

# Using this cron file requires an additional user on your system, please see install docs.

#33  */6   * * *   librenms    /opt/librenms/discovery.php -h all >> /dev/null 2>&1
33 */6 * * * librenms /opt/librenms/discovery-wrapper.php 4 >> /dev/null 2>&1
*/5  *    * * *   librenms    /opt/librenms/discovery.php -h new >> /dev/null 2>&1
*/5  *    * * *   librenms    /opt/librenms/cronic /opt/librenms/poller-wrapper.py 16
15   0    * * *   librenms    /opt/librenms/daily.sh >> /dev/null 2>&1
*    *    * * *   librenms    /opt/librenms/alerts.php >> /dev/null 2>&1
*/5  *    * * * librenms /opt/librenms/poll-billing.php >> /dev/null 2>&1
01   *    * * * librenms /opt/librenms/billing-calculate.php >> /dev/null 2>&1
*/5  *    * * * librenms /opt/librenms/check-services.php >> /dev/null 2>&1
*/5 * * * * root /opt/librenms/html/plugins/Weathermap/map-poller.php >> /dev/null 2>&1
15   0    * * * root cd /opt/librenms/scripts && php ./gen_smokeping.php > /etc/smokeping/librenms.conf && /usr/sbin/smokeping --reload >> /dev/null 2>&1

The first line has been there since install. I commented it out and added the second line just to see if it helped. It didn’t. :)

Any thoughts or help would be appreciated.

Thanks

Run this mysql query:

SELECT `device_id`,`hostname` FROM `devices` WHERE `last_discovered` <= DATE_ADD(NOW(), INTERVAL - 24 HOUR) AND `ignore` = 0 AND `disabled` = 0 AND `status` = 1;

Do you have more devices added than what is returned by that? If it’s the same amount then run (this will take a long time if you have a lot of devices):

./discovery.php -h all -v > /tmp/disco.txt &

If the query returns fewer devices than you have added, then run:

./discovery.php -h HOSTNAME -v > /tmp/disco.txt

Replace HOSTNAME with one from the output of the query.

When either run has finished, post the output of tail -10 /tmp/disco.txt.

Thanks @laf,

I have 119 devices on the All Devices web page and the query returns 116 records.
(I had to remove the ignore = 0 condition from the SQL query as it would not run with it included. However, I know that there is only 1 device that is ignored.)

Do I still need to run the discovery script?

What version of php?

It could be that one device is causing discovery.php to crash so I’d be inclined to run the full -h all as mentioned above - redirected to a file so we can see what’s gone on. Maybe run it in screen if you can’t stay connected to the shell until it’s finished.

The -h all run is in progress right now :)

PHP version:
PHP | 7.0.15-0ubuntu0.16.04.4

I’ll post the output once it has completed.

OK, so it seems that the script has not completed; the terminal I ran it from is stuck. The last output in that window is:
ifStackStatus: Unknown Object Identifier (Sub-id not found: (top) -> ifStackStatus)

From another terminal I ran the tail command and got this:

librenms@librenms:~$ tail -10 /tmp/disco.txt
Modules status: Global- OS  Device  Module [ mef ] disabled globally.

SQL[SELECT attrib_value FROM devices_attribs WHERE `device_id` = '1' AND `attrib_type` = 'poll_mib' ]
SQL[UPDATE `devices` set `last_discovered` =NOW(),`last_discovered_timetaken` ='1.801' WHERE `device_id` = '1']
Discovered in 1.801 seconds

SQL[INSERT INTO `perf_times` (`type`,`doing`,`start`,`duration`,`devices`,`poller`)  VALUES ('discover','all','1493106236.4222','1239.','116','librenms\n')]
./discovery.php all 2017-04-25 09:04:36 - 116 devices discovered in 1239. secs
SNMP: Get[1835/101.72s] Walk [4089/1022.22s]
MySQL: Cell[5686/2.68s] Row[2048/0.97s] Rows[4152/2.51s] Column[116/0.08s] Update[3079/1.69s] Insert[394/0.31s] Delete[254/0.12s]

That shows discovery as having completed, so I’m not sure why it’s being flagged as a problem. Does validate show it as OK now?

Strange, just ran validate again and still getting the error.

24 hours later that’s to be expected.

SELECT NOW();
SELECT @@system_time_zone;

Are those both right?

Well, that’s the thing. When I ran it this morning it wasn’t 24 hours since I manually ran discovery.
In fact, I just ran it again and then validated immediately after and got the same warning.

Both of those commands return the correct info, system time zone is BST and the time is correct.
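One way to double-check the clock question above is to compare the timestamp MySQL reports with the system’s own clock. This is only a minimal sketch, assuming GNU date (as shipped on Ubuntu); the drift_seconds helper is a made-up name for illustration and is not part of LibreNMS.

```shell
#!/bin/sh
# drift_seconds: hypothetical helper, not part of LibreNMS.
# Prints the absolute difference in seconds between two timestamps of the
# form "YYYY-MM-DD HH:MM:SS" (the format SELECT NOW() returns).
# Assumes a date command that understands -d (GNU date).
drift_seconds() {
    a=$(date -d "$1" +%s) || return 1
    b=$(date -d "$2" +%s) || return 1
    d=$((a - b))
    # Make the result non-negative regardless of argument order
    if [ "$d" -lt 0 ]; then
        d=$((0 - d))
    fi
    echo "$d"
}

# Possible usage (database credentials may need to be supplied):
#   drift_seconds "$(mysql -N -B -e 'SELECT NOW()')" "$(date '+%Y-%m-%d %H:%M:%S')"
```

A result of a few seconds would suggest the two clocks agree; a result near a whole number of hours would point at a time zone mismatch.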

Run UPDATE `devices` set `last_discovered` = NOW(), `last_discovered_timetaken` = '1.801' WHERE `device_id` = '1';

OK:

MariaDB [librenms]> UPDATE devices set last_discovered = NOW(), last_discovered_timetaken = '1.801' WHERE device_id= '1';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

MariaDB [librenms]> exit
Bye
librenms@librenms:~$ sudo ./validate.php
==========================================================
Component | Version
--------- | -------
LibreNMS  | cdd363f29ebcdb361ca207a379ff9fd450608d62
DB Schema | 186
PHP       | 7.0.15-0ubuntu0.16.04.4
MySQL     | 10.0.29-MariaDB-0ubuntu0.16.04.1
RRDTool   | 1.5.5
SNMP      | NET-SNMP 5.7.3
==========================================================

[OK]    Database connection successful
[OK]    Database schema correct
[FAIL]  Discovery has not completed in the last 24 hours, check the cron job

Just browsing the database it seems to be showing the correct data for last_discovered.

So it’s not that discovery is not running, it must be an error with the validate script?

No I’m wrong.

I had a look at the query that validate runs and ran it manually. It returned 2 devices that have not completed discovery recently: one for just over a day and one for a few weeks.

I manually ran discovery against both and now the error is gone.

Odd though, they are not ignored or disabled and one is our domain server, so I know it hasn’t been down.
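The manual re-run described above could be sketched as a small loop over the hostnames the query returns. This is a rough illustration only: rediscover_cmds is a hypothetical helper (not part of LibreNMS), the hostnames in the usage line are placeholders, and it only prints the discovery.php commands so they can be reviewed first.

```shell
#!/bin/sh
# rediscover_cmds: hypothetical dry-run helper, not part of LibreNMS.
# Reads hostnames (one per line) on stdin and prints the discovery.php
# command for each one instead of running it.
rediscover_cmds() {
    while IFS= read -r host; do
        # Skip blank lines in the input
        [ -n "$host" ] || continue
        printf './discovery.php -h %s -v\n' "$host"
    done
}

# Possible usage, with placeholder hostnames:
#   printf 'dc01.example.com\nsw1.example.com\n' | rediscover_cmds
```

Once the printed list looks right, it could be piped to sh from /opt/librenms as the librenms user to actually run the commands.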

So it’s back. I’m getting the “Discovery has not completed in the last 24 hours, check the cron job” error again.

Checking the DB, nearly all devices have not been discovered since the 26th or 27th of April.

So it must be that the cron job is not running?

Can someone sanity check the cron output above? To me it looks like it should work.

Thanks

Tom
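To see whether cron is starting the discovery jobs at all, one option is to look for them in the system log. This is a sketch under the assumption that cron logs job starts to /var/log/syslog, which is the usual default on Ubuntu; cron_discovery_runs is a hypothetical helper, not part of LibreNMS.

```shell
#!/bin/sh
# cron_discovery_runs: hypothetical helper, not part of LibreNMS.
# Reads syslog-style lines on stdin and keeps only the entries written by
# cron that mention a discovery script.
cron_discovery_runs() {
    grep 'CRON\[' | grep 'discovery'
}

# Possible usage (the log path is an assumption; it may differ on your system):
#   cron_discovery_runs < /var/log/syslog | tail -20
```

A matching entry only shows that cron started the job, not that it succeeded. Because the cron lines above send all output to /dev/null, any error the script prints is discarded, so temporarily redirecting that one job to a log file may show more.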

Just a friendly bump…

Ten lines of the log file don’t really tell us anything… It would be best to post the whole thing.

I would do.
But it contains private info, so I don’t want to post on the forum really.

Then you should replace the private data with generic data.
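As a rough illustration of replacing private data, the sketch below masks IPv4 addresses in the log with sed. The redact_ips name is hypothetical and not part of LibreNMS. It deliberately leaves longer dotted numbers such as SNMP OIDs alone, and it does not touch hostnames, SNMP communities or IPv6 addresses, which would need their own substitutions.

```shell
#!/bin/sh
# redact_ips: hypothetical helper, not part of LibreNMS.
# Replaces anything that looks like a standalone IPv4 address with x.x.x.x.
redact_ips() {
    # Four dot-separated groups of 1-3 digits that are not part of a longer
    # dotted number (so OIDs like .1.3.6.1.2.1 are left untouched).
    pat='(^|[^0-9.])([0-9]{1,3}\.){3}[0-9]{1,3}([^0-9.]|$)'
    # The same substitution is applied twice: when two addresses share a
    # single separator, the first pass consumes it and misses the second one.
    sed -E -e "s/$pat/\1x.x.x.x\3/g" -e "s/$pat/\1x.x.x.x\3/g"
}

# Possible usage:
#   redact_ips < /tmp/disco.txt > /tmp/disco-redacted.txt
```

It is worth skimming the redacted file before posting, since a pattern like this cannot know what else in the output counts as private.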

No problem.

How can I get it to you?
It’s 223,457 lines and pastebin says it’s too large…