[bug] iDRAC sensor discovery

We recently upgraded from 1.48.1 to 1.51. As soon as the upgrade finished and the first scheduled discovery ran, we started seeing an issue on all of our Dell iDRAC devices.

In includes/discovery/sensors/state/drac.inc.php there is:

        if ($value_oid == 'virtualDiskState') {
            $states = [
                ['value' => 1, 'generic' => 3, 'graph' => 0, 'descr' => 'unknown'],
                ['value' => 2, 'generic' => 0, 'graph' => 0, 'descr' => 'online'],
                ['value' => 3, 'generic' => 2, 'graph' => 0, 'descr' => 'failed'],
                ['value' => 4, 'generic' => 1, 'graph' => 0, 'descr' => 'degraded'],
            ];
        }

In includes/discovery/sensors/state/dell.inc.php there is:

        elseif ($state_name == 'virtualDiskState') {
            $states = [
                ['value' => 0, 'generic' => 3, 'graph' => 0, 'descr' => 'unknown'],
                ['value' => 1, 'generic' => 0, 'graph' => 1, 'descr' => 'ready'],
                ['value' => 2, 'generic' => 2, 'graph' => 1, 'descr' => 'failed'],
                ['value' => 3, 'generic' => 1, 'graph' => 1, 'descr' => 'online'],
                ['value' => 4, 'generic' => 2, 'graph' => 1, 'descr' => 'offline'],
                ['value' => 6, 'generic' => 2, 'graph' => 1, 'descr' => 'degraded'],
                ['value' => 7, 'generic' => 1, 'graph' => 1, 'descr' => 'verifying'],
                ['value' => 15, 'generic' => 1, 'graph' => 1, 'descr' => 'resynching'],
                ['value' => 16, 'generic' => 1, 'graph' => 1, 'descr' => 'regenerating'],
                ['value' => 18, 'generic' => 2, 'graph' => 1, 'descr' => 'failedRedundancy'],
                ['value' => 24, 'generic' => 1, 'graph' => 1, 'descr' => 'rebuilding'],
                ['value' => 26, 'generic' => 1, 'graph' => 1, 'descr' => 'formatting'],
                ['value' => 32, 'generic' => 1, 'graph' => 1, 'descr' => 'reconstructing'],
                ['value' => 35, 'generic' => 1, 'graph' => 1, 'descr' => 'initializing'],
                ['value' => 36, 'generic' => 1, 'graph' => 1, 'descr' => 'backgroundInit'],
                ['value' => 52, 'generic' => 2, 'graph' => 1, 'descr' => 'permanentlyDegraded'],
            ];
        }

Specifically, the failed and online states use different numeric representations (each matches its own MIB, so neither file is wrong on its own).
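To make the overlap concrete, here is a minimal Python sketch (not LibreNMS code) that transcribes the two arrays above into plain dictionaries and lists every raw value that both files define with a different meaning. The names `DRAC_VIRTUAL_DISK_STATE`, `DELL_VIRTUAL_DISK_STATE` and `conflicting_values` are made up for illustration.

```python
# State maps transcribed from the two PHP snippets: raw value -> description.
DRAC_VIRTUAL_DISK_STATE = {1: "unknown", 2: "online", 3: "failed", 4: "degraded"}
DELL_VIRTUAL_DISK_STATE = {
    0: "unknown", 1: "ready", 2: "failed", 3: "online", 4: "offline",
    6: "degraded", 7: "verifying", 15: "resynching", 16: "regenerating",
    18: "failedRedundancy", 24: "rebuilding", 26: "formatting",
    32: "reconstructing", 35: "initializing", 36: "backgroundInit",
    52: "permanentlyDegraded",
}


def conflicting_values(a, b):
    """Return {value: (descr_a, descr_b)} for values both maps define differently."""
    return {v: (a[v], b[v]) for v in sorted(a.keys() & b.keys()) if a[v] != b[v]}


conflicts = conflicting_values(DRAC_VIRTUAL_DISK_STATE, DELL_VIRTUAL_DISK_STATE)
for value, (drac, dell) in conflicts.items():
    print(f"value {value}: drac={drac!r} dell={dell!r}")
```

Every value the DRAC map defines (1 to 4) means something else in the DELL map, so a reading of 2 is "online" under one map and "failed" under the other.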

So after the upgrade, as soon as discovery completes, every virtual disk goes to the failed state.
The following patch works around the issue:

--- dell_old.inc.php    2019-05-24 15:44:51.323010700 +0300
+++ dell.inc.php        2019-05-24 15:45:11.233443600 +0300
@@ -18,7 +18,6 @@
     ['intrusionTable','.1.3.6.1.4.1.674.10892.1.300.70.1.5.','intrusionStatus','Intrusion','MIB-Dell-10892'],
     ['controllerTable','.1.3.6.1.4.1.674.10893.1.20.130.1.1.5.','controllerState','controllerName','StorageManagement-MIB'],
     ['arrayDiskTable','.1.3.6.1.4.1.674.10893.1.20.130.4.1.4.','arrayDiskState','arrayDiskName','StorageManagement-MIB'],
-    ['virtualDiskTable','.1.3.6.1.4.1.674.10893.1.20.140.1.1.4.','virtualDiskState','virtualDiskDeviceName','StorageManagement-MIB'],
     ['batteryTable','.1.3.6.1.4.1.674.10893.1.20.130.15.1.4.','batteryState','batteryName','StorageManagement-MIB'],
 ];

@@ -72,25 +71,6 @@
                 ['value' => 22, 'generic' => 2, 'graph' => 0, 'descr' => 'incompatible'],
                 ['value' => 23, 'generic' => 2, 'graph' => 0, 'descr' => 'readOnly'],
             ];
-        } elseif ($state_name == 'virtualDiskState') {
-            $states = [
-                ['value' => 0, 'generic' => 3, 'graph' => 0, 'descr' => 'unknown'],
-                ['value' => 1, 'generic' => 0, 'graph' => 1, 'descr' => 'ready'],
-                ['value' => 2, 'generic' => 2, 'graph' => 1, 'descr' => 'failed'],
-                ['value' => 3, 'generic' => 1, 'graph' => 1, 'descr' => 'online'],
-                ['value' => 4, 'generic' => 2, 'graph' => 1, 'descr' => 'offline'],
-                ['value' => 6, 'generic' => 2, 'graph' => 1, 'descr' => 'degraded'],
-                ['value' => 7, 'generic' => 1, 'graph' => 1, 'descr' => 'verifying'],
-                ['value' => 15, 'generic' => 1, 'graph' => 1, 'descr' => 'resynching'],
-                ['value' => 16, 'generic' => 1, 'graph' => 1, 'descr' => 'regenerating'],
-                ['value' => 18, 'generic' => 2, 'graph' => 1, 'descr' => 'failedRedundancy'],
-                ['value' => 24, 'generic' => 1, 'graph' => 1, 'descr' => 'rebuilding'],
-                ['value' => 26, 'generic' => 1, 'graph' => 1, 'descr' => 'formatting'],
-                ['value' => 32, 'generic' => 1, 'graph' => 1, 'descr' => 'reconstructing'],
-                ['value' => 35, 'generic' => 1, 'graph' => 1, 'descr' => 'initializing'],
-                ['value' => 36, 'generic' => 1, 'graph' => 1, 'descr' => 'backgroundInit'],
-                ['value' => 52, 'generic' => 2, 'graph' => 1, 'descr' => 'permanentlyDegraded'],
-            ];
         } elseif ($state_name == 'batteryState') {
             $states = [
                 ['value' => 1, 'generic' => 0, 'graph' => 0, 'descr' => 'ready'],

Unfortunately, I could not find a way to resolve this properly in code. iDRAC devices expose sensors from both the DRAC and DELL MIBs, so there is no point in dropping the DELL MIBs for DRAC devices. Another approach is needed, which I could not figure out…

It should follow the state sensor documentation at https://docs.librenms.org/Support/Device-Sensors/.

There is no issue with mapping itself. The issue is that:
DRAC value 2 for virtualDiskState = online
DELL value 2 for virtualDiskState = failed.

Both mappings are used in discovery, but they conflict with each other. Again, this only appeared after the upgrade.

Those definitions seem correct as per the MIBs. They are two different sensors with different definitions.

I don’t understand why you are referencing both when referring to one device. One device can only have one OS.

After a week and a half of searching, I finally tracked down where the false alerts come from. My initial report was close, but it did not identify the issue correctly. The root cause is the name of the sensor state mapping, which is identical in drac.inc.php and dell.inc.php: virtualDiskState. When two different devices are discovered, they write different information into the state_translations table under that same name.
In our case, it happens when a Dell PowerConnect 5548 device is discovered.
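The mechanism can be sketched in a few lines of Python. This is a simplified stand-in, not the LibreNMS discovery code: it assumes translations are stored per state name only, and the helpers `discover_state_sensor` and `describe` are hypothetical.

```python
# Hypothetical in-memory stand-in for state_indexes / state_translations:
# one translation set per state name, shared by every device.
translations_by_name = {}


def discover_state_sensor(state_name, states):
    """Mimic a discovery run registering its value -> description map.

    Assumption: the map is keyed by state name only, so a later discovery
    that uses the same name replaces the earlier device's map.
    """
    translations_by_name[state_name] = dict(states)


def describe(state_name, value):
    """Translate a raw sensor reading using whatever map is currently stored."""
    return translations_by_name[state_name].get(value, "unknown")


# iDRAC server discovered first (values from drac.inc.php).
discover_state_sensor(
    "virtualDiskState", {1: "unknown", 2: "online", 3: "failed", 4: "degraded"}
)
before = describe("virtualDiskState", 2)

# A device handled by dell.inc.php is discovered later (first few values only).
discover_state_sensor(
    "virtualDiskState",
    {0: "unknown", 1: "ready", 2: "failed", 3: "online", 4: "offline"},
)
after = describe("virtualDiskState", 2)

print(before, "->", after)
```

The iDRAC sensor still reports 2, but its meaning flips from online to failed as soon as the other device registers its map under the same name.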

Before Dell PowerConnect 5548 discovery:

MariaDB [librenms]> select * from state_translations where state_index_id=28530;
+----------------------+----------------+-------------+------------------+-------------+---------------------+---------------------+
| state_translation_id | state_index_id | state_descr | state_draw_graph | state_value | state_generic_value | state_lastupdated   |
+----------------------+----------------+-------------+------------------+-------------+---------------------+---------------------+
|                  141 |          28530 | unknown     |                0 |           1 |                   3 | 2019-06-03 14:13:00 |
|                  142 |          28530 | online      |                0 |           2 |                   0 | 2019-06-03 14:13:00 |
|                  143 |          28530 | failed      |                0 |           3 |                   2 | 2019-06-03 14:13:00 |
|                  144 |          28530 | degraded    |                0 |           4 |                   1 | 2019-06-03 14:13:00 |
+----------------------+----------------+-------------+------------------+-------------+---------------------+---------------------+

MariaDB [librenms]> select * from state_indexes where state_index_id=28530;
+----------------+------------------+
| state_index_id | state_name       |
+----------------+------------------+
|          28530 | virtualDiskState |
+----------------+------------------+

MariaDB [librenms]> select count(*) from sensors_to_state_indexes where state_index_id=28530;
+----------+
| count(*) |
+----------+
|      107 |
+----------+

MariaDB [librenms]> select * from sensors where sensor_id=3171 \G;
*************************** 1. row ***************************
                sensor_id: 3171
           sensor_deleted: 0
             sensor_class: state
                device_id: 836
              poller_type: snmp
               sensor_oid: .1.3.6.1.4.1.674.10892.5.5.1.20.140.1.1.4.1
             sensor_index: 1
              sensor_type: virtualDiskState
             sensor_descr: Virtual Disk 0
                    group: NULL
           sensor_divisor: 1
        sensor_multiplier: 1
           sensor_current: 2
             sensor_limit: NULL
        sensor_limit_warn: NULL
         sensor_limit_low: NULL
    sensor_limit_low_warn: NULL
             sensor_alert: 1
            sensor_custom: No
         entPhysicalIndex: 1
entPhysicalIndex_measured: NULL
               lastupdate: 2019-01-07 23:57:37
              sensor_prev: 0
                user_func: NULL

After Dell PowerConnect 5548 discovery:

MariaDB [librenms]> select * from state_translations where state_index_id=28530;
+----------------------+----------------+---------------------+------------------+-------------+---------------------+---------------------+
| state_translation_id | state_index_id | state_descr         | state_draw_graph | state_value | state_generic_value | state_lastupdated   |
+----------------------+----------------+---------------------+------------------+-------------+---------------------+---------------------+
|                 2602 |          28530 | unknown             |                0 |           0 |                   3 | 2019-06-04 11:25:01 |
|                  141 |          28530 | ready               |                1 |           1 |                   0 | 2019-06-04 11:25:00 |
|                  142 |          28530 | failed              |                1 |           2 |                   2 | 2019-06-04 11:25:00 |
|                  143 |          28530 | online              |                1 |           3 |                   1 | 2019-06-04 11:25:00 |
|                  144 |          28530 | offline             |                1 |           4 |                   2 | 2019-06-04 11:25:00 |
|                 2603 |          28530 | degraded            |                1 |           6 |                   2 | 2019-06-04 11:25:01 |
|                 2604 |          28530 | verifying           |                1 |           7 |                   1 | 2019-06-04 11:25:01 |
|                 2605 |          28530 | resynching          |                1 |          15 |                   1 | 2019-06-04 11:25:01 |
|                 2606 |          28530 | regenerating        |                1 |          16 |                   1 | 2019-06-04 11:25:01 |
|                 2607 |          28530 | failedRedundancy    |                1 |          18 |                   2 | 2019-06-04 11:25:01 |
|                 2608 |          28530 | rebuilding          |                1 |          24 |                   1 | 2019-06-04 11:25:01 |
|                 2609 |          28530 | formatting          |                1 |          26 |                   1 | 2019-06-04 11:25:01 |
|                 2610 |          28530 | reconstructing      |                1 |          32 |                   1 | 2019-06-04 11:25:01 |
|                 2611 |          28530 | initializing        |                1 |          35 |                   1 | 2019-06-04 11:25:01 |
|                 2612 |          28530 | backgroundInit      |                1 |          36 |                   1 | 2019-06-04 11:25:01 |
|                 2613 |          28530 | permanentlyDegraded |                1 |          52 |                   2 | 2019-06-04 11:25:01 |
+----------------------+----------------+---------------------+------------------+-------------+---------------------+---------------------+

MariaDB [librenms]> select * from state_indexes where state_index_id=28530;
+----------------+------------------+
| state_index_id | state_name       |
+----------------+------------------+
|          28530 | virtualDiskState |
+----------------+------------------+

MariaDB [librenms]> select count(*) from sensors_to_state_indexes where state_index_id=28530;
+----------+
| count(*) |
+----------+
|      107 |
+----------+

MariaDB [librenms]> select * from sensors where sensor_id=3171 \G;
*************************** 1. row ***************************
                sensor_id: 3171
           sensor_deleted: 0
             sensor_class: state
                device_id: 836
              poller_type: snmp
               sensor_oid: .1.3.6.1.4.1.674.10892.5.5.1.20.140.1.1.4.1
             sensor_index: 1
              sensor_type: virtualDiskState
             sensor_descr: Virtual Disk 0
                    group: NULL
           sensor_divisor: 1
        sensor_multiplier: 1
           sensor_current: 2
             sensor_limit: NULL
        sensor_limit_warn: NULL
         sensor_limit_low: NULL
    sensor_limit_low_warn: NULL
             sensor_alert: 1
            sensor_custom: No
         entPhysicalIndex: 1
entPhysicalIndex_measured: NULL
               lastupdate: 2019-01-07 23:57:37
              sensor_prev: 0
                user_func: NULL
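The effect of the two table dumps on this sensor can be reproduced with an in-memory SQLite database. This is only a sketch: the table is reduced to the columns needed for the lookup, the helpers `load` and `resolve` are made up, and the comment on generic values assumes the usual LibreNMS convention (0 = OK, 2 = critical).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE state_translations ("
    " state_index_id INTEGER, state_descr TEXT,"
    " state_value INTEGER, state_generic_value INTEGER)"
)


def load(rows):
    """Replace the translations for index 28530 with the given rows."""
    conn.execute("DELETE FROM state_translations WHERE state_index_id = 28530")
    conn.executemany("INSERT INTO state_translations VALUES (28530, ?, ?, ?)", rows)


def resolve(sensor_current):
    """Look up (description, generic value) for a raw sensor reading."""
    return conn.execute(
        "SELECT state_descr, state_generic_value FROM state_translations"
        " WHERE state_index_id = 28530 AND state_value = ?",
        (sensor_current,),
    ).fetchone()


# Rows as listed before the PowerConnect discovery: (descr, value, generic).
load([("unknown", 1, 3), ("online", 2, 0), ("failed", 3, 2), ("degraded", 4, 1)])
before = resolve(2)

# First rows as listed after the PowerConnect discovery.
load([("unknown", 0, 3), ("ready", 1, 0), ("failed", 2, 2),
      ("online", 3, 1), ("offline", 4, 2)])
after = resolve(2)

# sensor_current stays at 2 in both dumps; only its translation changes.
print(before, after)
```

With sensor_current unchanged at 2, the lookup goes from online (generic 0) to failed (generic 2), which matches the false alerts.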

Switch device sensors+os module discovery in debug when code of /opt/librenms/includes/discovery/sensors/state/dell.inc.php is updated: https://gist.github.com/angryp/3ff8eda52a4db2203d2e2873e94ed6fe
Server device sensors+os module discovery in debug after switch device is discovered: https://gist.github.com/angryp/fa1911d066b800ef5225b56fd0ce88d9


Oh! the translations are over-written because they have the same name.

Nice detective work. We just need to rename one or both. (but that will cause the rrd file name to change I think)
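As a rough illustration of the rename idea, the sketch below gives each OS its own state name so the two maps can no longer collide. The `os.state` naming scheme and the helper `unique_state_name` are hypothetical; any real fix may choose different names, and whether RRD file names are affected is not shown here.

```python
# Hypothetical per-OS state names; the names chosen by an actual fix may differ.
def unique_state_name(os_name, state_name):
    """Build a state name that is distinct for each OS."""
    return f"{os_name}.{state_name}"


translations_by_name = {}
translations_by_name[unique_state_name("drac", "virtualDiskState")] = {
    2: "online", 3: "failed",
}
translations_by_name[unique_state_name("dell", "virtualDiskState")] = {
    2: "failed", 3: "online",
}

# Each device type now resolves a raw value of 2 through its own map.
drac_reading = translations_by_name["drac.virtualDiskState"][2]
dell_reading = translations_by_name["dell.virtualDiskState"][2]
print(drac_reading, dell_reading)
```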

Did this situation ever get resolved? I have exactly the same false positives for the same reason.

Hi @PhilipHalton @angryp

A pull request is now open on GitHub to solve this. It should not harm any graphs (only the state name changes, not the RRDs).

Please test it (using ./scripts/github-apply 10539) and let me know the results here (or on GitHub).

Thanx

Hi Pipo

Thank you for looking at this.

I ran ./scripts/github-apply 10539 and it applied the patches, but I am not seeing any effect after 24 hours. Both the Datastore and VD1 show a red failed status for the iDRAC in LibreNMS.

Do I need to do anything else?

Phil

@PhilipHalton You need to run a discovery of the device, which should have happened during the night. Can you check whether it ran properly?
We are not really sure that the RRDs will not be harmed, so be careful and run it on a test instance (or accept the risk of losing the historical data in the RRDs).

I didn’t see a discovery listed in the event log so I ran a discovery manually. Unfortunately, nothing has changed. I don’t mind losing RRD data for this one so do you think that deleting the device and re-adding would help?

Phil

We need to patch LibreNMS to correct this bug. The problem is that the fix will make all RRD data disappear for one of the two OSes (either drac or dell).

The current PR is useless for the moment. I will keep you updated here.

Hi @PhilipHalton
Could you please try #10539 again? The dell device should lose its history this time :slight_smile:

Steps :

  • git checkout includes/discovery/sensors/state/dell.inc.php
  • git checkout includes/discovery/sensors/state/drac.inc.php
  • ./scripts/github-apply 10539
  • rm cache/os_defs.cache
  • ./discovery.php -h yourDellDevice
  • ./discovery.php -h yourDracDevice

Let us know :slight_smile:

Ok, I am happy to apply that patch, but I should tell you about my experience so far after applying the last patch. I applied that patch about two weeks ago, and it appeared to make no difference. A few days ago, the issue seemed to cure itself and both data-stores were marked as green.
However, I have just looked at LibreNMS for that device (most of our Dell servers do not have local storage, so the issue only affects one device). The device is no longer showing most of the sensors that it displayed both before and after the patch was applied (including the two data-stores).
The event log is showing that those sensors were deleted earlier today.
(This is a short extract from the event log as an example:

|2019-09-02 01:00:06|sensor|idrac-esxi09|Sensor Deleted: state voltageProbeStatus 1.32 PS2 Voltage 2|System|
|2019-09-02 01:00:06|sensor|idrac-esxi09|Sensor Deleted: state amperageProbeStatus 1.1 PS1 Current 1|System|
|2019-09-02 01:00:06|sensor|idrac-esxi09|Sensor Deleted: state amperageProbeStatus 1.2 PS2 Current 2|System|
)
It looks as if it may be in the process of being rediscovered, so I will leave it 24 hours or so to see what it is doing. The other Dell servers without data-stores are showing their full complement of sensors.

I applied the patch this morning. It doesn’t appear to have deleted the history; however, about five hours later LibreNMS deleted the sensors on most of our servers. One server still shows 75 sensors; the others show only four or five, and these do not include the local datastores on the original problem server.

Did you follow the suggested procedure, including removing the cache and manually discovering the DRAC and DELL devices? If yes, there is no explanation for anything happening 5 hours later. Does the 5 hours match the time at which auto-discovery occurs?

Hi Pipo

I did exactly as described in your message. Below is an extract from the event log for one server. This is my cron file:

# Using this cron file requires an additional user on your system, please see install docs.

SHELL=/bin/bash
PATH=/opt/librenms:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

33 */6 * * * librenms /opt/librenms/cronic /opt/librenms/discovery-wrapper.py 1
*/5 * * * * librenms /opt/librenms/discovery.php -h new >> /dev/null 2>&1
*/5 * * * * librenms /opt/librenms/cronic /opt/librenms/poller-wrapper.py 16
* * * * * librenms /opt/librenms/alerts.php >> /dev/null 2>&1
*/5 * * * * librenms /opt/librenms/poll-billing.php >> /dev/null 2>&1
01 * * * * librenms /opt/librenms/billing-calculate.php >> /dev/null 2>&1
*/5 * * * * librenms /opt/librenms/check-services.php >> /dev/null 2>&1
*/5 * * * * librenms /opt/librenms/html/plugins/Weathermap/map-poller.php >> /dev/null $
*/5 * * * * librenms /opt/librenms/services-wrapper.py 1
* * * * * librenms /opt/librenms/ping.php >> /dev/null 2>&1

# Daily maintenance script. DO NOT DISABLE!
# If you want to modify updates:
#  Switch to monthly stable release: https://docs.librenms.org/General/Releases/

2019-09-05 12:57:33 sensor idrac-esxi06 Sensor Deleted: state voltageProbeStatus 1.17 CPU2 M01 VTT PG System
2019-09-05 12:57:33 sensor idrac-esxi06 Sensor Deleted: state voltageProbeStatus 1.18 System Board NDC PG System
2019-09-05 12:57:33 sensor idrac-esxi06 Sensor Deleted: state voltageProbeStatus 1.19 CPU2 M01 VPP PG System
2019-09-05 12:57:33 sensor idrac-esxi06 Sensor Deleted: state voltageProbeStatus 1.20 CPU1 M01 VPP PG System
2019-09-05 12:57:33 sensor idrac-esxi06 Sensor Deleted: state voltageProbeStatus 1.21 CPU2 M23 VDDQ PG System
2019-09-05 12:57:33 sensor idrac-esxi06 Sensor Deleted: state voltageProbeStatus 1.22 System Board 1.5V PG System
2019-09-05 12:57:33 sensor idrac-esxi06 Sensor Deleted: state voltageProbeStatus 1.23 System Board 1.5V AUX PG System
2

State sensor CPU1 M23 VDDQ PG has changed from (3) to (0) System
2019-09-05 12:57:01 State idrac-esxi06 State sensor CPU1 M23 VTT PG has changed from (3) to (0) System
2019-09-05 12:57:01 State idrac-esxi06 State sensor System Board 5V SWITCH PG has changed from (3) to (0) System
2019-09-05 12:57:01 State idrac-esxi06 State sensor System Board DIMM PG has changed from (3) to (0) System
2019-09-05 12:57:01 State idrac-esxi06 State sensor System Board VCCIO PG has changed from (3) to (0) System
2019-09-05 12:56:25 system idrac-esxi06 OS Version -> System
2019-09-05 12:56:25 system idrac-esxi06 Hardware -> System
2019-09-05 12:56:25 system idrac-esxi06 Serial -> System
2019-09-05 12:56:06 system idrac-esxi06 sysContact -> System
2019-09-05 12:56:06 system idrac-esxi06 sysObjectID -> System
2019-09-05 12:56:06 system idrac-esxi06 sysName -> System
2019-09-05 07:06:39 sensor idrac-esxi06 Sensor Added: state voltageProbeStatus 1.1 CPU1 VCORE PG System
2019-09-05 07:06:39 sensor idrac-esxi06 Sensor Added: state voltageProbeStatus 1.2 CPU2 VCORE PG System
2019-09-05 07:06:39 sensor idrac-esxi06 Sensor Added: state voltageProbeStatus 1.3 System Board 3.3V PG System
2019-09-05 07:06:39 sensor idrac-esxi06 Sensor Added: state voltageProbeStatus 1.4 System Board 5V AUX PG System

I missed the end of the cron file:

15 0 * * * librenms /opt/librenms/daily.sh >> /dev/null 2>&1

Can you give us a few more details :

  • Time of patch application
  • Time when discovery was manually run on this device

The sensors are deleted only during a discovery run. So if this was not the cron discovery (which, per your crontab, starts at minute 33 of every sixth hour, i.e. 00:33, 06:33, 12:33 and 18:33), did you manually run a discovery around 1 PM? What exact command line did you run?
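To check whether a deletion timestamp lines up with the scheduled discovery, the `33 */6 * * *` entry from the crontab can be expanded by hand. The Python sketch below does only that arithmetic; `runs_on` and `last_run_before` are illustrative helpers, and a close match does not by itself prove that the cron run deleted the sensors.

```python
from datetime import datetime, timedelta


def runs_on(day, minute, hour_step):
    """Expand a 'MINUTE */STEP * * *' cron entry into that day's run times."""
    return [day.replace(hour=h, minute=minute) for h in range(0, 24, hour_step)]


def last_run_before(event, minute, hour_step):
    """Most recent scheduled run at or before the event time."""
    midnight = event.replace(hour=0, minute=0, second=0, microsecond=0)
    candidates = runs_on(midnight - timedelta(days=1), minute, hour_step)
    candidates += runs_on(midnight, minute, hour_step)
    return max(t for t in candidates if t <= event)


# "Sensor Deleted" timestamp taken from the event log extract.
deleted_at = datetime(2019, 9, 5, 12, 57, 33)
previous = last_run_before(deleted_at, minute=33, hour_step=6)

print([t.strftime("%H:%M") for t in runs_on(datetime(2019, 9, 5), 33, 6)])
print(previous, deleted_at - previous)
```

For the 12:57:33 deletions the nearest scheduled start is 12:33 the same day, about 24 minutes earlier, so a wrapper run that was still working through the device list is one possible explanation.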

I used su librenms to change user and then:
git checkout includes/discovery/sensors/state/dell.inc.php
git checkout includes/discovery/sensors/state/drac.inc.php
./scripts/github-apply 10539
rm cache/os_defs.cache
./discovery.php -h 192.168.x.x
./discovery.php -h 192.168.x.x

The commands were run at roughly 9 am. I haven’t done a full discovery, but I did add a new device at about 1pm using the GUI.

I have just run the commands again, so I will see what happens. I won’t do anything else to the system.