Juniper QFX Routing Engine State Alert

Hi all,

We’ve been experiencing a strange issue ever since this pull request was merged: “Improve Junos state sensor discovery” by Rosiak (librenms/librenms pull request #9426).

All of our single-member (non-virtual-chassis) QFX5100/5110s have been alerting for routing engine down. Each one has a “Routing Engine 0” sensor and a “Routing Engine” sensor, and the “Routing Engine” sensor is reported as unknown for some reason. All of the virtual chassis QFXs have a “Routing Engine 0” and a “Routing Engine 1”, so I think that’s why they’re not affected. I attached a screenshot showing that.

It does look like “Routing Engine” is a real thing at least on the QFX because I can see it when doing show snmp mib walk commands:

show snmp mib walk jnxFruState
jnxFruState.2.1.1.0 = 6
jnxFruState.2.1.2.0 = 6
jnxFruState.4.1.1.0 = 6
jnxFruState.4.1.2.0 = 6
jnxFruState.4.1.3.0 = 6
jnxFruState.4.1.4.0 = 6
jnxFruState.4.1.5.0 = 6
jnxFruState.7.1.0.0 = 6
jnxFruState.8.1.1.0 = 6
jnxFruState.8.1.2.0 = 2
jnxFruState.8.1.3.0 = 2
jnxFruState.8.1.4.0 = 2
jnxFruState.9.1.0.0 = 6
jnxFruState.9.2.0.0 = 1

show snmp mib walk jnxFruName
jnxFruName.2.1.1.0 = Power Supply 0 @ 0/0/*
jnxFruName.2.1.2.0 = Power Supply 1 @ 0/1/*
jnxFruName.4.1.1.0 = Fan Tray 0 @ 0/0/*
jnxFruName.4.1.2.0 = Fan Tray 1 @ 0/1/*
jnxFruName.4.1.3.0 = Fan Tray 2 @ 0/2/*
jnxFruName.4.1.4.0 = Fan Tray 3 @ 0/3/*
jnxFruName.4.1.5.0 = Fan Tray 4 @ 0/4/*
jnxFruName.7.1.0.0 = FPC: QFX5100-48S-6Q @ 0/*/*
jnxFruName.8.1.1.0 = PIC: 48x10G-6x40G @ 0/0/*
jnxFruName.8.1.2.0 = PIC:  @ 0/1/*
jnxFruName.8.1.3.0 = PIC:  @ 0/2/*
jnxFruName.8.1.4.0 = PIC:  @ 0/3/*
jnxFruName.9.1.0.0 = Routing Engine 0
jnxFruName.9.2.0.0 = Routing Engine
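
To make the two walks easier to read side by side, here is a minimal Python sketch that joins them on the shared index and decodes the numeric state. The helper name `parse_walk` and the inline sample strings are only illustrative (they are not part of LibreNMS), and the state names are taken from the jnxFruState enumeration in the Juniper MIB.

```python
# State names as listed in the JUNIPER-MIB jnxFruState enumeration
# (spelling kept exactly as the MIB has it).
FRU_STATES = {
    1: "unknown", 2: "empty", 3: "present", 4: "ready", 5: "announceOnline",
    6: "online", 7: "anounceOffline", 8: "offline", 9: "diagnostic", 10: "standby",
}


def parse_walk(text, prefix):
    """Turn 'prefix.<index> = <value>' lines into {index: value} (hypothetical helper)."""
    rows = {}
    for line in text.strip().splitlines():
        left, _, value = line.partition(" = ")
        if left.startswith(prefix + "."):
            rows[left[len(prefix) + 1:].strip()] = value.strip()
    return rows


# Only the routing engine rows from the walks, to keep the sample short.
names_walk = """
jnxFruName.9.1.0.0 = Routing Engine 0
jnxFruName.9.2.0.0 = Routing Engine
"""
states_walk = """
jnxFruState.9.1.0.0 = 6
jnxFruState.9.2.0.0 = 1
"""

names = parse_walk(names_walk, "jnxFruName")
states = parse_walk(states_walk, "jnxFruState")

# Join the two tables on the index they share.
report = {names[i]: FRU_STATES[int(states[i])] for i in names if i in states}
for name, state in report.items():
    print(f"{name}: {state}")
```

With the sample rows, the unnumbered “Routing Engine” entry is the one that decodes to unknown, which matches the sensor that is alerting.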

It looks like this was previously ignored in the old code. We’ve just gone ahead and turned off the check for that under edit, but I’m wondering if there is a more permanent fix.

Thanks!

Hi! Has there ever been a follow-up on this one by some other means? We have the same issue on both EX and QFX switches. Thanks

Although I agree that based on the MIB, the implementation is good:

jnxFruState OBJECT-TYPE
SYNTAX		INTEGER {
	unknown(1),
	empty(2),
	present(3),
	ready(4),
	announceOnline(5),
	online(6),
	anounceOffline(7),
	offline(8),
	diagnostic(9),
	standby(10)
}
MAX-ACCESS	read-only
STATUS		current
DESCRIPTION
	"The current state for this subject."
::= { jnxFruEntry 8 }

It seems that Juniper has, yet again, implemented this badly on their side… Even so, it would be good not to get a false-positive alert because of it.

I’ve opened a case with Juniper to try to understand the reason behind this plain “Routing Engine” that doesn’t even show up in the inventory, as well as get their official reasoning behind the fact that it’s reported as “unknown”…

Hopefully I’ll have something meaningful that could best direct the possible next steps.

Thanks

Update:
For the curious: the case is still pending after escalations, and a PR (problem report) has been created. Hopefully I’ll have some news soon.

I noted this too: Spurious failed routing engine from some JunOS devices

jnxFruSlot should be queried to get the valid slot numbers in jnxFruTable. Juniper has this documented for the MX and EX9600, and I’ve found it to be true for the QFXs and EXs I have (the slot number is negative for invalid REs).

https://www.juniper.net/documentation/en_US/junos/topics/reference/general/virtual-chassis-mx-series-slot-numbers-for-snmp.html

The old code worked because it filtered out known bad entries - a missing RE will always be called “Routing Engine”, but a present one will be called “Routing Engine X”.
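
As a rough illustration of that name-based heuristic, here is a short Python sketch. It is not the actual LibreNMS code (which is PHP), and the function name `is_placeholder_re` is made up for the example. It assumes, as described above, that a present RE always carries a trailing number in its name.

```python
import re

# A present RE is named "Routing Engine <number>"; the placeholder has no number.
NUMBERED_RE = re.compile(r"^Routing Engine \d+$")


def is_placeholder_re(fru_name):
    """Return True for the unnumbered 'Routing Engine' entry (hypothetical helper)."""
    name = fru_name.strip()
    return name.startswith("Routing Engine") and not NUMBERED_RE.match(name)


fru_names = ["Routing Engine 0", "Routing Engine", "Fan Tray 0 @ 0/0/*"]
kept = [n for n in fru_names if not is_placeholder_re(n)]
print(kept)
```

This only matches on the name string, so it depends on Juniper keeping that naming consistent across platforms and releases.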

Thanks for your reply and sorry for not seeing it earlier. But indeed, what you have explained is exactly what Juniper has just finally confirmed to me:

So indeed, basically, if it’s VC capable but isn’t set up in a VC, it’ll show an “unnumbered” and “unknown” RE.

As you also mentioned, we could indeed fix this by validating jnxFruSlot during the discovery process, since the invalid entry reports a negative slot value, as in the example below, taken from a QFX5100:

JUNIPER-MIB::jnxFruName.9.1.0.0 = STRING: Routing Engine 0
JUNIPER-MIB::jnxFruName.9.2.0.0 = STRING: Routing Engine
JUNIPER-MIB::jnxFruSlot.9.1.0.0 = INTEGER: 0
JUNIPER-MIB::jnxFruSlot.9.2.0.0 = INTEGER: -1
JUNIPER-MIB::jnxFruState.9.1.0.0 = INTEGER: online(6)
JUNIPER-MIB::jnxFruState.9.2.0.0 = INTEGER: unknown(1)

The same also applies for EX switches that are not in a VC:

JUNIPER-MIB::jnxFruName.9.1.0.0 = STRING: Routing Engine 0
JUNIPER-MIB::jnxFruName.9.2.0.0 = STRING: Routing Engine
JUNIPER-MIB::jnxFruState.9.1.0.0 = INTEGER: online(6)
JUNIPER-MIB::jnxFruState.9.2.0.0 = INTEGER: unknown(1)
JUNIPER-MIB::jnxFruSlot.9.1.0.0 = INTEGER: 0
JUNIPER-MIB::jnxFruSlot.9.2.0.0 = INTEGER: -1
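
A minimal Python sketch of that jnxFruSlot check, run against the QFX5100 output above. This is only an illustration of the idea, not the LibreNMS discovery code; the helper names `parse_fru_walk` and `valid_frus` are hypothetical, and keeping rows that have no jnxFruSlot value at all is an assumption made for the example.

```python
import re

# Matches lines such as "JUNIPER-MIB::jnxFruSlot.9.2.0.0 = INTEGER: -1".
LINE = re.compile(r"^JUNIPER-MIB::(jnxFru\w+)\.(\S+) = (?:STRING|INTEGER): (.*)$")


def parse_fru_walk(text):
    """Group walk lines into {index: {column: value}} (hypothetical helper)."""
    table = {}
    for line in text.strip().splitlines():
        match = LINE.match(line.strip())
        if match:
            column, index, value = match.groups()
            table.setdefault(index, {})[column] = value
    return table


def valid_frus(table):
    """Drop rows with a negative jnxFruSlot; rows without a slot are kept (assumption)."""
    return {
        index: row
        for index, row in table.items()
        if int(row.get("jnxFruSlot", "0")) >= 0
    }


walk = """
JUNIPER-MIB::jnxFruName.9.1.0.0 = STRING: Routing Engine 0
JUNIPER-MIB::jnxFruName.9.2.0.0 = STRING: Routing Engine
JUNIPER-MIB::jnxFruSlot.9.1.0.0 = INTEGER: 0
JUNIPER-MIB::jnxFruSlot.9.2.0.0 = INTEGER: -1
JUNIPER-MIB::jnxFruState.9.1.0.0 = INTEGER: online(6)
JUNIPER-MIB::jnxFruState.9.2.0.0 = INTEGER: unknown(1)
"""

kept = valid_frus(parse_fru_walk(walk))
for index, row in kept.items():
    print(index, row["jnxFruName"], row["jnxFruState"])
```

Only the 9.1.0.0 row survives the filter, so no state sensor would be created for the unnumbered entry at 9.2.0.0.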

Hope this helps!

At least there’s a chance that they will clarify this on their end!

When I asked the following:

Could it be considered “safe” and consistent to do a validation of the “unnumbered” RE by looking at its jnxFruSlot value and that if it’s “-1”, it’s safe to discard it?

This is what they answered:

-1 is defined as null which can be equivalent to unknown for snmp walks. In the given scenario that there is no physical RE installed (or linecard/backup), it’s safe to assume that -1 also means unknown as it does for the QFX.

And they added this in regards to the overall information about the situation:

I’m not finding a KB on this so I will write one up – this will take a month or two for it to be approved but there will eventually be a KB on this. I’ve also requested that this is added to the pathfinder mib walk finder.

This will be KB34356, once published