We have a number of Intel servers running Windows Server that I monitor using both the built-in Windows SNMP service and the motherboard's BMC (via the IPMI support in LibreNMS).
IPMI works well and provides sensor data such as temperatures, fan speeds, voltages, power consumption and percentage of maximum power supply output.
However, one set of sensors that is not polled (and/or exposed) is power supply status.
In other words, if one power supply fails or loses power, there are no sensors in LibreNMS that show this or that I can set up alerts against.
Alerting on a "Power In" reading of zero is not workable, as most of the servers seem to operate the PSUs in hot standby mode rather than load balancing. One power supply carries the full load for about a day while the other shows zero watts, then they swap, and so on.
Checking with ipmitool, the BMC does provide power supply health status over IPMI. For example, here is all the PSU-related data returned by ipmitool with both power supplies powered:
PS1 Status | 0x0 | discrete | 0x0100| na | na | na | na | na | na
PS2 Status | 0x0 | discrete | 0x0100| na | na | na | na | na | na
PS1 Power In | 136.000 | Watts | ok | na | na | na | 868.000 | 920.000 | na
PS2 Power In | 4.000 | Watts | ok | na | na | na | 868.000 | 920.000 | na
PS1 Curr Out % | 16.000 | percent | ok | na | na | na | 100.000 | 112.000 | na
PS2 Curr Out % | 0.000 | percent | ok | na | na | na | 100.000 | 112.000 | na
PS1 Temperature | 31.000 | degrees C | ok | na | na | na | 60.000 | 65.000 | na
PS2 Temperature | 36.000 | degrees C | ok | na | na | na | 60.000 | 65.000 | na
PS1 Fan Fail | 0x0 | discrete | 0x0000| na | na | na | na | na | na
PS2 Fan Fail | 0x0 | discrete | 0x0000| na | na | na | na | na | na
And here is the same data with PS1 unplugged:
PS1 Status | 0x0 | discrete | 0x0900| na | na | na | na | na | na
PS2 Status | 0x0 | discrete | 0x0100| na | na | na | na | na | na
PS1 Power In | 0.000 | Watts | ok | na | na | na | 868.000 | 920.000 | na
PS2 Power In | 140.000 | Watts | ok | na | na | na | 868.000 | 920.000 | na
PS1 Curr Out % | 0.000 | percent | ok | na | na | na | 100.000 | 112.000 | na
PS2 Curr Out % | 16.000 | percent | ok | na | na | na | 100.000 | 112.000 | na
PS1 Temperature | 31.000 | degrees C | ok | na | na | na | 60.000 | 65.000 | na
PS2 Temperature | 35.000 | degrees C | ok | na | na | na | 60.000 | 65.000 | na
PS1 Fan Fail | 0x0 | discrete | 0x0000| na | na | na | na | na | na
PS2 Fan Fail | 0x0 | discrete | 0x0000| na | na | na | na | na | na
The status field (the fourth column) of the "PS1 Status" row changes from 0x0100 when the supply is good to 0x0900 when it is faulty/unpowered.
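As a stopgap outside LibreNMS, that field could be checked with something like the Python sketch below. It parses output in the format shown above and flags any "Status" row whose value differs from the 0x0100 seen on a healthy supply. The function names and the `HEALTHY_PSU_STATE` constant are made up for illustration. The bit names are an assumption on my part: they follow the IPMI sensor-specific offsets for the Power Supply sensor type, and I am assuming the high byte of the displayed value carries offsets 0 to 7, which is at least consistent with 0x0900 appearing when the supply is unplugged.

```python
# Hypothetical helper: parse `ipmitool sensor` style output and flag PSU
# status values that differ from the healthy value observed above.

HEALTHY_PSU_STATE = 0x0100  # observed on this board with both PSUs powered

# Assumption: bit meanings follow the IPMI sensor-specific offsets for the
# Power Supply sensor type, carried in the high byte of the displayed value.
PSU_STATE_BITS = {
    0: "presence detected",
    1: "power supply failure detected",
    2: "predictive failure",
    3: "power supply input lost (AC/DC)",
    4: "input lost or out of range",
    5: "input out of range, but present",
    6: "configuration error",
}


def parse_psu_status(sensor_output):
    """Return {sensor name: state as int} for discrete '... Status' rows."""
    states = {}
    for line in sensor_output.splitlines():
        fields = [f.strip() for f in line.split("|")]
        # Skip anything that is not a discrete sensor row.
        if len(fields) < 4 or fields[2] != "discrete":
            continue
        # Skip other discrete sensors such as "PS1 Fan Fail".
        if not fields[0].endswith("Status"):
            continue
        states[fields[0]] = int(fields[3], 16)
    return states


def describe_state(state):
    """List the (assumed) meanings of the bits set in the high byte."""
    high_byte = (state >> 8) & 0xFF
    return [name for bit, name in PSU_STATE_BITS.items() if high_byte & (1 << bit)]


# Sample rows copied from the output with PS1 unplugged.
sample = """\
PS1 Status | 0x0 | discrete | 0x0900| na | na | na | na | na | na
PS2 Status | 0x0 | discrete | 0x0100| na | na | na | na | na | na
PS1 Power In | 0.000 | Watts | ok | na | na | na | 868.000 | 920.000 | na
PS1 Fan Fail | 0x0 | discrete | 0x0000| na | na | na | na | na | na
"""

for name, state in parse_psu_status(sample).items():
    verdict = "ok" if state == HEALTHY_PSU_STATE else "FAULT"
    print(f"{name}: 0x{state:04x} {verdict} {describe_state(state)}")
```

Comparing against the known-good value rather than testing individual bits keeps the check tied to what was actually observed on this board; the decoded bit names are only there to make the alert text more readable.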
However, LibreNMS's IPMI poller doesn't seem to do anything with this data.
Is this a bug, or just something that hasn't been implemented yet?
I have seen a couple of other threads here with similar issues, but they are all old, locked threads that never reached a resolution.
The particular server I'm testing on above is an old Intel S2400GP which I've kept as a test bench machine; however, I see the same behaviour on newer Intel boards that have basically the same BMC.
Being able to set up alerts for PSU failure is of course quite desirable.
