Alert Rule SSD Wearout

DerDanilo · 1 May 2022 20:28

I am trying to write an alert rule for SSD wearout metrics gathered with the smart script.

So far this rule is not triggered at all. There is a test host with an NVMe drive that is currently at 85% wearout, i.e. 15% on the “SSD Life Left” indicator.

application_metrics.value <= 20 AND (application_metrics.metric REGEXP ".*SD Life Lef.*" OR application_metrics.metric LIKE '%_Left' OR application_metrics.app_id = 231)

Any idea on how to properly write this rule?

arrmo · 1 May 2022 22:08

Funny, but I was just digging into the same thing. I checked the database, and it looks like application_metrics.metric ends with _id231 … search for this, and then check the value vs. 100?

DerDanilo · 2 May 2022 09:18

That worked.

But how can we tell the system to ignore entries where the value is -nan? That is the case for HDDs.

We could build a rule that simply ignores everything that is 0 or below. But then it wouldn’t tell us when a disk has reached 0% of the calculated SSD Life Left (it would only alert down to 1%).

Warning
application_metrics.metric LIKE '%_id231' AND application_metrics.value <= 20 AND application_metrics.value > 0

Critical
application_metrics.metric LIKE '%_id231' AND application_metrics.value <= 5 AND application_metrics.value > 0
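The two thresholds above can be sketched outside LibreNMS as a small Python function. This is only an illustration of the rule logic, not LibreNMS code: the function name `classify_life_left` and the metric names used with it (e.g. `sda_id231`) are made up, and the only detail taken from the thread is that the metric name ends with `_id231`.

```python
from typing import Optional

WARNING_THRESHOLD = 20   # percent life left, taken from the Warning rule
CRITICAL_THRESHOLD = 5   # percent life left, taken from the Critical rule


def classify_life_left(metric: str, value: float) -> Optional[str]:
    """Return "critical", "warning" or None for a single metric row."""
    # Only SMART attribute 231 is considered (metric name ends with "_id231").
    if not metric.endswith("_id231"):
        return None
    # Values of 0 and below are skipped because HDDs end up as 0 in the DB.
    # Side effect: an SSD that really reached 0 % is skipped as well.
    if value <= 0:
        return None
    if value <= CRITICAL_THRESHOLD:
        return "critical"
    if value <= WARNING_THRESHOLD:
        return "warning"
    return None
```

Unlike the two separate rules, where a value of 5 or less matches both Warning and Critical, this sketch reports only the more severe level.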

Any idea on how to detect if it’s an HDD?

If the SMART output is ‘NULL’ it still writes ‘0’ to the DB. Hence we cannot check for that. Referencing previous values doesn’t work either, since they might not be there after changing a disk (when using serial numbers).
Combining it with a check for e.g. Airflow Temp (which does not exist on NVMes) doesn’t work either, since I don’t think there is a way to make the rule reference the Airflow Temp value of exactly the same disk; it could match the value of any disk.

I wish there was a plugin and a separate APP for SSD/NVMe wearout and spare alone.

arrmo · 2 May 2022 23:31

Yes, agreed on the wearout app. I checked my drives (in LibreNMS), and the HDDs are all 0 (like you say), and of course this could be the SSD value as well (i.e. it could get to zero). I’m not seeing a good way to detect the device type.

DerDanilo · 3 May 2022 08:11

We would need an app that either uses the existing SMART data but checks specific values that only exist on HDDs or on SSDs/NVMes, and then shows alerts.
Or a new app that does this entirely differently, e.g. using nvme-cli via an snmp extension script.

@Community: Please don’t ask us to submit a PR if we want this feature. I am no programmer nor do I have enough experience to write such tools. We can help with required feature design and testing though.

We could have one tool for wearout, spare left and flash drive health in general. The tool should use smart data and (smart)data from tools like nvme-cli to provide better device support.

Idea for flash media:

For SSDs:

Use smart data but do checks to detect SSDs and NVMes (Check e.g. if any value exists that is only provided from an HDD)

For NVMes (optional):
nvme smart-log -H -o json /dev/nvme0n1

Use nvme-cli via snmp extension
Use smart data from nvme-cli instead of smartctl data
Have better support for values like “spare left” via nvme-cli that smart data doesn’t provide in some cases
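As a rough sketch of what such an extension script could do with the JSON output of the command above, the following Python function extracts the wear and spare values. It is an illustration under assumptions, not an existing LibreNMS or nvme-cli component: the function name is made up, and the JSON keys `percent_used`, `avail_spare` and `spare_thresh` are what I believe nvme-cli emits with `-o json`, so they should be checked against the installed version (how `-H` changes the JSON is not covered here).

```python
import json


def parse_nvme_smart_log(json_text: str) -> dict:
    """Extract wear and spare information from `nvme smart-log -o json` output."""
    data = json.loads(json_text)

    # Assumed key names; verify them against your nvme-cli version.
    percent_used = data.get("percent_used")
    avail_spare = data.get("avail_spare")
    spare_thresh = data.get("spare_thresh")

    life_left = None
    if percent_used is not None:
        # "Percentage used" may exceed 100 on worn drives, so clamp at 0.
        life_left = max(0, 100 - int(percent_used))

    spare_low = None
    if avail_spare is not None and spare_thresh is not None:
        # True when the available spare has fallen below the drive's threshold.
        spare_low = int(avail_spare) < int(spare_thresh)

    return {
        "life_left": life_left,
        "avail_spare": avail_spare,
        "spare_low": spare_low,
    }
```

Fed with an invented sample such as `{"avail_spare": 100, "spare_thresh": 10, "percent_used": 85}`, this yields a `life_left` of 15 and `spare_low` of `False`. Missing keys result in `None` instead of 0, which avoids the ambiguity between "no data" and "0% left" discussed earlier in the thread.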

@arrmo What do you think?

ds-04 · 3 May 2022 20:20

See here for the Dell iDRAC mechanisms:

https://www.dell.com/support/manuals/en-uk/oth-r6515/idrac9_5.00.00.00_ug/ssd-wear-threshold?guid=guid-7865f46e-67c4-460d-9786-b6b9c181ee57&lang=en-us

Assume other vendors probably have something similar

arrmo · 3 May 2022 23:21

Yes, very much like that - and that’s a very helpful nvme command, thanks! Of course, other SSDs should be supported as well (i.e. SATA). I just checked here, and my (840 EVO) drive doesn’t include 231 - makes it a bit more messy, agreed?

DerDanilo · 4 May 2022 05:08

There are some devices that only support other values. For EVOs it seems that you want to check id177. A value of 95 there seems to mean 5% wearout.

What is more important is the available spare or replaced sectors values I think. We would need an application which knows about common values to check for ssd/nvme media health based on collected data from multiple sources but for the same device.

DerDanilo · 8 May 2022 17:36

Should be something like:

application_metrics.value <= 20 AND application_metrics.value > 0 AND (application_metrics.metric LIKE '%_id177' OR application_metrics.metric LIKE '%_id231')
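To try out this combined condition against a handful of rows before putting it into a rule, something like the following Python sketch could be used. The function name and the row format (pairs of metric name and value) are assumptions for illustration only; they do not reflect how LibreNMS evaluates rules internally.

```python
# SMART attributes treated as "life left" indicators in the rule above.
WEAR_ATTRIBUTE_SUFFIXES = ("_id177", "_id231")


def rows_to_alert(rows, threshold=20):
    """Return the (metric, value) pairs that the combined rule would match."""
    return [
        (metric, value)
        for metric, value in rows
        # str.endswith accepts a tuple, so both attribute IDs are checked at once.
        if metric.endswith(WEAR_ATTRIBUTE_SUFFIXES) and 0 < value <= threshold
    ]
```

One small difference to the rule: in SQL `LIKE`, the `_` is a single-character wildcard, so `'%_id231'` matches any character followed by `id231`, while `endswith` here requires a literal underscore.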
