Making Physical Memory graphs valid and worthwhile

micoots · 5 December 2019 23:01

Hi. I’ve recently created various Dashboards in LibreNMS and many of them show “physical memory” usage near 100%, where the physical machine has nowhere near that type of memory usage.

I’ve reviewed various links and items where this has been discussed in relation to LibreNMS, on GitHub:

https://github.com/librenms/librenms/issues/3179

https://github.com/librenms/librenms/issues/5660

on the forums here:

https://community.librenms.org/t/memory-pool/8539

and so on.

I personally believe the LibreNMS memory graphs for Linux are wrong and I’ll explain why. Note, I’m not trying to start a debate here, I just want to highlight practicality and why LibreNMS should have this fixed.

Generally speaking, the purpose of a monitoring solution is to alert an admin on problems that need attention. As admins, we typically set thresholds of 80% usage, 90% usage etc and get notifications in various transports, which alert us of things which need attention.

In the case of “memory usage” in Linux, I’m getting 100% usage for many of the Linux servers when in fact they are much less. So in essence, these particular checks and graphs are useless, since they don’t give a real world indication of memory usage and / or memory free on the real Linux server.

So in essence, I don’t agree with people on GitHub (like @paulgear) when it comes to a server monitoring solution, since (again, generally speaking) the primary reason admins use a server monitoring solution is not to see pretty graphs, but to be notified when things need attention. The pretty graphs, IMHO, come second to that.

We also employ other monitoring solutions (like OMDlabs / Nagios) which give the real memory usage and notify correctly, but my comments here are for LibreNMS. Making this product better and more useful is the reason I’m posting here.

Michael.

Kevin_Krumm · 6 December 2019 00:19

It would be nice if you could contribute code to help resolve the issue. What drives LibreNMS is people contributing time and code.

ionline · 6 December 2019 08:13

yeah would love to see a fix for this!

micoots · 9 December 2019 08:42

Hi Kevin. You’re assuming here I’m a coder, I’m not. I’m a sys admin with experience in bash, some perl, etc. I have no idea what LibreNMS is programmed in and how the internals work, how can I possibly code for it?

Kevin_Krumm · 9 December 2019 11:39

I’m saying you don’t have to be a programmer to help. I’m asking you to offer a solution instead of just talk. A lot of us here are not “programmers” but are passionate about LibreNMS and volunteer time and code.
All I have heard so far is talk about a possible issue but no solution and an excuse of “I’m a sys admin can’t help.”

PipoCanaja · 9 December 2019 20:06

And most importantly, @micoots, we should remember that SNMP is providing LibreNMS with the values. So if you don’t like the values you see, you need to ask the kernel and SNMP to provide better values, not the LibreNMS team at all…

LibreNMS does not “compute” those…

micoots · 9 December 2019 21:54

That’s a good response to drive people out of the community, thanks.

micoots · 9 December 2019 21:55

The issue as I see it is that the SNMP values it’s picking up are the “cached” values, which aren’t the correct ones in this instance.

PipoCanaja · 9 December 2019 22:13

Again, @micoots, if the “available” value is not provided by SNMP, there is nothing we can really do in an SNMP monitoring tool.
You can have a look here for a quick description of the OIDs:
http://www.debianadmin.com/linux-snmp-oids-for-cpumemory-and-disk-statistics.html
Then if you want some other value, as @Kevin_Krumm said, we are all network engineers, sysadmins, etc., and none of us, to my knowledge, are developers of the Linux SNMP implementation or of the kernel memory implementation. We have no choice other than using the values available there.
I can understand you expected another answer, but you have to understand there is no other answer you can get from a community-driven project. I am just like you, a LibreNMS user, not a dev, doing this in my free time, helping as much as I can. And I cannot rewrite the Linux kernel and the snmpd server right now …

micoots · 9 December 2019 22:42

OK thanks for the reference. I see in that link we have:

Memory Statistics

Total Swap Size: .1.3.6.1.4.1.2021.4.3.0
Available Swap Space: .1.3.6.1.4.1.2021.4.4.0
Total RAM in machine: .1.3.6.1.4.1.2021.4.5.0
Total RAM used: .1.3.6.1.4.1.2021.4.6.0
Total RAM Free: .1.3.6.1.4.1.2021.4.11.0
Total RAM Shared: .1.3.6.1.4.1.2021.4.13.0
Total RAM Buffered: .1.3.6.1.4.1.2021.4.14.0
Total Cached Memory: .1.3.6.1.4.1.2021.4.15.0

So I run an snmpwalk on that node and I see:

UCD-SNMP-MIB::memIndex.0 = INTEGER: 0
UCD-SNMP-MIB::memErrorName.0 = STRING: swap
UCD-SNMP-MIB::memTotalSwap.0 = INTEGER: 8388604 kB
UCD-SNMP-MIB::memAvailSwap.0 = INTEGER: 8207356 kB
UCD-SNMP-MIB::memTotalReal.0 = INTEGER: 16412812 kB
UCD-SNMP-MIB::memAvailReal.0 = INTEGER: 771748 kB
UCD-SNMP-MIB::memTotalFree.0 = INTEGER: 8979104 kB
UCD-SNMP-MIB::memMinimumSwap.0 = INTEGER: 16000 kB
UCD-SNMP-MIB::memShared.0 = INTEGER: 825640 kB
UCD-SNMP-MIB::memBuffer.0 = INTEGER: 136408 kB
UCD-SNMP-MIB::memCached.0 = INTEGER: 4724628 kB
UCD-SNMP-MIB::memSwapError.0 = INTEGER: noError(0)
UCD-SNMP-MIB::memSwapErrorMsg.0 = STRING:

So LibreNMS is picking these two values:

UCD-SNMP-MIB::memTotalReal.0 = INTEGER: 16412812 kB
UCD-SNMP-MIB::memAvailReal.0 = INTEGER: 771748 kB

How can I tell it to pick these values instead:

UCD-SNMP-MIB::memTotalReal.0 = INTEGER: 16412812 kB
UCD-SNMP-MIB::memTotalFree.0 = INTEGER: 8979104 kB

which would make more sense to me, i.e. graph the free memory instead (memTotalFree), which I could then generate an alert on if it goes down too much.

Thanks.

Michael.
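To put numbers on the two views being discussed, here is a minimal Python sketch using the snmpwalk figures quoted above. The function name is made up for illustration, and treating buffers and cache as reclaimable is an assumption, not a description of how LibreNMS builds its graph.

```python
def used_percent(total_kb, free_kb):
    """Return used memory as a percentage of total."""
    return 100.0 * (total_kb - free_kb) / total_kb

# Values from the snmpwalk output in this thread (kB).
mem_total_real = 16412812  # memTotalReal
mem_avail_real = 771748    # memAvailReal
mem_buffer = 136408        # memBuffer
mem_cached = 4724628       # memCached

# Only memAvailReal counts as free, so buffers and cache count as "used".
raw = used_percent(mem_total_real, mem_avail_real)

# Assumption: buffers and cache are treated as reclaimable, hence free.
adjusted = used_percent(mem_total_real, mem_avail_real + mem_buffer + mem_cached)

print(f"raw: {raw:.1f}%  adjusted: {adjusted:.1f}%")
```

With these figures the raw view comes out at about 95.3% used, while the adjusted view is about 65.7%. The adjusted figure is still only an estimate, since not all cached memory can necessarily be reclaimed.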

PipoCanaja · 9 December 2019 23:04

According to the doc:

memTotalFree: 
The total amount of memory free or available for use on this host. This value typically covers both real memory and swap space or virtual memory.

So this cannot be used in a “physical” metric, because it includes both physical and virtual memory.

The value we use now is exactly the one that provides the expected value for a physical metric.

memAvailReal:
The amount of real/physical memory currently unused or available.

So there is nothing that we can change for a “physical memory” metric.

You could indeed make your own graph that uses :

UCD-SNMP-MIB::memTotalReal.0 = INTEGER: 16412812 kB
UCD-SNMP-MIB::memTotalFree.0 = INTEGER: 8979104 kB

The tool is open source, so you can easily change the OID that is polled. But according to the doc, it makes no sense, because totalFree (physical + swap) can be greater than totalReal (physical). You could have more totalFree memory than the total physical memory (totalReal). That is most probably the case just after booting the host, I suppose. I don’t know what you could conclude from it.
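The point about memTotalFree mixing RAM and swap can be checked against the snmpwalk output posted earlier. This is a small Python sketch, where the helper name and the second, hypothetical host are assumptions for illustration only.

```python
# Values from the snmpwalk output quoted earlier in this thread (kB).
mem_total_real = 16412812  # memTotalReal
mem_avail_real = 771748    # memAvailReal
mem_avail_swap = 8207356   # memAvailSwap
mem_total_free = 8979104   # memTotalFree

# On this host, memTotalFree equals free RAM plus free swap.
print(mem_avail_real + mem_avail_swap == mem_total_free)

def free_as_percent_of_ram(total_free_kb, total_real_kb):
    """Express memTotalFree as a percentage of physical RAM (memTotalReal)."""
    return 100.0 * total_free_kb / total_real_kb

thread_host = free_as_percent_of_ram(mem_total_free, mem_total_real)

# Hypothetical host: 4 GiB RAM with 3 GiB free, plus 8 GiB of unused swap.
hypothetical = free_as_percent_of_ram(3145728 + 8388608, 4194304)

print(f"thread host: {thread_host:.1f}%  hypothetical host: {hypothetical:.1f}%")
```

For the host in this thread the sum matches exactly, and "free" comes out at about 54.7% of RAM even though only about 5% of RAM is unused. On the hypothetical host the same ratio is 275%, which is why the value does not work as a physical memory metric.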

Hannes_kruger · 29 January 2020 14:24

To solve this problem, I modified the file /opt/librenms/includes/polling/ucd-mib-inc.php and added the line

$memTotalReal = $memTotalReal + $memShared + $memBuffer;

on line 142, just before the line

$fields = array(

Kevin_Krumm · 29 January 2020 15:01

You should submit a Pull Request on GitHub so this change can help others.

seamanjeff · 28 July 2020 20:20

It’s probably a matter of LibreNMS not reporting the stats that are useful to you. If you’re monitoring Linux systems that are doing any kind of file service, you’re going to see high memory utilization due to buffer cache. Linux will use all available memory for buffer cache. This isn’t bad: it makes file access fast, and if the system needs the memory for a process’s working set it will just reclaim it.

So this graph, above, shows that all the memory is in use… that’s not unexpected since this is a NAS device.

This graph shows that there is plenty of memory free, because this system doesn’t do much file access.

What tool on your system is returning different memory numbers from what SNMP delivers?
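One way to compare what the host itself reports is to look at /proc/meminfo, where recent Linux kernels expose both MemFree and MemAvailable. Below is a minimal Python sketch of that comparison. The sample text is illustrative and not taken from any host in this thread, and the function names are made up for the example.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Name: value" shape
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values

def used_percent(total_kb, free_kb):
    """Return used memory as a percentage of total."""
    return 100.0 * (total_kb - free_kb) / total_kb

# Illustrative sample only; on a real host read open("/proc/meminfo").read().
SAMPLE = """MemTotal:        4000000 kB
MemFree:          400000 kB
MemAvailable:    3000000 kB
Buffers:          100000 kB
Cached:          2600000 kB
"""

info = parse_meminfo(SAMPLE)
by_free = used_percent(info["MemTotal"], info["MemFree"])
by_available = used_percent(info["MemTotal"], info["MemAvailable"])

print(f"used (MemFree view): {by_free:.1f}%")
print(f"used (MemAvailable view): {by_available:.1f}%")
```

In this made-up sample the MemFree view shows 90.0% used while the MemAvailable view shows 25.0%, which is the same kind of gap a file server with a large buffer cache would show.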

DerDanilo · 8 September 2020 09:57

But how do I configure LibreNMS or SNMP to correctly show this kind of memory usage?

I tried this. What is it supposed to show differently in the system?
This is not update-safe either, I guess.

BloodyIron · 1 October 2020 14:47

As a Sr Sys Admin and Architect I’ve found LibreNMS RAM/memory tracking (as in, non-permanent storage) to be not actually useful as it misrepresents the situation.

The majority of the systems I monitor are VMs, but I also monitor virtual hosts.

First, “Physical Memory”. The naming of this metric is completely useless. All memory is physical, even if it’s virtual, it corresponds to memory used in RAM DIMMs. So this really should be renamed to something far more self-evident as to what this is.

Second, Physical Memory is almost always 100% used on every system I have. This completely misrepresents what’s going on, as the monitoring appears to combine Linux cache usage with actual application RAM usage. Linux kernel behaviour is that it tends to use less cache as apps use more RAM, and/or push data into swap. This needs to be far more clearly spelled out.

Third, “Virtual Memory” also needs to be renamed to something actually self-evident of what this means. Remember, LibreNMS is designed to be OS agnostic in what it monitors. Furthermore, this value seems to combine actual total RAM capacity with swap capacity, which is completely useless since we also track swap in LibreNMS. I want to see here the actual RAM usage that applications use, because that’s what really impacts my environment. I have a node that actually has only 32GB of RAM installed, yet this metric reports it as 64GB capacity, and using 32GB of that, because it combines the swap with RAM, and also combines the app RAM usage with the Linux kernel cache, 100% misrepresenting the actual memory usage here.

I know that people have asked in this thread for what the real solution is in each case, and I don’t know what metric accurately represents that from one to the next. But as it sits, this behaviour completely defeats the point of having LibreNMS to monitor memory usage. The stats are useless, inaccurate and misleading. Can we please finally get this added as a priority to the development pipeline already? It’s been like this for years.

Kevin_Krumm · 2 October 2020 13:21

Here is the issue: there are no developers. It’s all driven by volunteers, so somebody needs to come up with a solution and code it.

appleseed · 2 October 2020 18:32

I actually have a solution and code: a patch to net-snmp (snmpd) itself, as well as a LibreNMS patch.

please take a look at

github.com/net-snmp/net-snmp

snmpd: support MemAvailable on Linux

net-snmp:master ← ibigbug:support-linux-avail-mem

opened 07:11AM - 21 Aug 20 UTC

ibigbug

+45 -1

This is my first time walking into this code base, so I don't know what I'm doing. But the idea is to add support of reporting MemAvailable on Linux

```
-> % free -h
              total        used        free      shared  buff/cache   **available**
Mem:          3.8Gi       437Mi       2.5Gi       5.0Mi       936Mi       3.2Gi
Swap:         4.0Gi          0B       4.0Gi
```

please guide if I'm doing this correctly and/or anything that is not covered

github.com/librenms/librenms

[WIP]: Memory: read memSysAvail from snmpd

librenms:master ← ibigbug:sys-avail-mem

opened 11:13AM - 19 Sep 20 UTC

ibigbug

+205 -32

There have already been several discussions on GitHub and the LibreNMS community regarding the memory usage graph:

https://github.com/librenms/librenms/issues/3179
https://github.com/librenms/librenms/issues/5660
https://community.librenms.org/t/making-physical-memory-graphs-valid-and-worthwile/10333/9
https://community.librenms.org/t/supporting-memavailable-on-linux/13171?u=appleseed

Based on this [comment](https://github.com/librenms/librenms/issues/3179#issuecomment-197155806), the "Available" memory couldn't be graphed due to lack of support from snmpd. Now that the feature has been [added](https://github.com/net-snmp/net-snmp/pull/167) to net-snmp, I think it makes sense to use the new "Available Memory" data to draw the memory graphs. See also https://github.com/net-snmp/net-snmp/pull/167#issuecomment-678905462

Putting WIP as the commit was checked into the net-snmp code base but not yet released, and the release date is not known yet. See: https://github.com/net-snmp/net-snmp/pull/167#issuecomment-683793493

Also I believe we need backward compatibility for this change - please suggest how it can be done best :)

BloodyIron · 2 October 2020 18:48

Should we advocate for a change to net-snmpd upstream then too? So this can get fixed comprehensively for all of humanity?

Thanks for your contributions @appleseed ! Out of curiosity (can’t read them this very moment), do your changes fall in-line with what I’m proposing, or did you discover a better way to do it than that? I’m curious!

appleseed · 3 October 2020 07:26

Partially - it will “fix” the memory being “almost always 100%”. Support for the “Available Memory” OID, which is mostly what people are interested in when they type free or similar, was added to net-snmp in the first patch mentioned above. It has been checked in, but the release date is not confirmed yet.

The second WIP PR adds the ability for LibreNMS to read and show the “Available Memory”, so it won’t always show 100% unless the Available Memory on your system is really low.
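Since the PR description asks how backward compatibility could be handled, here is one possible shape for it as a Python sketch. It assumes the polled values arrive as a dict keyed by object name, takes the name memSysAvail from the PR title, and uses a fallback formula (memAvailReal + memBuffer + memCached) that is an assumption for illustration rather than anything taken from the PR.

```python
def pick_available_kb(mem):
    """Return (available_kb, source) from a dict of polled memory values in kB.

    Prefers memSysAvail when the agent reports it. Otherwise falls back to a
    rough estimate (an assumption for this sketch, not the PR's logic).
    """
    if mem.get("memSysAvail") is not None:
        return mem["memSysAvail"], "memSysAvail"
    estimate = mem["memAvailReal"] + mem.get("memBuffer", 0) + mem.get("memCached", 0)
    # Never report more available memory than the host physically has.
    return min(estimate, mem["memTotalReal"]), "estimated"

def used_percent(mem):
    """Used memory as a percentage of memTotalReal."""
    available, _ = pick_available_kb(mem)
    return 100.0 * (mem["memTotalReal"] - available) / mem["memTotalReal"]

# Older agent: values from the snmpwalk output in this thread, no memSysAvail.
old_agent = {
    "memTotalReal": 16412812,
    "memAvailReal": 771748,
    "memBuffer": 136408,
    "memCached": 4724628,
}

# Newer agent: same host with a hypothetical memSysAvail value added.
new_agent = dict(old_agent, memSysAvail=5000000)

print(pick_available_kb(old_agent), pick_available_kb(new_agent))
```

The fallback will tend to overestimate what is really available, because part of the cache (shared memory, for example) cannot be reclaimed, so an agent that reports the real value should always win.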

Regarding the Virtual Memory: reading the SNMP response, I think this is how the SNMP agent reports it, and LibreNMS just displays what it gets as is. Please correct me if I’m wrong.