Add supervisord monitoring

hvanoch · 11 January 2022 11:05

We would like to monitor supervisord, specifically the state of its processes, so that we can alert when, for instance, a process enters the "FATAL" state.

How would I go about adding this to LibreNMS? I already have client code that outputs something like this:
{
  "version": 1,
  "error": 0,
  "errorString": 0,
  "data": {
    "api_version": "3.0",
    "supervisor_version": "3.4.0",
    "processes": [
      {
        "name": "consumers-default_00",
        "group": "consumers",
        "statename": "RUNNING",
        "state": 20,
        "error": null,
        "start": 1641897745,
        "stop": 1641897744,
        "now": 1641898092,
        "uptime": 347
      },
      {
        "name": "consumers-default_01",
        "group": "consumers",
        "statename": "RUNNING",
        "state": 20,
        "error": null,
        "start": 1641897745,
        "stop": 1641897744,
        "now": 1641898092,
        "uptime": 347
      }
    ]
  }
}
The main issue I am facing at the moment is that the most interesting part is the actual state, which is ideally a string. I believe LibreNMS only stores numbers, so this is a problem.
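One way around the string problem is that each process entry above already carries a numeric `state` next to `statename`. The sketch below assumes the state codes listed in supervisor's own documentation and stores only the number, mapping it back to a name for display. The helper names (`state_name`, `numeric_metrics`) are hypothetical, and the second process is given a made-up FATAL state purely for illustration.

```python
# Process state codes as documented by supervisor. The "state" field in
# the JSON payload is one of these numbers, so it can be stored as-is.
SUPERVISOR_STATES = {
    0: "STOPPED",
    10: "STARTING",
    20: "RUNNING",
    30: "BACKOFF",
    40: "STOPPING",
    100: "EXITED",
    200: "FATAL",
    1000: "UNKNOWN",
}


def state_name(code):
    """Translate a numeric state code back to its name for display."""
    return SUPERVISOR_STATES.get(code, "UNKNOWN")


def numeric_metrics(payload):
    """Flatten the payload into {process name: numeric state} (hypothetical helper)."""
    return {p["name"]: p["state"] for p in payload["data"]["processes"]}


# Trimmed-down payload; the FATAL state (200) is invented for illustration.
sample = {
    "data": {
        "processes": [
            {"name": "consumers-default_00", "state": 20},
            {"name": "consumers-default_01", "state": 200},
        ]
    }
}

for name, code in numeric_metrics(sample).items():
    print(f"{name}: {code} ({state_name(code)})")
```

With this approach an alert on the FATAL state becomes a plain numeric comparison against 200.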

I also wonder whether it is possible to show a table instead of a graph. I haven't come across any examples of this.
If not, what should the graph look like to clearly show the status of each process?

Thanks

laf · 11 January 2022 20:42

The quickest way would be to write a script that accepts -H HOSTNAME args, prints its output in Nagios check script format and exits with the matching response code. You can then use the service checks to call it.

These are the expected status codes the script should return:

[0 => 'OK', 1 => 'Warning', 2 => 'Critical', 3 => 'Unknown']

It can also record performance data from the output, but you'll need to check the code to understand how to use that.
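A minimal sketch of such a check script might look like the following. It assumes supervisord's inet HTTP server is enabled and reachable on port 9001 (the port used in supervisor's sample config), and it reads process info through supervisor's XML-RPC call `supervisor.getAllProcessInfo()`. Which states count as warning or critical is a policy choice made up for this example, and the perfdata labels are likewise hypothetical.

```python
import argparse
import xmlrpc.client

# Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumption: this severity mapping is an example policy, adjust to taste.
CRITICAL_STATES = {"FATAL"}
WARNING_STATES = {"BACKOFF", "EXITED", "UNKNOWN"}


def evaluate(processes):
    """Return (exit_code, status_line) for a list of process info dicts."""
    if not processes:
        return UNKNOWN, "UNKNOWN - no processes reported"

    # Count processes per state name.
    counts = {}
    for proc in processes:
        counts[proc["statename"]] = counts.get(proc["statename"], 0) + 1

    if CRITICAL_STATES & counts.keys():
        code, label = CRITICAL, "CRITICAL"
    elif WARNING_STATES & counts.keys():
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"

    # Human-readable summary, then perfdata after the "|" separator.
    summary = ", ".join(f"{n} {s}" for s, n in sorted(counts.items()))
    perfdata = " ".join(f"{s.lower()}={n}" for s, n in sorted(counts.items()))
    return code, f"{label} - {summary} | {perfdata}"


def main(argv=None):
    parser = argparse.ArgumentParser(description="Check supervisord process states")
    parser.add_argument("-H", dest="hostname", required=True)
    parser.add_argument("-p", dest="port", type=int, default=9001)
    args = parser.parse_args(argv)

    url = f"http://{args.hostname}:{args.port}/RPC2"
    try:
        processes = xmlrpc.client.ServerProxy(url).supervisor.getAllProcessInfo()
    except (OSError, xmlrpc.client.Error) as exc:
        print(f"UNKNOWN - cannot query supervisord: {exc}")
        return UNKNOWN

    code, line = evaluate(processes)
    print(line)
    return code
```

To run this as a standalone check, end the file with `sys.exit(main())` under an `if __name__ == "__main__":` guard (plus `import sys`), so that the return value becomes the process exit code.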

hvanoch · 12 January 2022 14:24

Changed my approach a bit and created a PR.

github.com/librenms/librenms: Add supervisord application

librenms:master ← hvanoch:supervisord, opened 02:07PM - 12 Jan 22 UTC by hvanoch (+322 -0)

This PR adds stats for [supervisord](http://supervisord.org/). It shows the total number of processes per status, and adds the uptime and state (the number used by supervisor) per process. The uptime can help detect a stuck process when we expect it to run only for a certain amount of time. Together these make it possible to alert on specific supervisor statuses.

PR for the agent: https://github.com/librenms/librenms-agent/pull/392

DO NOT DELETE THE UNDERLYING TEXT

#### Please note

> Please read this information carefully. You can run `./lnms dev:check` to check your code before submitting.

- [x] Have you followed our [code guidelines?](https://docs.librenms.org/Developing/Code-Guidelines/)
- [ ] If my Pull Request does some changes/fixes/enhancements in the WebUI, I have inserted a screenshot of it.
- [x] If my Pull Request makes discovery/polling/yaml changes, I have added/updated [test data](https://docs.librenms.org/Developing/os/Test-Units/).

#### Testers

If you would like to test this pull request then please run: `./scripts/github-apply <pr_id>`, i.e `./scripts/github-apply 5926`

After you are done testing, you can remove the changes with `./scripts/github-remove`. If there are schema changes, you can ask on discord how to revert.

This adds stats on the total number of processes per status.
It also shows the uptime per process.

This should give us enough data to add alerts on the important things.
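As a rough illustration of what that data allows, here is a sketch that derives the per-status totals and flags processes that have been running longer than expected. It is not the code from the PR: the function names and the 300-second threshold are assumptions, and the input is the `processes` list from the JSON earlier in the thread, trimmed to the fields used.

```python
from collections import Counter


def totals_per_state(processes):
    """Count how many processes are in each state name."""
    return Counter(p["statename"] for p in processes)


def long_running(processes, max_uptime):
    """Names of RUNNING processes whose uptime exceeds max_uptime seconds."""
    return [
        p["name"]
        for p in processes
        if p["statename"] == "RUNNING" and p["uptime"] > max_uptime
    ]


# The two processes from the example payload (fields trimmed).
processes = [
    {"name": "consumers-default_00", "statename": "RUNNING", "uptime": 347},
    {"name": "consumers-default_01", "statename": "RUNNING", "uptime": 347},
]

print(dict(totals_per_state(processes)))
# Hypothetical threshold: a consumer is expected to finish within 300 seconds.
print(long_running(processes, max_uptime=300))
```

An alert rule could then fire when the FATAL total is above zero, or when the list of long-running processes is not empty.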

system · 12 April 2022 14:25

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.