I want to set up nagios for zfs pool health and replication monitoring.
Is there anyone who has done this before and can recommend resources for getting started?
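For anyone landing here with the same question, a pool health check can be quite small. The following is a minimal sketch, not a tested production plugin: it assumes `zpool status -x` prints "all pools are healthy" when nothing is wrong, and it follows the usual Nagios plugin exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN). The function names `classify_zpool_status` and `check_zpool_health` are made up for illustration.

```shell
#!/bin/sh
# Sketch of a Nagios-style ZFS pool health check (function names are hypothetical).
# Nagios plugin convention: exit 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.

# Classify the text printed by `zpool status -x`.
# Prints one status line and returns the matching plugin exit code.
classify_zpool_status() {
    status_text=$1
    case $status_text in
        "all pools are healthy"*)
            echo "OK - all pools are healthy"
            return 0 ;;
        "")
            echo "UNKNOWN - no output from zpool"
            return 3 ;;
        *)
            # Any other output is treated as a problem; report its first line.
            first_line=$(printf '%s\n' "$status_text" | head -n 1)
            echo "CRITICAL - $first_line"
            return 2 ;;
    esac
}

# Run the real command and classify its output.
check_zpool_health() {
    if ! command -v zpool >/dev/null 2>&1; then
        echo "UNKNOWN - zpool command not found"
        return 3
    fi
    classify_zpool_status "$(zpool status -x 2>&1)"
}
```

To use it as an actual plugin, end the script with `check_zpool_health; exit $?` and point a Nagios or NRPE command at it. Note that this sketch treats every non-healthy output as CRITICAL, including a host with no pools at all.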
I gave Nagios a try but was less than pleased by some of the things I ran into.
- Still using Python 2. This was last year, long after support for Python 2 had been dropped.
- Requirement to manually edit config files to add each host.
It was a trial and I moved on. I’m using Checkmk (raw) in my home lab to monitor hosts, on anything from a Pi Zero W to my x86_64 file server. It knows about ZFS and reports pool health. AFAIK it is a fork of Nagios. I even monitor TP-Link smart bulbs (ping only) because the frequent pinging seems to help them remain connected.
There is quite a bit of documentation on their site. Some things like SMART HDD monitoring took a little tweaking but overall I’ve been happy.
You might want to look at adding syncoid if you haven’t already. It comes with three little scripts that check space and snapshot ‘freshness’ and output in a format compatible with nagios.
What’s your experience with checkmk’s performance?
I tried running various versions of their official “raw” image in Docker on my VPS, and without any hosts or services added, the CPU and RAM usage shoots through the roof the second I log in to the web interface.
It freezes up immediately, and after a couple of minutes the service crashes and reboots.
I tried the daily 2.3 build, the latest stable 2.2 and the latest 2.1 but all of them had the same issue.
I find it hard to believe that getting the data to build the presentation for 0 hosts takes that much resources.
I spent a full afternoon and evening on this, and right now I’m leaning more toward just trying Nagios, even though that probably means I will have to do some tinkering.
Thanks for the tip!
I already use syncoid so i will try to use these scripts if i try nagios
Checkmk performance did not rise to anything that would grab my attention. I haven’t seen anything like what you experienced.
I’m running it in a Docker container on Debian (was Bullseye at time of installation, now Bookworm) on a Supermicro X8SIL MB with Intel(R) Xeon(R) CPU X3460 @ 2.80GHz (as identified by lscpu) and 16GB RAM, so not exactly a beefy host. I’m monitoring 36 hosts, about half of which are ping only and the rest a mix of Debian hosts and Raspberry Pis.
I took a stab at monitoring performance. I watched top while loading the Checkmk page and navigating several screens. At some points I saw several cores running Python at 100% but it never saturated the entire processor. I’m not aware that it has ever crashed, but since it’s running in Docker set to restart unless stopped, Docker might be restarting it without me noticing.
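One way to find out whether Docker has been quietly restarting a container is to read the `RestartCount` field that `docker inspect` reports. The sketch below wraps that in a Nagios-style check; the function names and the container name `checkmk` in the usage note are placeholders, not taken from this setup.

```shell
# Print how many times Docker's restart policy has restarted the container.
container_restart_count() {
    docker inspect --format '{{.RestartCount}}' "$1"
}

# Emit a Nagios-style status line for a container's restart count.
# Returns 0 (OK), 1 (WARNING) or 3 (UNKNOWN).
check_container_restarts() {
    name=$1
    count=$(container_restart_count "$name") || {
        echo "UNKNOWN - could not inspect container $name"
        return 3
    }
    if [ "$count" -gt 0 ]; then
        echo "WARNING - $name has been restarted $count time(s)"
        return 1
    fi
    echo "OK - $name has not been restarted"
    return 0
}
```

Running `check_container_restarts checkmk` on the Docker host would then show whether any silent restarts have happened.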
I started with 2.1 and have upgraded once and am now on 2.2.0.p16. Docker is Docker version 24.0.7, build afdd53b installed from the Docker repo. I wonder if there is something about Docker in a virtualized environment that is causing problems. Or is there a pathological problem with nothing monitored?
I know that Jim Salter (the mover behind this site) uses Nagios so if you have further questions about that, he can probably help.
The VPS is very weak so perhaps it’s just underpowered.
I’ll have to try it out on some on prem servers.
Thank you for having a look!
This comes up so frequently, it’s pretty clear that I really need to get off my ass and write a guide for setting up Nagios properly. Lord knows I’ve never found one anywhere either; I just had to figure it out for myself over years of production use.
Yeah I had to do the same.
It’s not super hard to get working, but their documentation is dense and not particularly well thought out.
Once I wrapped my head around how templating and inheritance worked, it kinda clicked.
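To illustrate the templating idea, here is a rough sketch that prints a host template and one host inheriting from it through the `use` directive. Everything named here is an assumption for illustration: `emit_host_config`, the template name `zfs-host-template`, and the example host. `check-host-alive` and `24x7` are the names used in the sample configuration that ships with Nagios and may differ on your install. A real host definition also needs contacts and notification settings, so treat this as an outline rather than a validated config.

```shell
# Print a Nagios host template plus one host that inherits from it.
# All object names here are hypothetical examples.
emit_host_config() {
    host_name=$1
    address=$2
    cat <<EOF
define host {
    name                    zfs-host-template   ; template name, referenced by "use"
    check_command           check-host-alive
    max_check_attempts      5
    check_period            24x7
    notification_period     24x7
    register                0                   ; template only, not a real host
}

define host {
    use                     zfs-host-template   ; inherit every setting above
    host_name               $host_name
    address                 $address
}
EOF
}
```

For example, `emit_host_config fileserver 192.0.2.10 > fileserver.cfg` writes one file per host. The template block only needs to exist once, so in practice it would live in its own file and the per-host files would contain just the second block.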
@mercenary_sysadmin I’m curious to know your thoughts on his first point, about it still running Python2. Depending on how it’s set up, and if you are monitoring external networks, it would be exposed to the internet at some point. What’s the risk mitigation on that?
Don’t expose it to the internet. Monitor devices via a VPN tunnel.
Wireguard is perfect for this.
First off, I’m not sure what the fuss is about python2. It’s still supported and still patched.
Second, if you’re exposing access to your monitoring system to the naked Internet, you’re doing it DESPERATELY wrong. My Nagios instances cannot be touched at all unless you’re on a specific, trusted wireguard tunnel–a separate one from the wireguard tunnels the Nagios instance reaches out to the clients it’s monitoring from.
Meanwhile, the monitoring tunnels allow each monitored client the ability to ICMP ping the instance itself–but not each other, and nothing more than a ping. This lets the clients easily verify connectivity, while exposing almost no attack surface whatsoever.
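A rough sketch of firewall rules that would produce the behaviour described above, on the Nagios host, might look like the following. This is one possible approach using iptables (IPv4 only) and is not necessarily how the setup above is actually built; the function name and the interface name `wg-mon` are hypothetical. The function only prints the rules so they can be reviewed before anything is applied.

```shell
# Print (not apply) iptables rules for a monitoring tunnel interface.
emit_monitoring_tunnel_rules() {
    iface=$1
    # Allow replies to checks that the Nagios host itself initiated.
    echo "iptables -A INPUT -i $iface -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT"
    # Monitored clients may ping the Nagios host...
    echo "iptables -A INPUT -i $iface -p icmp --icmp-type echo-request -j ACCEPT"
    # ...but may not open any other connection to it.
    echo "iptables -A INPUT -i $iface -j DROP"
    # Clients may not reach each other through the tunnel.
    echo "iptables -A FORWARD -i $iface -o $iface -j DROP"
}
```

Running `emit_monitoring_tunnel_rules wg-mon` lists the four rules. Rule order matters here: the two ACCEPT rules must come before the DROP on the same interface.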
I double checked to see if I was misinformed about python2, but I found this on the python website.
We are volunteers who make and take care of the Python programming language. We have decided that January 1, 2020, was the day that we sunset Python 2. That means that we will not improve it anymore after that day, even if someone finds a security problem in it. You should upgrade to Python 3 as soon as you can.
Am I missing something?
For your second point, I am 100% with you on that. Sounds like the classic mitigation of “If it doesn’t have to be exposed, don’t expose it”.
My concern is around monitoring endpoints like user laptops, outside of the trusted tunnels and networks, and monitoring more than just ICMP, i.e. sending system logs over a secure TCP connection. But this certainly could be mitigated by running VPN full tunnel at all times on those devices as well.
What you’re missing–and to be fair, this might not be sufficient for you–is that while the upstream vendor isn’t maintaining Python2 any more, repository maintainers are still backporting security fixes locally. This can be expected to continue for as long as Python2 is in your distribution’s main repository.
Edit: and what I missed is the sunset period in my own distro, lol/sigh: Canonical stopped backporting fixes a few months ago. D’oh. Ubuntu To End Python 2 Support
Edit2: and I’m missing where the python dependency is? I don’t see a dependency on any version of Python: Ubuntu – Details of package nagios4-cgi in jammy
What am I (still) missing?
Edit3: confirmed: there is no Python2 dependency in Nagios4. First, I check to make sure there isn’t any Python2 already on this (22.04) system:
root@elden:~# dpkg --get-selections | grep -y python2
Now, I ask for an install of nagios4 and look at the dependencies:
root@elden:~# apt install nagios4
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
bsd-mailx libapache2-mod-php libapache2-mod-php8.1 libjs-jquery
liblockfile-bin liblockfile1 libmysqlclient21 libpq5 libradcli4
liburiparser1 monitoring-plugins monitoring-plugins-basic
monitoring-plugins-common monitoring-plugins-standard mysql-common
nagios-images nagios4-cgi nagios4-common nagios4-core php-common php8.1-cli
php8.1-common php8.1-opcache php8.1-readline python3-gpg python3-samba
python3-tdb rpcbind samba-common samba-common-bin samba-dsdb-modules
FWIW, I checked nagios-nrpe-server also. Still no Python2.
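The same check can be done without installing anything by simulating the install with `apt-get -s` and filtering the result. This is a small sketch; `list_python2_packages` is a made-up helper name, and the name pattern only catches packages starting with `python2` or `libpython2`, so older `python-*` module packages would slip through.

```shell
# Read `apt-get -s install <pkg>` output on stdin and print every package
# that would be installed ("Inst" lines) whose name looks like Python 2.
# Prints nothing when there are no matches.
list_python2_packages() {
    awk '$1 == "Inst" { print $2 }' | grep -E '^(lib)?python2' || true
}

# Usage (simulation only, nothing is installed):
#   apt-get -s install nagios4 | list_python2_packages
```

Empty output means the simulated install pulls in no package matching that pattern.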
You’re welcome to any of these checks:
Good to hear there is no more Python2 in Nagios or plugins. I was trying to build NCPA for ARM hosts and as of 2022-12-13 it wanted Python2.7.
I wasn’t aware that distros were still using Python2.7. I thought the statement from the Python devs was the final word. IAC, I’m on Debian 12 (bookworm) and there is no Python2 package in the Debian repos.
Then why would you think there was Python2 in Nagios itself? If Nagios is in the repos (which it absolutely is) and Python2 isn’t… Welp.
NCPA was recommended on the Nagios web site and is part of the project: Monitoring Agent · NCPA
Looks to me like NCPA updated its bundled python version to python3 in December of 2021 with NCPA v2.4.0:
- Changed python default plugin extension to python3 (#786) (ccztux)
And there’s no question that it was python3 as of NCPA v3.0, released last November:
- Updated the bundled Python version to 3.11.3 (PhreditorNG)