Sanoid monitoring bash script advice

slowhawkeclipse · October 1, 2024, 1:38am

I’m using sanoid and syncoid to make snapshots and sync them to backup pools. I want to monitor those snapshots and I see that sanoid has built in monitoring functions that work with nagios. I’m using healthchecks.io to monitor most of my homelab. So I wrote this bash script to monitor my pools and snapshots. Any suggestions?

#!/bin/bash

output_snapshots=$(/usr/sbin/sanoid --monitor-snapshots)
output_capacity=$(/usr/sbin/sanoid --monitor-capacity)
output_health=$(/usr/sbin/sanoid --monitor-health)

if [[ $output_snapshots == OK* ]] && [[ $output_capacity == OK* ]] && [[ $output_health == OK* ]]; then
  curl -m 10 --retry 5 https://healthchecks.io/ping/xxxxxxxxxxxxx
else
  echo "One or more checks did not return OK."
  curl -m 10 --retry 5 https://healthchecks.io/ping/xxxxxxxxxxxxx/fail

tmo · October 1, 2024, 12:35pm

If you’d like to make the script more robust I suggest adding some edge case coverage.

check and handle if curl fails. Right now it looks possible for Sanoid to return OK but curl could fail (say, network is down) without a way for healthchecks.io to distinguish between each case
log errors to a local file.

mercenary_sysadmin · October 2, 2024, 1:29pm

Consider capturing the friendly output of each sanoid check, not just the exit code. Then send the friendly output, concatenated from all non-OK checks, to healthchecks.io instead of just “one or more checks did not return OK.”

That way you’ll be able to see what is wrong, rather than that merely something is wrong, when you get your alerts from the healthchecks.io service.

slowhawkeclipse · October 20, 2024, 11:17pm

Thanks for the ideas, I appreciate it!
I implemented both of them (logging and sending output to healthchecks), it’s been quite useful. I’ve run into one issue:

Sometimes sanoid --monitor-snapshots takes a while to return an output. When this happens, the script sets output_snapshots as blank. Then the script interprets this output as something other than OK* and sends healthchecks.io a fail check.

I noticed that if you run sanoid --monitor-snapshots and it takes a while, subsequently running the command returns much quicker. So I added a

/usr/sbin/sanoid --monitor-snapshots
sleep 600

to the beginning of my script. This so far has been working, but seems like a bad way to do it. This also seems like more of an issue with bash or systemd (I’m using a systemd timer and service to run the script). Any suggestions?

mercenary_sysadmin · October 21, 2024, 2:00am

If --monitor-snapshots “takes a while” you should seriously consider decreasing the number of snapshots you’re keeping on hand. The thing that takes so long is literally just asking zfs to list all your snapshots at all, and the reason it completes so much faster the next time around is because Sanoid maintains a cache of the snapshot list, for exactly this reason, but it still needs to invalidate the cache and regenerate everything every now and then or else it might wind up badly mistaken about what really is or is not present on the system.

This is not an issue with total number of snapshots on the pool, typically, it’s an issue with the total number of snapshots per individual dataset.

slowhawkeclipse · October 21, 2024, 12:41pm

How many snapshots is too many? I have ~660 snapshots on my server (split between two pools: 135 on one, 525 on the other), and about 660 snapshots on my backup server (all on one pool).

mercenary_sysadmin · October 21, 2024, 3:15pm

Again, it’s not about the total per pool, it’s about the total per dataset. Where exactly you start getting into trouble is different for different architectures and different workloads, but usually it’s somewhere around the 100+ mark.

Most of my pools have many thousands of total snapshots, but only sixty or seventy per individual dataset: enough for thirty hourly, thirty daily, three monthly, and the occasional oddball snapshot.

slowhawkeclipse · October 21, 2024, 3:33pm

That’s pretty similar to what I have. The dataset with the most snapshots has 76 snapshots. I’m saving 30 hourly, 30 daily, 12 monthly, 2 yearly. This is on a system with a i5-8600T, 2x HDDs and 16GB of ram.

When I say it takes a while, it takes 21 seconds to run sanoid --monitor-snapshots --force-update. So it’s not taking forever, but it seems to be long enough to trip up my script.

slowhawkeclipse · October 22, 2024, 12:01am

21 seconds seems pretty reasonable to me. Which is why I think the issue is with how I configured the script or bash or systemd I think.

mercenary_sysadmin · October 22, 2024, 6:04am

Yeah 21 seconds is perfectly livable; the answer there is probably just to run sanoid --monitor-snapshots twice in a row, and only pay attention to the output the second time. That way you’ll always be getting the on-cache response, which is quick enough not to fall afoul of whatever timeout you have going on now.

slowhawkeclipse · October 22, 2024, 1:13pm

Sound good, thanks for the suggestions!

Topslakr · October 22, 2024, 2:56pm

I’d love to do the same on my system as you’re doing – any chance you’re planning to post the final script somewhere? Call me lazy, but taking advantage of your efforts since your first post would be great

slowhawkeclipse · October 22, 2024, 5:35pm

Sure, here you go:

#!/bin/bash

/usr/sbin/sanoid --monitor-snapshots
sleep 600

# Define the log file
LOGFILE="/where/you/want/your/logfile.log"

# Function to log the outputs with a timestamp
log_output() {
  echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOGFILE"
}

# Run the commands and store their outputs
output_snapshots=$(/usr/sbin/sanoid --monitor-snapshots)
output_capacity=$(/usr/sbin/sanoid --monitor-capacity)
output_health=$(/usr/sbin/sanoid --monitor-health)

# Log the outputs with timestamps
log_output "Snapshot Check: $output_snapshots"
log_output "Capacity Check: $output_capacity"
log_output "Health Check: $output_health"

# Concatenate outputs
output_all=$output_snapshots $output_capacity $output_health

# Check if all outputs start with "OK"
if [[ $output_snapshots == OK* ]] && [[ $output_capacity == OK* ]] && [[ $output_health == OK* ]]; then
  # If all checks are OK, ping healthchecks
  curl -m 10 --retry 5 https://ping.url
  log_output "All checks are OK. Healthcheck pinged successfully."
else
  echo "One or more checks did not return OK."
  curl -fsS -m 10 --retry 5 --data-raw "$output_all" https://ping.url/fail

  log_output "One or more checks did not return OK. Failure ping sent."
fi

Replace the logfile and ping url with where you want the logfile to live and your ping url. Any suggestions are very welcome!

mercenary_sysadmin · October 22, 2024, 6:43pm

You commented the rest of your script; don’t forget to comment this part too, or else you’re going to be wondering why the hell you’re running that command twice in a year or two. =)

slowhawkeclipse · October 22, 2024, 7:03pm

Great point!

#!/bin/bash

# Make sure the list of snapshots are in cache so the script doesn't timeout
/usr/sbin/sanoid --monitor-snapshots
sleep 600

Topslakr · October 22, 2024, 9:06pm

This is awesome! Thank you @slowhawkeclipse!

This may be totally inappropriate and, if it is, know I meant no harm and I will take it down immediately, but I’ve posted this to my Github to make it a little easier to share and tweak. I posted, initially, your actual code updated only to put credit in for you and send people here.

github.com

Topslakr/Bash_Scripts/blob/502db4f4bcc6c17c875e47d0cbe48e38661603b4/sanoid_healthcheck

#!/bin/bash
#This is 100% the work of slowhawkeclipse on the ZFS Discord.
#User: https://discourse.practicalzfs.com/u/slowhawkeclipse/summary
#Thread: https://discourse.practicalzfs.com/t/sanoid-monitoring-bash-script-advice/1849

/usr/sbin/sanoid --monitor-snapshots
sleep 600

# Define the log file
LOGFILE="/where/you/want/your/logfile.log"

# Function to log the outputs with a timestamp
log_output() {
  echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOGFILE"
}

# Run the commands and store their outputs
output_snapshots=$(/usr/sbin/sanoid --monitor-snapshots)
output_capacity=$(/usr/sbin/sanoid --monitor-capacity)
output_health=$(/usr/sbin/sanoid --monitor-health)

This file has been truncated. show original

From there, I updated the code to suit my wants and preferences:

github.com

Topslakr/Bash_Scripts/blob/main/sanoid_healthcheck

#!/bin/bash
#This is based on the work of slowhawkeclipse on the ZFS Discord.
#User: https://discourse.practicalzfs.com/u/slowhawkeclipse/summary
#Thread: https://discourse.practicalzfs.com/t/sanoid-monitoring-bash-script-advice/1849

# Define Variables
LOGFILE="/path/to/your/log"
HEALTHCHECK="https://your.url.here"
FAIL_URL="${HEALTHCHECK}/fail"

# Function to log the outputs with a timestamp, purging old log first
log_output() {
  echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" > "$LOGFILE"
}

log_output "Initial Check of Snapshots"
/usr/sbin/sanoid --monitor-snapshots > /dev/null
log_output "Allow Snapshots DB to be built if needed"
sleep 30

This file has been truncated. show original

The only other detail to share is that your code did throw an error for me for your concatenation, but that was fixed by adding quotes to the variables contents.

If you have your own Github, would prefer it not be on Github at all, etc., I will 100% do that. I am NOT trying to usurp you here at all.

mercenary_sysadmin · October 23, 2024, 1:07am

Hey @slowhawkeclipse, please choose a license for your script, since it’s generating interest among other folks. I recommend GPLv3 if you want strong copyleft, BSD 2-clause if you want weak copyleft, or BSD-0 clause if you want the closest thing to “public domain” that’s actually under your control (private individuals actually cannot, themselves, declare that things are or are not in the public domain).

If you’re not at all a license wonk and you’re hesitating and this makes you anxious: I licensed Sanoid itself under GPLv3, so if you choose that one, you’ll be in the company that led you here in the first place.

slowhawkeclipse · October 23, 2024, 7:19pm

Topslakr - Posting it on github is fine with me. I appreciate your edits!

I’m definitely not a license wonk, let’s go with GPLv3.

mercenary_sysadmin · October 23, 2024, 9:30pm

Thanks for specifying! I know it doesn’t always matter much to hobbyists, but there are a lot of situations where it’s either not possible or not safe to use unlicensed code. <3

naut · November 22, 2024, 10:18am

I just want to point out that unless the script is called at time that is synchronized to the sanoid.timer events, there is a possibility that the sanoid snapshot cache file times out inbetween the two calls.

Also, in a standard setup, sanoid must be run as root to be allowed to update the cache, and your monitoring script probably isn’t. I ran into this when using Nagios to monitor my snapshots. But that is no longer a problem, since sanoid as of pretty recently has a much longer cache timeout value if only using the --monitor-* parameters.

I’m not sure what happens if the timer triggers a cache update, which is in progress when the monitoring script is called.