Sanoid monitoring bash script advice

I’m using sanoid and syncoid to make snapshots and sync them to backup pools. I want to monitor those snapshots, and I see that sanoid has built-in monitoring functions that work with Nagios. I’m using healthchecks.io to monitor most of my homelab, so I wrote this bash script to monitor my pools and snapshots. Any suggestions?

#!/bin/bash

output_snapshots=$(/usr/sbin/sanoid --monitor-snapshots)
output_capacity=$(/usr/sbin/sanoid --monitor-capacity)
output_health=$(/usr/sbin/sanoid --monitor-health)

if [[ $output_snapshots == OK* ]] && [[ $output_capacity == OK* ]] && [[ $output_health == OK* ]]; then
  curl -m 10 --retry 5 https://healthchecks.io/ping/xxxxxxxxxxxxx
else
  echo "One or more checks did not return OK."
  curl -m 10 --retry 5 https://healthchecks.io/ping/xxxxxxxxxxxxx/fail
fi

If you’d like to make the script more robust I suggest adding some edge case coverage.

  • Check and handle curl failures. Right now it looks possible for sanoid to return OK but for curl to fail (say, if the network is down), with no way for healthchecks.io to distinguish between the two cases.

  • Log errors to a local file.

Consider capturing the friendly output of each sanoid check, not just the exit code. Then send the friendly output, concatenated from all non-OK checks, to healthchecks.io instead of just “one or more checks did not return OK.”

That way you’ll be able to see what is wrong, rather than that merely something is wrong, when you get your alerts from the healthchecks.io service.
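Not the actual script, but a rough sketch of how those suggestions could fit together — the log path, ping URL, and the `log`/`run_checks` helper names are all placeholders of mine, not anything from the script above:

```shell
#!/bin/bash
# Sketch only: SANOID path, LOGFILE, and PING_URL are placeholders to adjust.
SANOID="${SANOID:-/usr/sbin/sanoid}"
LOGFILE="${LOGFILE:-/var/log/sanoid-monitor.log}"
PING_URL="${PING_URL:-https://hc-ping.com/your-uuid-here}"

log() { echo "$(date '+%F %T') - $1" >> "$LOGFILE"; }

run_checks() {
  local check out failures=""
  for check in snapshots capacity health; do
    out=$("$SANOID" --monitor-"$check")
    [[ $out == OK* ]] || failures+="$check: $out"$'\n'
  done

  if [[ -z $failures ]]; then
    # -f makes curl exit non-zero on HTTP errors, so a failed ping is detectable.
    curl -fsS -m 10 --retry 5 "$PING_URL" > /dev/null \
      || log "All checks OK, but the healthchecks.io ping failed."
  else
    # Send the concatenated non-OK output so the alert says *what* is wrong.
    curl -fsS -m 10 --retry 5 --data-raw "$failures" "$PING_URL/fail" > /dev/null \
      || log "Failure ping could not be delivered."
    log "Non-OK checks: $failures"
  fi
}

# Uncomment to run for real:
# run_checks
```

Splitting the work into a function also makes it easy to test the logic with a stubbed sanoid before pointing it at the real binary.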


Thanks for the ideas, I appreciate it!
I implemented both of them (logging and sending output to healthchecks), it’s been quite useful. I’ve run into one issue:

Sometimes sanoid --monitor-snapshots takes a while to produce output. When this happens, output_snapshots ends up blank, the script interprets that as something other than OK*, and it sends healthchecks.io a fail ping.

I noticed that if you run sanoid --monitor-snapshots and it takes a while, subsequently running the command returns much quicker. So I added a

/usr/sbin/sanoid --monitor-snapshots
sleep 600

to the beginning of my script. This so far has been working, but seems like a bad way to do it. This also seems like more of an issue with bash or systemd (I’m using a systemd timer and service to run the script). Any suggestions?
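Since a systemd timer and service are in play here: it may be worth checking whether a unit timeout is what’s blanking the slow first run. A hypothetical service unit for a script like this (the unit name, script path, and timeout value are my assumptions, not taken from the actual setup):

```ini
# /etc/systemd/system/sanoid-monitor.service  (hypothetical name and path)
[Unit]
Description=Ping healthchecks.io with sanoid monitor results

[Service]
Type=oneshot
ExecStart=/usr/local/bin/sanoid-monitor.sh
# Give sanoid time to rebuild its snapshot cache on a cold run.
# (Type=oneshot units default to no start timeout, but an explicit
# value makes the intended behavior obvious.)
TimeoutStartSec=300
```

Checking `journalctl -u sanoid-monitor.service` around a failed run should show whether the process is being killed or simply returning empty output.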

If --monitor-snapshots “takes a while,” you should seriously consider decreasing the number of snapshots you’re keeping on hand. The slow part is literally just asking ZFS to list all of your snapshots. The reason it completes so much faster the next time around is that sanoid maintains a cache of the snapshot list, for exactly this reason; but it still needs to invalidate the cache and regenerate everything every now and then, or else it might wind up badly mistaken about what really is or is not present on the system.

This is not an issue with total number of snapshots on the pool, typically, it’s an issue with the total number of snapshots per individual dataset.

How many snapshots is too many? I have ~660 snapshots on my server (split between two pools: 135 on one, 525 on the other), and about 660 snapshots on my backup server (all on one pool).

Again, it’s not about the total per pool, it’s about the total per dataset. Where exactly you start getting into trouble is different for different architectures and different workloads, but usually it’s somewhere around the 100+ mark.

Most of my pools have many thousands of total snapshots, but only sixty or seventy per individual dataset: enough for thirty hourly, thirty daily, three monthly, and the occasional oddball snapshot.
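For anyone who wants to check their own per-dataset counts, a small pipeline over `zfs list` output does the tallying (the function name here is just for illustration):

```shell
#!/bin/sh
# Count snapshots per dataset; feed it the output of `zfs list -H -t snapshot -o name`.
count_per_dataset() {
  # Snapshot names look like pool/dataset@snapname; split on '@' and tally per dataset.
  awk -F'@' '{count[$1]++} END {for (d in count) print count[d], d}' | sort -rn
}

# Typical usage (requires zfs):
#   zfs list -H -t snapshot -o name | count_per_dataset | head
```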

That’s pretty similar to what I have. The dataset with the most snapshots has 76. I’m saving 30 hourly, 30 daily, 12 monthly, and 2 yearly. This is on a system with an i5-8600T, two HDDs, and 16 GB of RAM.

When I say it takes a while, it takes 21 seconds to run sanoid --monitor-snapshots --force-update. So it’s not taking forever, but it seems to be long enough to trip up my script.

21 seconds seems pretty reasonable to me, which is why I think the issue is with how I configured the script, or with bash or systemd.

Yeah 21 seconds is perfectly livable; the answer there is probably just to run sanoid --monitor-snapshots twice in a row, and only pay attention to the output the second time. That way you’ll always be getting the on-cache response, which is quick enough not to fall afoul of whatever timeout you have going on now.
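A minimal sketch of that warm-then-read pattern (the SANOID variable and the function name are my additions for illustration):

```shell
#!/bin/bash
SANOID="${SANOID:-/usr/sbin/sanoid}"

warm_and_read() {
  # First run may rebuild the snapshot cache and take ~20 seconds; discard it.
  "$SANOID" --monitor-snapshots > /dev/null 2>&1
  # Second run answers from the freshly built cache and returns quickly.
  "$SANOID" --monitor-snapshots
}

# In the script, instead of the sleep-600 warm-up:
# output_snapshots=$(warm_and_read)
```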

Sounds good, thanks for the suggestions!

I’d love to do the same on my system as you’re doing – any chance you’re planning to post the final script somewhere? Call me lazy, but taking advantage of the effort you’ve put in since your first post would be great. :)

Sure, here you go:

#!/bin/bash

/usr/sbin/sanoid --monitor-snapshots
sleep 600

# Define the log file
LOGFILE="/where/you/want/your/logfile.log"

# Function to log the outputs with a timestamp
log_output() {
  echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOGFILE"
}

# Run the commands and store their outputs
output_snapshots=$(/usr/sbin/sanoid --monitor-snapshots)
output_capacity=$(/usr/sbin/sanoid --monitor-capacity)
output_health=$(/usr/sbin/sanoid --monitor-health)

# Log the outputs with timestamps
log_output "Snapshot Check: $output_snapshots"
log_output "Capacity Check: $output_capacity"
log_output "Health Check: $output_health"

# Concatenate outputs
output_all="$output_snapshots $output_capacity $output_health"

# Check if all outputs start with "OK"
if [[ $output_snapshots == OK* ]] && [[ $output_capacity == OK* ]] && [[ $output_health == OK* ]]; then
  # If all checks are OK, ping healthchecks
  curl -m 10 --retry 5 https://ping.url
  log_output "All checks are OK. Healthcheck pinged successfully."
else
  echo "One or more checks did not return OK."
  curl -fsS -m 10 --retry 5 --data-raw "$output_all" https://ping.url/fail

  log_output "One or more checks did not return OK. Failure ping sent."
fi

Replace the logfile and ping url with where you want the logfile to live and your ping url. Any suggestions are very welcome!


You commented the rest of your script; don’t forget to comment this part too, or else you’re going to be wondering why the hell you’re running that command twice in a year or two. =)

Great point!

#!/bin/bash

# Make sure the list of snapshots is in cache so the script doesn't time out
/usr/sbin/sanoid --monitor-snapshots
sleep 600


This is awesome! Thank you @slowhawkeclipse!

This may be totally inappropriate and, if it is, know I meant no harm and I’ll take it down immediately, but I’ve posted this to my GitHub to make it a little easier to share and tweak. Initially I posted your actual code, updated only to credit you and send people here.

From there, I updated the code to suit my wants and preferences:

The only other detail to share is that your code threw an error for me on the concatenation line, but that was fixed by adding quotes around the variables.

If you have your own GitHub, would prefer it not be on GitHub at all, etc., I will 100% do that. I am NOT trying to usurp you here at all.


Hey @slowhawkeclipse, please choose a license for your script, since it’s generating interest among other folks. I recommend GPLv3 if you want strong copyleft, BSD 2-clause if you want a simple permissive license, or 0-clause BSD (0BSD) if you want the closest thing to “public domain” that’s actually under your control (private individuals cannot, themselves, declare that things are or are not in the public domain).

If you’re not at all a license wonk and this choice makes you hesitant or anxious: I licensed sanoid itself under GPLv3, so if you choose that one, you’ll be in the company of the project that led you here in the first place. :)


Topslakr - Posting it on GitHub is fine with me. I appreciate your edits!

I’m definitely not a license wonk, let’s go with GPLv3.


Thanks for specifying! I know it doesn’t always matter much to hobbyists, but there are a lot of situations where it’s either not possible or not safe to use unlicensed code. <3

I just want to point out that unless the script is called at a time synchronized to the sanoid.timer events, there is a possibility that the sanoid snapshot cache file times out in between the two calls.

Also, in a standard setup, sanoid must run as root to be allowed to update the cache, and your monitoring script probably isn’t running as root. I ran into this when using Nagios to monitor my snapshots. But that’s no longer a problem, since sanoid, as of fairly recently, uses a much longer cache timeout when only the --monitor-* parameters are used.

I’m not sure what happens if the timer triggers a cache update, which is in progress when the monitoring script is called.