Just wanna share my war story with the class.
A few months ago I set up a new pool with native ZFS encryption enabled. I called the pool tank_e to distinguish it from my non-encrypted tank. I enabled sanoid to take snapshots of three of the datasets there. On top of that, I also set up syncoid to back those datasets, snapshots included, up to another host. I made sure both (sanoid and syncoid) worked fine, smiled, and moved on.
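For context, the setup looked roughly like this (a sketch, not the exact commands; the disk path, remote host, and target dataset name are stand-ins):

```sh
# create a pool with native ZFS encryption enabled on its root dataset
# (the disk path is a placeholder)
zpool create -O encryption=aes-256-gcm -O keyformat=passphrase \
    -O keylocation=prompt tank_e /dev/disk/by-id/some-disk

# replicate a sanoid-snapshotted dataset, snapshots and all, to another host
# (host and target dataset are placeholders)
syncoid --recursive tank_e/root/thinker backuphost:backup/thinker
```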
This morning, while working on something totally different, I spotted log entries like this one in my /var/log/syslog:
```
2023-08-03T14:15:23.438318-07:00 thinker sanoid[348445]: CRITICAL ERROR: zfs snapshot tank_e/root/thinker/kvm_pool@autosnap_2023-08-03_21:15:23_monthly failed, 512 at /usr/sbin/sanoid line 627.
```
And like, huh? What’s going on? Then I realized that a few weeks ago I had ditched the native ZFS encryption because it was erroring out on me a few times a month: my host would spit a kernel panic to syslog and lock up.
Since that was my private homelab pool, I decided to abandon encryption temporarily (until I figure it out). So, I created an unencrypted dataset and recursively moved all of my tank_e/root datasets to the new tank_e/root_non_e (as in “non-encrypted”). And I totally forgot to adjust sanoid’s config. Sanoid had been screaming into /var/log/syslog ever since, but I hadn’t been paying attention. I only noticed today, by accident, while working on something else entirely.
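The move itself was along these lines (a sketch, assuming a send/receive-based migration; the @migrate snapshot name is made up):

```sh
# parent dataset for the unencrypted copies
zfs create tank_e/root_non_e

# snapshot everything under the encrypted root, then replicate it;
# a plain (non-raw) send decrypts the data in flight, and -x encryption
# keeps receive from carrying the encryption property over, so the
# copies land unencrypted under the new parent
zfs snapshot -r tank_e/root@migrate
zfs send -R tank_e/root@migrate | zfs receive -u -x encryption tank_e/root_non_e/root
```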
So, I updated the paths to my snapshotted datasets in sanoid.conf and… employed post_snapshot_script on them to curl my newly created healthchecks.io endpoints. When it happens again, I’ll get paged (my healthchecks.io account is integrated with my PagerDuty account).
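The relevant sanoid.conf stanza now looks something like this (the dataset path, template values, script path, and ping UUID are all illustrative):

```ini
[tank_e/root_non_e/root/thinker/kvm_pool]
        use_template = production
        post_snapshot_script = /usr/local/bin/hc-ping.sh

[template_production]
        hourly = 24
        daily = 30
        monthly = 3
        autosnap = yes
        autoprune = yes
```

And the script is just the standard healthchecks.io ping (UUID is a placeholder):

```sh
#!/bin/sh
# hc-ping.sh -- ping healthchecks.io after each successful snapshot;
# if snapshots stop, the pings stop, the check goes "down", and
# healthchecks.io escalates to PagerDuty
curl -fsS -m 10 --retry 5 -o /dev/null https://hc-ping.com/your-uuid-here
```

The nice part of this dead-man’s-switch shape is that it catches exactly this failure mode: sanoid quietly no longer snapshotting a dataset.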
Always monitor your setup

The fact that it works today doesn’t mean it’ll work tomorrow. You (or someone on your team) can change things. And, well…