Just wanna share my war story with the class.
A few months ago I set up a new pool with native ZFS encryption enabled. Called the pool tank_e to distinguish it from my non-encrypted tank. I enabled sanoid to take snapshots of 3 of the datasets there. On top of that, I also set up syncoid to back those datasets, snapshots and all, up to another host. Made sure both (sanoid and syncoid) worked fine, smiled and moved on.
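For context, the setup looked roughly like this. The dataset name comes from the log line below; the template values are the usual sanoid example defaults, not necessarily what I had:

```
# /etc/sanoid/sanoid.conf (sketch)
[tank_e/root/thinker/kvm_pool]
        use_template = production

[template_production]
        frequently = 0
        hourly = 36
        daily = 30
        monthly = 3
        autosnap = yes
        autoprune = yes
```

The syncoid side was just a cron job running something along the lines of `syncoid -r tank_e/root/thinker/kvm_pool backuphost:tank/backups/kvm_pool` (host and target names made up here).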
This morning, while working on something totally different, I spotted log entries like this one in my /var/log/syslog:
2023-08-03T14:15:23.438318-07:00 thinker sanoid: CRITICAL ERROR: zfs snapshot tank_e/root/thinker/kvm_pool@autosnap_2023-08-03_21:15:23_monthly failed, 512 at /usr/sbin/sanoid line 627.
And like, huh? What’s going on? Then I realized that a few weeks ago I had ditched the native ZFS encryption because it was erroring out on me a few times a month – my host would spit a kernel panic to syslog and lock up on me.
Since that was my private homelab pool, I decided to abandon encryption temporarily (until I figure it out). So, I created an unencrypted dataset and moved (recursively) all my tank_e/root datasets to the new tank_e/root_non_e (as in “non-encrypted”). And totally forgot to adjust sanoid’s config. It had been screaming into /var/log/syslog the whole time, but I hadn’t been paying attention… until I noticed it today by accident.
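The move itself was something like the sketch below. I’m reconstructing from memory, so double-check zfs-send(8)/zfs-receive(8) before copying any of it: a non-raw send of an encrypted dataset streams plaintext, and `-x encryption` on receive (supported on recent OpenZFS) drops the encryption property so the received datasets inherit from their unencrypted parent:

```
zfs snapshot -r tank_e/root@pre_move
zfs send -R tank_e/root@pre_move | zfs recv -x encryption tank_e/root_non_e
# only after verifying the new datasets mount and read fine:
zfs destroy -r tank_e/root
```

The important bit for this story: the dataset paths changed, and nothing told sanoid about it.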
So, I updated the paths to my snapshotted datasets in sanoid.conf and… employed post_snapshot_script on them to curl my newly created healthchecks.io endpoints. When it happens again, I’ll get paged (my healthchecks.io account is integrated with my PagerDuty account).
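Concretely, that looks something like this (the UUID is a placeholder for your own healthchecks.io check, and the script path is my choice, not anything sanoid mandates):

```
# sanoid.conf: wire a post-snapshot hook into the (renamed) dataset
[tank_e/root_non_e/thinker/kvm_pool]
        use_template = production
        post_snapshot_script = /usr/local/bin/ping-healthchecks.sh
```

```
#!/bin/sh
# /usr/local/bin/ping-healthchecks.sh
# Called by sanoid after a successful snapshot run; pings healthchecks.io.
# Replace <your-check-uuid> with the UUID from your check's page.
curl -fsS -m 10 --retry 5 -o /dev/null "https://hc-ping.com/<your-check-uuid>"
```

The nice part is that this works as a dead man’s switch: healthchecks.io alerts when pings *stop* arriving, so if sanoid silently stops snapshotting again (like after my rename), the missing pings trip the check and PagerDuty pages me.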
Always monitor your setup

The fact that it works today doesn’t mean it’ll work tomorrow. You (or someone from your team) can change things. And, well…