Just wanna share my war story with the class.
A few months ago I set up a new pool with native ZFS encryption enabled. I called the pool tank_e to distinguish it from my non-encrypted tank. I enabled sanoid to take snapshots of three of the datasets there. On top of that, I also set up syncoid to back those datasets, snapshots included, up to another host. I made sure both (sanoid and syncoid) worked fine, smiled, and moved on.
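For context, the setup looked roughly like this (a sketch, not the exact commands; the disk path, remote host, and target dataset name are stand-ins):

```sh
# create a pool with native ZFS encryption enabled on its root dataset
# (the disk path is a placeholder)
zpool create -O encryption=aes-256-gcm -O keyformat=passphrase \
    -O keylocation=prompt tank_e /dev/disk/by-id/some-disk

# replicate a sanoid-snapshotted dataset, snapshots and all, to another host
# (host and target dataset are placeholders)
syncoid --recursive tank_e/root/thinker backuphost:backup/thinker
```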
This morning, while working on something totally different, I spotted log entries like this one in my /var/log/syslog:
```
2023-08-03T14:15:23.438318-07:00 thinker sanoid[348445]: CRITICAL ERROR: zfs snapshot tank_e/root/thinker/kvm_pool@autosnap_2023-08-03_21:15:23_monthly failed, 512 at /usr/sbin/sanoid line 627.
```
And like, huh? What’s going on? Then I realized that a few weeks ago I had ditched the native ZFS encryption because it was erroring out on me a few times a month: my host would spit a kernel panic to syslog and lock up.
Since that was my private homelab pool, I decided to abandon encryption temporarily (until I figure it out). So, I created an unencrypted dataset and recursively moved all of my tank_e/root datasets to the new tank_e/root_non_e (as in “non-encrypted”). And I totally forgot to adjust sanoid’s config. Sanoid had been screaming into /var/log/syslog ever since, but I hadn’t been paying attention. I only noticed today, by accident, while working on something else entirely.
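The move itself was along these lines (a sketch, assuming a send/receive-based migration; the @migrate snapshot name is made up):

```sh
# parent dataset for the unencrypted copies
zfs create tank_e/root_non_e

# snapshot everything under the encrypted root, then replicate it;
# a plain (non-raw) send decrypts the data in flight, and -x encryption
# keeps receive from carrying the encryption property over, so the
# copies land unencrypted under the new parent
zfs snapshot -r tank_e/root@migrate
zfs send -R tank_e/root@migrate | zfs receive -u -x encryption tank_e/root_non_e/root
```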
So, I updated the paths to my snapshotted datasets in sanoid.conf and… employed post_snapshot_script on them to curl my newly created healthchecks.io endpoints. When it happens again, I’ll get paged (my healthchecks.io account is integrated with my PagerDuty account).
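The relevant sanoid.conf stanza now looks something like this (the dataset path, template values, script path, and ping UUID are all illustrative):

```ini
[tank_e/root_non_e/root/thinker/kvm_pool]
        use_template = production
        post_snapshot_script = /usr/local/bin/hc-ping.sh

[template_production]
        hourly = 24
        daily = 30
        monthly = 3
        autosnap = yes
        autoprune = yes
```

And the script is just the standard healthchecks.io ping (UUID is a placeholder):

```sh
#!/bin/sh
# hc-ping.sh -- ping healthchecks.io after each successful snapshot;
# if snapshots stop, the pings stop, the check goes "down", and
# healthchecks.io escalates to PagerDuty
curl -fsS -m 10 --retry 5 -o /dev/null https://hc-ping.com/your-uuid-here
```

The nice part of this dead-man’s-switch shape is that it catches exactly this failure mode: sanoid quietly no longer snapshotting a dataset.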
Always monitor your setup

The fact that it works today doesn’t mean it’ll work tomorrow. You (or someone on your team) can change things. And, well…