ZFS self-healing via remote snapshot

G’day

Very new to ZFS and Practical ZFS, but I’m interested in using ZFS for a Jellyfin server. However, I don’t have enough hard drives to use RAIDZ and keep a backup, as I only have two drives. I was planning on using one drive in the main server and replicating its contents to a server located remotely. I was just wondering whether ZFS is able, in the event that it finds corruption on the main drive, to use the snapshot on the remote server to “heal” its data, if I’m willing to roll the main drive back to a shared snapshot?
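
For the replication itself, I was imagining plain zfs send/receive over SSH, something like the below (pool, dataset, and hostnames are just placeholders, so correct me if I’ve got this wrong):

```
# Take a snapshot of the local dataset
zfs snapshot tank/media@2024-01-01

# Initial full replication to the remote server (-u = don't mount on receive)
zfs send tank/media@2024-01-01 | ssh backup zfs receive -u backuppool/media

# Later, send only the changes since the last common snapshot
zfs snapshot tank/media@2024-01-02
zfs send -i tank/media@2024-01-01 tank/media@2024-01-02 | ssh backup zfs receive -u backuppool/media
```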

Sorry if this doesn’t make sense or is a duplicate.

Cheers,

I don’t think it’s possible to do that automatically without some scripting, but you could certainly roll back a snapshot (or even copy specific files) from a remote replication target.
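
For the copy-specific-files route: every snapshot on the backup is browsable read-only under the dataset’s hidden .zfs/snapshot directory, so you can pull individual files back by hand. Something like this, with made-up hostnames, snapshot names, and paths:

```
# On the backup server, snapshots are exposed read-only under .zfs/snapshot
# (run `zfs set snapdir=visible backuppool/media` if you want it to show up in ls)
ssh backup ls /backuppool/media/.zfs/snapshot/

# Copy a single known-good file back to production
scp backup:/backuppool/media/.zfs/snapshot/autosnap_2024-01-01_daily/movie.mkv /tank/media/movie.mkv
```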

Also, if you start with two individual disks in separate pools like you’re suggesting, you can add a disk to each pool later on and turn them into mirrors.
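
The attach itself is a one-liner; pool and device names below are just examples:

```
# Attach a second disk to the existing single-disk vdev; ZFS resilvers
# automatically and the pool becomes a two-way mirror
zpool attach tank ada0 ada1

# Watch the resilver progress
zpool status tank
```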

I was just wondering whether ZFS is able, in the event that it finds corruption on the main drive, to use the snapshot on the remote server to “heal” its data, if I’m willing to roll the main drive back to a shared snapshot?

Let’s say you’re using sanoid to maintain 30 hourly snapshots, 30 dailies, and 3 monthlies. Now, let’s say that some data on your server gets corrupted on disk, and that data sits on a block that’s still referenced in the live filesystem and is thirty-seven days old.
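
That retention policy would look roughly like this in /etc/sanoid/sanoid.conf (the dataset name is made up):

```
[tank/media]
        use_template = production

[template_production]
        hourly = 30
        daily = 30
        monthly = 3
        autosnap = yes
        autoprune = yes
```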

This means that replication won’t work (because there’s a corrupt block), so in order to get replication back, you’ll first need to destroy the corrupt block. In this case, you’ll do that by rolling back to your second monthly, to get behind the point where the now-corrupt block first appeared.
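
Concretely, that rollback looks something like this (snapshot names are illustrative):

```
# List snapshots with creation times to find the second monthly
zfs list -t snapshot -o name,creation tank/media

# Roll back to it; -r destroys the intervening (newer) snapshots,
# including the ones still referencing the corrupt block
zfs rollback -r tank/media@autosnap_2024-01-01_monthly
```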

Of course, that leaves you losing somewhere between 30 and 60 days of data, so the next step is to replicate backwards from your backup server onto your production. You’ve got the second monthly as a common snapshot between the two, so syncoid (companion app to sanoid) picks that up as a base, and uses the newest snapshot present on the backup (probably last night’s daily) as the other end. Once it finishes replicating, you’ve improved your position from 30-60 days of data lost (what the local rollback alone left you with) to less than a day lost (the time between now and the most recent replication to backup).
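
That reverse replication is a single syncoid invocation, something like the following (hostnames and dataset names are examples):

```
# Pull from the backup server back onto production; syncoid finds the
# common monthly snapshot as its incremental base automatically.
# --no-sync-snap skips creating syncoid's own sync snapshot for this run.
syncoid --no-sync-snap root@backup:backuppool/media tank/media
```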

Lovely, that sounds fantastic.

Will ZFS let me know how many snapshots I need to roll back so that I am before the corruption?

Thanks for the help; it is greatly appreciated.

It would still be less clunky and orders of magnitude faster if scrub could be given a remote endpoint to use for self-healing corrupted blocks. Would be a really neat feature!
