Syncoid, multiple hosts and --no-sync-snap

Good afternoon,
I’m still wrestling with snapshots created and managed by syncoid. This was thrashed about a bit in another thread https://discourse.practicalzfs.com/t/keeping-a-minimum-number-of-snapshots/1326 but I thought it was wandering far enough afield to warrant its own thread. Specifically, I’m questioning my use of --no-sync-snap and how it prevents me from easily cleaning up snapshots.

It appears to me that --no-sync-snap will choose the oldest available snapshot whether created by sanoid or syncoid. In my case one of my configurations looks like

A -> C 
B -> C
C -> D (with no-sync-snap)

My scripts run these once daily, back to back and in that sequence. The result is that on D I have a lot of snapshots that match syncoid_C. For this reason, I cannot use the script pasted below to clean up the “foreign” snapshots on D. I’m wondering if it makes more sense for me to eliminate the --no-sync-snap option for the C -> D transfer. That will allow syncoid to manage the snapshots it creates and I can remove the other snapshots w/out causing any problems. (*)

Does this sound right or am I misunderstanding something? If it matters, I have other jobs that transfer filesystems around, including C -> E and C -> B (to different pools, not circular.) I’m inclined to think I should not be using --no-sync-snap with any of these to make cleanup simpler.

The script I plan to use to clean things up (which is still “in testing”, so USE AT YOUR OWN PERIL!) is:

hbarta@olive:~/Programming/shell_scripts/syncoid$ cat purge-foreign-syncoid-snaps.sh
#!/bin/bash

# Purge foreign syncoid snapshots that result from pulling
# snapshots created on other hosts by syncoid runs there.
# AKA What a tangled web we weave!
#
# Only valid snapshots include the string "syncoid_$(hostname)"
# 
Usage="Usage: purge-foreign-syncoid-snapshots.sh pool [pool] ... [pool]"

if [ $# -eq 0 ]
then
    echo "$Usage"
    exit 1
fi

echo "=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ start  " "$(/bin/date +%Y-%m-%d-%H%M)"

for pool in "$@"
do
    echo "Checking $pool"
    for snap in $(/bin/zfs list -t snap -o name -H -r "$pool"|/bin/grep syncoid|/bin/grep -v "syncoid_$(hostname)")
    do
        echo "destroying $snap"
        /bin/zfs destroy "$snap"
    done
done

echo "=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ finish " "$(/bin/date +%Y-%m-%d-%H%M)"
echo
hbarta@olive:~/Programming/shell_scripts/syncoid$ 
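Since an accidental destroy on D is so costly, one way to build confidence first is to exercise the filtering logic against made-up names, with no zfs calls at all. This is a hypothetical sketch (the `would_destroy` helper and the sample snapshot names are invented, assuming the default `syncoid_<hostname>_<timestamp>` naming):

```shell
#!/bin/bash

# Hypothetical dry-run of the purge filter: given snapshot names on stdin
# and a hostname, print what the real script would destroy. No zfs involved.
would_destroy() {
    local host="$1"
    grep 'syncoid' | grep -v "syncoid_${host}"
}

printf '%s\n' \
    "tank/a@syncoid_C_2024-01-01:00:00:01" \
    "tank/a@syncoid_D_2024-01-01:00:05:00" \
    "tank/a@autosnap_2024-01-01_daily" \
    | would_destroy D
# Prints only the foreign sync snap: tank/a@syncoid_C_2024-01-01:00:00:01
```

Once the filter does what you expect on fake input, swapping the `echo` back for `zfs destroy` is the only change.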

(*) D is my remote backup and a 5 hour drive away. If I mess it up, I’m carrying a multi-TB HDD on a road trip to restore it and in the mean time have no remote backups.

Sort of.

Syncoid is just an orchestration wrapper for OpenZFS replication, and therefore works (exactly) like it does. It replicates all available snapshots, unless you go out of your way to make it do something different.

The only purpose of the sync snap is to make certain there is a snapshot to replicate, and to give users (especially newer or more casual ones) a guarantee that the replication chain won’t be broken, which would force abandoning a long-running incremental replication backup scheme and replacing it with a new full.

Doing a full backup instead of incremental might not sound like a very big deal to some folks, but when you’re backing up hundreds of TiB or maybe even a PiB or two… it matters. It matters a lot. :slight_smile:
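A back-of-envelope comparison makes the point; the figures here are my own assumptions, not anything measured:

```shell
# Assumed figures: a 500 TiB pool with roughly 50 GiB of daily churn.
full_tib=500
daily_gib=50

# How many days of incrementals equal one forced full re-seed:
echo $(( full_tib * 1024 / daily_gib ))
# Prints 10240 -- roughly 28 years of daily incrementals in one full send.
```

At those sizes, breaking the incremental chain isn’t an inconvenience, it’s a multi-day (or multi-week) transfer.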

Can you elaborate more on sync snaps and how they achieve these goals? I’m not sure I follow, but maybe that’s because I haven’t experienced first hand the problems that arise.

I tried to review this primer on the wiki, but it really does not discuss sync snaps much at all. The first time I used syncoid, I was totally confused about the syncoid snapshots because I didn’t expect them and they were polluting my carefully crafted snapshot naming conventions.

I also don’t understand what you mean when you say they’re cleaned up automatically. I was encountering lots of errors at first, but I eventually started using --no-sync-snap long before I cleaned up all the errors. Perhaps the errors were thrown before the cleanup?

Thanks!

I wanted relatively normal people to be able to use syncoid as a relatively direct replacement for rsync.

If you rsync two directories, then change a file on the source, then rsync again: your change propagates.

If you replicate a dataset, then change a file on it, then replicate it again… you don’t get the changed file, because you didn’t take a snapshot that captured the change, and ZFS replication is replication of snapshots; the live filesystem on the target simply changes to be a clone of the most recent snapshot.

The sync snap allows the naive expectation–I changed a file, I replicated again, my changes were replicated–to succeed.

Syncoid, when --no-sync-snap isn’t used, first takes a snapshot, then replicates, then–assuming replication completes successfully–gets rid of old sync snapshots on both source and target. But it identifies its own sync snapshots by the hostname of the machine running syncoid, so if you’ve got an A–>B–>C chain, you wind up with “foreign” sync snapshots on hosts where syncoid will never again run from the host that created them, and those snapshots never go away.
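To make that concrete, here’s an illustrative snapshot list as it might look on D after the pulls described above (names invented, assuming the default `syncoid_<hostname>_<timestamp>` sync-snap pattern):

```shell
# Invented snapshot names on host D after pulling from C.
snaps="tank/data@syncoid_C_2024-01-01:00:00:01
tank/data@syncoid_D_2024-01-02:00:00:01
tank/data@autosnap_2024-01-02_00:00_daily"

# Sync snaps a syncoid run on D recognizes as its own and will prune:
echo "$snaps" | grep 'syncoid_D'

# Foreign sync snaps (created by runs on C) that nothing on D will prune:
echo "$snaps" | grep 'syncoid_' | grep -v 'syncoid_D'
```

The sanoid-created `autosnap_*` snapshots are untouched either way; only the leftover `syncoid_C_*` snaps accumulate.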

If you’re only doing A–>B, then syncoid should successfully manage the snapshots including pruning any but the most recent.