Dangling Syncoid snapshots

Here’s my setup:

Home:

  • rpool - SSD mirror
  • tank - HDD RAIDZ2

Remote:

  • rpool - HDD RAIDZ2

On both servers, I run Sanoid in cron, which I think is irrelevant to this conversation, but I thought I should mention it anyway.

On my Home server, I run Syncoid to do the following:

  • Backup rpool → tank (locally)
  • Backup tank → Remote (partial)
  • Backup rpool → Remote
  • Backup Remote’s rpool → Home tank (excluding the backup datasets)

As stated above, I use my home server as the main orchestration point for all of this, to keep it simple.
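
Roughly speaking, the four jobs look like this (the dataset paths, the remote address, and the exclude pattern are simplified placeholders rather than my exact commands):

# rpool -> tank (local)
syncoid -r rpool tank/backup/zfs/local/rpool

# tank -> Remote (partial; selected datasets only)
syncoid -r tank/photos root@remote:rpool/backup/zfs/ga/tank/photos

# rpool -> Remote
syncoid -r rpool root@remote:rpool/backup/zfs/ga/rpool

# Remote rpool -> Home tank, excluding Remote's own backup tree
syncoid -r --exclude='backup' root@remote:rpool tank/backup/zfs/sc/rpool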

The problem I have is that if a backup fails or is interrupted, old syncoid snapshots live forever (I don’t use --no-sync-snap):

At Home:

~> zfs list -t snap | grep syncoid | wc -l
4839

At Remote:

~> zfs list -t snap | grep syncoid | wc -l
4801

A sampling of these defunct snapshots:

...
tank/backup/zfs/local/rpool/images@syncoid_rpool-to-sc_cleteServer_2024-12-23:04:53:40-GMT-05:00                       0B      -  6.81G  -
tank/backup/zfs/local/rpool/images@syncoid_rpool-to-sc_cleteServer_2024-12-24:05:36:16-GMT-05:00                       0B      -  6.81G  -
tank/backup/zfs/local/rpool/images@syncoid_rpool-to-sc_cleteServer_2024-12-25:04:25:37-GMT-05:00                       0B      -  6.81G  -
tank/backup/zfs/local/rpool/images@syncoid_rpool-to-sc_cleteServer_2024-12-26:05:02:49-GMT-05:00                       0B      -  6.81G  -
tank/backup/zfs/local/rpool/images@syncoid_rpool-to-sc_cleteServer_2024-12-27:06:34:29-GMT-05:00                       0B      -  6.81G  -
tank/backup/zfs/local/rpool/images@syncoid_rpool-to-sc_cleteServer_2024-12-28:18:50:42-GMT-05:00                       0B      -  6.81G  -
tank/backup/zfs/local/rpool/images@syncoid_rpool-to-sc_cleteServer_2024-12-29:08:00:38-GMT-05:00                       0B      -  6.81G  -
tank/backup/zfs/local/rpool/images@syncoid_rpool-to-sc_cleteServer_2024-12-30:05:49:32-GMT-05:00                       0B      -  6.81G  -
...

I am afraid if I just delete all the Syncoid snapshots, that I will break replication. Is there any easy command to clean up only the unused ones?

You’re misunderstanding your problem.

Syncoid cleans up its own snapshots when possible, but when the sync snaps from one replication chain (e.g. rpool–>tank) get caught up in another replication chain (e.g. tank–>remote), there’s no way for the tank–>remote syncoid process to manage the sync snaps created by rpool–>tank, so those snapshots accumulate at remote.

You can alleviate this by using --no-sync-snap, or by manually destroying “foreign” sync snapshots at remote yourself every now and then.
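
For example, at remote you can see how many of the other chain’s sync snaps are sitting around with something like this (the path and the identifier in the grep are illustrative; match whatever the foreign snaps actually carry):

zfs list -H -t snapshot -o name -r rpool/backup | grep '@syncoid_rpool-to-tank_' | wc -l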

You can also manually set the sync snap identifier (which defaults to the hostname of the machine actually running the syncoid process) if you find that helpful; I can’t remember the name of the argument off the top of my head but running syncoid with no arguments will show you a list of arguments you can use, including that one.


Ahhh you are right. I had read previous threads in this forum such as this one: Syncoid, multiple hosts and --no-sync-snap

I had assumed since I was managing all my syncing from one host that I was avoiding such an issue.

It turns out you are right as usual.

On Home (cleteServer), I have hundreds of “tank/backup/zfs/local/rpool/images@syncoid_rpool-to-sc_cleteServer_xx” snapshots. On Remote (what I call SC in the identifier), I have only a single “images@syncoid_rpool-to-sc_cleteServer_xx” snapshot.

I’m also getting confused by the --identifier values I used. I think I may destroy every backup and switch to IDs of the form <source location>-<source pool or dataset>-to-<destination location>-<destination pool>, which would give things like ga-rpool-to-ga-tank, ga-rpool-to-sc-rpool, or ga-tank-photos-to-sc-tank. Right now, seeing all this photos-to-sc and rpool-to-tank stuff is confusing me.

What if I do this:

  1. Fix identifiers to be more clear
  2. Destroy everything (I have cloud backups of all important data in addition to this cross-backup scheme)
  3. Use --no-sync-snap for local-to-local backups
  4. Continue to use sync snaps for remote backups

Based on what you described, I think that would fix my issues.
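
In syncoid terms, I’m picturing something roughly like this (paths are placeholders):

# local-to-local: no sync snaps, rely on existing (sanoid) snapshots as the common points
syncoid --no-sync-snap -r rpool tank/backup/zfs/local/rpool

# remote legs: keep sync snaps, but with the clearer identifiers
syncoid --identifier=ga-rpool-to-sc-rpool -r rpool root@remote:rpool/backup/zfs/ga/rpool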

One doubt remains: what if Sanoid prunes a local snap that was used for local-to-local replication (say Syncoid chose a _frequently snap as the common point)? Then the next replication would complain. I may just have to, as you said, write some script that removes all the “foreign” snaps.
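
As a quick sanity check for the --no-sync-snap legs, I could at least confirm that source and copy still share a recent snapshot before each run (dataset names here are just examples):

# incremental replication needs at least one snapshot name in common
zfs list -H -t snapshot -o name -S creation -d 1 rpool/images | head -n 5
zfs list -H -t snapshot -o name -S creation -d 1 tank/backup/zfs/local/rpool/images | head -n 5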

On second thought, that won’t fix all my issues. I also have the problem that Home rpool → Remote rpool sync snaps exist on my Home tank dataset as well.

In summary, extra snaps exist on:

  • Home tank (Snap is from Home rpool → Remote rpool)*
  • Remote rpool (Snap is from Home rpool → Home tank)**

May just be best to make that script that iterates through each dataset and removes old snaps.

*:

tank/backup/zfs/local/rpool/images@syncoid_rpool-to-sc_cleteServer_2024-12-27:06:34:29-GMT-05:00                       0B      -  6.81G  -
tank/backup/zfs/local/rpool/images@syncoid_rpool-to-sc_cleteServer_2024-12-28:18:50:42-GMT-05:00                       0B      -  6.81G  -
tank/backup/zfs/local/rpool/images@syncoid_rpool-to-sc_cleteServer_2024-12-29:08:00:38-GMT-05:00                       0B      -  6.81G  -
tank/backup/zfs/local/rpool/images@syncoid_rpool-to-sc_cleteServer_2024-12-30:05:49:32-GMT-05:00                       0B      -  6.81G  -
tank/backup/zfs/local/rpool/images@syncoid_rpool-to-sc_cleteServer_2024-12-31:05:57:17-GMT-05:00                       0B      -  6.81G  -
tank/backup/zfs/local/rpool/images@syncoid_rpool-to-sc_cleteServer_2025-01-01:04:23:23-GMT-05:00                       0B      -  6.81G  -
tank/backup/zfs/local/rpool/images@syncoid_rpool-to-sc_cleteServer_2025-01-02:04:22:36-GMT-05:00                       0B      -  6.81G  -

**:

rpool/backup/zfs/ga/rpool/images@syncoid_rpool-to-tank_cleteServer_2024-12-29:05:28:26-GMT-05:00                                   0B      -  6.81G  -
rpool/backup/zfs/ga/rpool/images@syncoid_rpool-to-tank_cleteServer_2024-12-30:03:37:45-GMT-05:00                                   0B      -  6.81G  -
rpool/backup/zfs/ga/rpool/images@syncoid_rpool-to-tank_cleteServer_2024-12-31:04:05:02-GMT-05:00                                   0B      -  6.81G  -
rpool/backup/zfs/ga/rpool/images@syncoid_rpool-to-tank_cleteServer_2025-01-01:03:05:05-GMT-05:00                                   0B      -  6.81G  -
rpool/backup/zfs/ga/rpool/images@syncoid_rpool-to-tank_cleteServer_2025-01-02:03:05:31-GMT-05:00                                   0B      -  6.81G  -
rpool/backup/zfs/ga/rpool/images@syncoid_rpool-to-tank_cleteServer_2025-01-03:02:54:52-GMT-05:00                                   0B      -  6.81G  -
rpool/backup/zfs/ga/rpool/images@syncoid_rpool-to-tank_cleteServer_2025-01-04:06:35:03-GMT-05:00                                   0B      -  6.81G  -

Essentially, if you want to use sync snaps and you’ve got multiple links in a chain (A–>B–>C, or A–>B + A–>C, doesn’t really matter) then you’re going to have to come up with a way to remove the “foreign” sync snaps from where they should not be.

When a single host is always running the syncoid process, you can end up with both foreign sync snapshots cluttering up the pool and, potentially, with syncoid removing “foreign” snapshots that it shouldn’t be removing, which can in turn break the replication chain in one place or the other.

--host-identifier makes it easier to manage those situations, by allowing you to set an identifier for easy manual removal later, as well as avoiding issues where syncoid removes “foreign” snapshots from places where it shouldn’t.

For example, you might use the manually selected identifier syncsnap1 for the replication from rpool–>tank, and syncsnap2 for the second replication leg from tank–>remote. Then, on remote, you might run something along the lines of:

root@remote:~# for snap in `zfs list -t snap | awk '{print $1}' | grep syncsnap1` ; do echo zfs destroy $snap ; zfs destroy $snap ; done

That would destroy all snapshots whose names include “syncsnap1”, thereby getting rid of all the earlier sync snapshots used to move things from rpool–>tank.

You still have another problem, in your particular situation: since you’re replicating first from rpool–>tank and then from tank–>remote, you essentially can’t use sync snapshots for both sides of it. Only for rpool–>tank.

When you replicate from rpool–>tank, that replication will, of necessity, destroy any sync snapshots created while replicating tank–>remote: those snapshots did not exist on rpool, therefore they will need to be destroyed by the next incoming replication from rpool. See what I mean?
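
You can watch that happen by grouping the sync snaps on tank by their identifier before and after each leg runs, something along the lines of:

# count sync snaps per identifier "family" (strip the trailing timestamp)
zfs list -H -t snapshot -o name -r tank | awk -F@ '/@syncoid_/ {print $2}' | sed 's/_20[0-9][0-9]-.*//' | sort | uniq -c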


That makes sense. The only thing I’m unsure of is that I don’t see --host-identifier, only --identifier. Should I be seeing 2 different options?

I’m going to spend some time testing this out on a VM with fake pools so that I can better understand it fully.
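
(Something like this seems enough for throwaway test pools; file-backed vdevs, paths arbitrary:)

truncate -s 1G /tmp/srcpool.img /tmp/dstpool.img
zpool create srcpool /tmp/srcpool.img
zpool create dstpool /tmp/dstpool.img
zfs create srcpool/data
syncoid -r srcpool dstpool/backup   # then experiment with --identifier and --no-sync-snap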

Sorry, it’s just an argument I don’t personally use pretty much ever, so it’s easy for me to bobble the name off the top of my head. There’s just the one (--identifier).


Thanks as always Jim.

I ended up using this script, once for each pool needing to be cleaned:

#!/bin/bash
# Destroy every snapshot in $POOL whose name matches the old identifier.
POOL="tank"
SNAPSHOTS_TO_DELETE="_ga-rpool-to-sc-rpool_"

for snap in $(zfs list -H -t snap -r -o name "$POOL" | grep "$SNAPSHOTS_TO_DELETE"); do
        echo "Destroying $snap"
        zfs destroy "$snap"
done

I’m redoing my naming conventions, and it was just easier to start from scratch. I think I have a good handle on what exactly is going on now. Thanks again!