Syncoid between 3 devices

The core issue is that hw-backups is not being synced correctly, and I'm not sure how to fix it. There are 3 devices that independently snapshot the dataset: my laptop, my main server, and this backup NAS. The backup NAS does not have sanoid installed; my main server and my laptop do.

I have a feeling this is mainly because I don't understand the sanoid options well enough to resolve this on my own :frowning:

Here's the error that pops up on Proxmox, along with the relevant configs:

# main server (pve/debian)
[pool-01/hw-backups]      
        # pick one or more templates - they're defined (and editable) below. Comma separated, processed in order.
        # in this example, template_demo's daily value overrides template_production's daily value.
        use_template = backup
        recursive = yes

...

[template_backup]
        autoprune = yes
        frequently = 0
        hourly = 30
        daily = 90
        monthly = 12
        yearly = 0

        ### don't take new snapshots - snapshots on backup
        ### datasets are replicated in from source, not
        ### generated locally
        autosnap = no

        ### monitor hourlies and dailies, but don't warn or
        ### crit until they're over 48h old, since replication
        ### is typically daily only
        hourly_warn = 2880
        hourly_crit = 3600
        daily_warn = 48
        daily_crit = 60

Syncoid command from Proxmox to the backup NAS:

/usr/sbin/syncoid --recursive --delete-target-snapshots pool-01 truenas_admin@192.168.100.146:backup/pool-01

Syncoid command from the laptop to the main server/Proxmox (from my NixOS config):

    commands."backup_persist" = {
      source = "rpool/persist";
      target = "hyperboly@192.168.100.130:pool-01/hw-backups/nixon-backup";
      extraArgs = [ "--sshport=2200"
        "--no-privilege-elevation"
        "--delete-target-snapshots"
        "--no-stream"
      ];
      recursive = true;
    };
# laptop config in nix
  services.zfs.trim.enable = true;
  services.sanoid = {
    enable = true;
    datasets = {
      "rpool/persist" = { # using nixOS impermanence so this is the only dataset that needs to be synced
        hourly = 50;
        daily = 15;
        weekly = 3;
        monthly = 1;
      };

      "rpool/persist/videos" = {
        hourly = 0;
        daily = 1;
        weekly = 3;
        monthly = 1;
      };
      "rpool/persist/steam" = {
        hourly = 0;
        daily = 0;
        weekly = 0;
        monthly = 0;
      };
    };
  };

Not after you replicate, there aren’t!

Snapshot chains cannot bifurcate. When you replicate a snapshot, you wipe out any snapshots on the target newer than the one on the source which you’re basing the incremental on.

So, e.g.: suppose A has snapshots A0001 and A0002, while B has A0001 from an earlier replication plus a B0001 it took locally:
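
Here's a rough sketch of the mechanics in plain zfs commands (the dataset name pool/data and the hostname B are just placeholders):

# Before replicating:
#   on A: pool/data@A0001, pool/data@A0002
#   on B: pool/data@A0001, pool/data@B0001   (B0001 was taken locally on B)

# The incremental from A0001 to A0002 can only land on top of A0001, so B
# first has to be rolled back to A0001, and that rollback destroys B0001:
ssh B zfs rollback -r pool/data@A0001
zfs send -i pool/data@A0001 pool/data@A0002 | ssh B zfs receive pool/data

# (zfs receive -F can perform that rollback for you; either way, B0001 is gone.)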

What you can do is delete A0001 on either or both sides, now that both servers have A0002, which can serve as a basis for further incremental replication, with or without earlier snapshots. What you cannot do is keep B0001 and still receive A0002.

That’s not a syncoid problem, that’s a “that is literally not how OpenZFS replication does or can work” problem. Sorry!

I'm considering a solution where point A sends to points B and C independently. Is this a good enough approach? It seems that A->B->C is not easily achievable.

A->B->C is easily achievable, although you need to be aware that B then becomes a single point of failure: if it fails, C stops getting backups as well.
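
If you do go with the fan-out you're describing, the laptop just gets a second syncoid job pointing straight at the NAS. In your NixOS config that could look roughly like this; the job name and the NAS-side target dataset (backup/nixon-backup) are made up, so adjust them to whatever you actually want to receive into:

    commands."backup_persist_nas" = {
      source = "rpool/persist";
      target = "truenas_admin@192.168.100.146:backup/nixon-backup";
      extraArgs = [ "--no-privilege-elevation" "--delete-target-snapshots" ];
      recursive = true;
    };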

When you’ve got more than one replication partner springing from a single source, whether it’s B<–A–>C or A–>B–>C, the tricky part winds up being the sync snapshots. Those are handy for extreme newbies with just two boxes who don’t really understand how it all works, but once you’re trying to do a three box setup, you need to get a bit more skilled up.

You will most likely want to use --no-sync-snap with syncoid, to keep it from bothering to try to create snapshots at synchronization time. At a minimum you want to use those only from A–>B, not from B–>C, if you’re still doing an A–>B–>C topology. Because, as we covered already, any snapshots you take on B will disappear the next time you get a replication from A!
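
Concretely, that means adding --no-sync-snap to (at least) the hop from the main server to the NAS, so your existing Proxmox-side command would become something like:

/usr/sbin/syncoid --recursive --no-sync-snap --delete-target-snapshots pool-01 truenas_admin@192.168.100.146:backup/pool-01

If you let sanoid handle all snapshot creation on the laptop as well, you can add --no-sync-snap to its extraArgs too.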

Instead, you need to set sanoid up on all three boxes, with appropriate configs. At the simplest level, you use the production template on any datasets on A; for B and C, you use the backup template if you're expecting replication once every 24 hours, or the hotspare template if you expect replication to happen every hour.
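
For the NAS, which doesn't have sanoid installed yet, that means installing it and giving it a config of its own. A minimal sketch, assuming the replicated data keeps landing under backup/pool-01:

# /etc/sanoid/sanoid.conf on the backup NAS
[backup/pool-01]
        use_template = backup
        recursive = yes

### plus the same [template_backup] block you already use on the
### main server (autosnap = no, autoprune = yes, retention to taste)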

Now at this point, you need to start thinking about monitoring. How will you make sure that your replication is working properly? Usually, by running sanoid --monitor-snapshots on each box locally. That will make sure that all monitored datasets (as configured in /etc/sanoid/sanoid.conf) have snapshots as recent as they should, according to the policy you've defined.

While you’re at it, run sanoid --monitor-health on all three boxes as well: it’ll tell you whether your pool has a bad disk in it, or similar lower level issues.

Once you get tired of shelling in regularly to all three boxes to run those commands locally (or realize you'll forget to), you can look at either setting up Nagios (the pro version of monitoring) or simply using a service like healthchecks.io to verify that your --monitor-snapshots and --monitor-health jobs returned an OK status.
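
With healthchecks.io, for example, that can be as simple as a cron job on each box that only pings your check when both sanoid monitors come back clean. The ping URL below is a placeholder for your own check's UUID, and the sanoid path may differ on your systems:

0 * * * * /usr/sbin/sanoid --monitor-health && /usr/sbin/sanoid --monitor-snapshots && curl -fsS -m 10 --retry 3 https://hc-ping.com/your-check-uuid > /dev/null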

Does that answer your questions?