I’ve been using syncoid in combination with systemd timers to back up my ZFS pools across three servers. For a handful of datasets and pools this is fine, but once I started having different pools per server and replicating Proxmox container datasets (which are automatically named by Proxmox), the setup is becoming a bit of a nightmare.
For the non-Proxmox datasets, I create syncoid-{push/pull}-{source}-{target}@.{timer,service} files. The more pools I have, the more files I have to manage. Then, to enable a replication job, I run systemctl enable syncoid-...@{dataset}.timer.
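To make that concrete, a template pair along these lines is what I end up writing for each source/target combination (every hostname, pool name, and path below is a placeholder, and the options are trimmed down):

# /etc/systemd/system/syncoid-push-tank-backuphost@.service (illustrative)
[Unit]
Description=Replicate tank/%i to backuphost
Requires=zfs.target
After=zfs.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/syncoid tank/%i backuphost:backup/tank/%i

# /etc/systemd/system/syncoid-push-tank-backuphost@.timer (illustrative)
[Unit]
Description=Daily replication of tank/%i

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target

The dataset name rides along as the template instance (%i), so every new job means one more enabled timer.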
The Proxmox datasets include only the VM/container ID in the dataset path (e.g., pool/proxmox/subvol-109-disk-0), so I have to manually craft a .service file to include the container hostname and other meaningful bits in the dataset name on the replication side.
The overhead of keeping these files tidy, making sure they’re reloaded, and tracking which datasets have replication enabled is getting tedious.
I really like how /etc/sanoid/sanoid.conf is all I need to maintain to manage snapshots. But for syncoid, I have to manage several different files and manually enable timers. I’m really thinking about writing my own tool to solve this, but before I go and do that I’d like to know if someone has already solved this problem. How are people here managing syncoid?
If there have been previous discussions on managing syncoid sources and targets, feel free to link to them.
I personally find systemd to be a bit baroque and over-the-top for my own scheduling needs–I just use crontabs, which are pretty readable and simple even with say ten or twenty discrete replication jobs to run, IMO.
At the point where I got to feeling like “that’s just too damn many lines on a crontab” I would just write a simple shell script, and invoke the shell script from cron instead. But that really only makes sense for jobs that you’re running at the same time (or firing off in series from the same time), not for multiple jobs with their own schedules–because by the time you have to write down the schedules, you really can’t get it any simpler than cron in the first place.
E.g., this crontab I just pulled from a real hotspare server:
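(The lines below are a representative sketch rather than the verbatim file; hostnames, datasets, and schedules are placeholders.)

# m h dom mon dow  command
0 1 * * *   /usr/sbin/syncoid --recursive prod1:tank/vms tank/hotspare/prod1/vms
30 1 * * *  /usr/sbin/syncoid --recursive prod1:tank/db tank/hotspare/prod1/db
0 2 * * *   /usr/sbin/syncoid --recursive prod2:rpool/home tank/hotspare/prod2/home
30 2 * * *  /usr/sbin/syncoid --recursive prod2:rpool/srv tank/hotspare/prod2/srv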
There really just isn’t any tighter way to keep track of scheduled jobs that I know of. Sure, there’s no clickety-click GUI, if that’s a concern… but it doesn’t sound like that’s a concern, given that you’re already writing your own systemd timers.
I think the worst-case scenario I currently have for this is a dedicated disaster recovery server that has something like thirty to fifty different replication jobs in its cron. Which I could agree is enough to start worrying about missing an error… but like I said, I’m not aware of any better way to keep track of the information, applications aside. You have to know what you’re doing and when you’re doing it, right? That’s the only thing in the cron job!
Having thought about it over the last couple days, the cron + syncoid combination is very hard to beat! Not to say there aren’t any limitations, but there’s a lot of value in using simple general purpose tools.
But, now that you mention it, maybe a “clickety-click GUI” would actually be a good thing. Something that can report if snapshots are falling too far behind, show when the last successful sync was, allow manually running a sync, etc. I might have just reinvented TrueNAS, but without the vendor lock-in.
The PVE-zsync wiki page mentions pvesr which seems to be closer to what I want. I don’t need to replicate the containers I use, only the datasets (non-rootfs). But from what I can tell, PVE-zsync and pvesr need both sides to be Proxmox, which doesn’t fit all my use cases.
There’s a long-range project to deliver all that clickety-click stuff, but I can’t in good conscience advise you to wait for that. In the meantime, I recommend what I do myself–which, like I said, is just syncoid run from crontab.
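A sketch of what that looks like (names are made up, and the retry wrapper is just the crude sort of thing you’d knock together in a minute rather than anything syncoid-specific):

#!/bin/sh
# retry-syncoid.sh - crude retry wrapper around a single replication job
# (hostnames, datasets, attempt count, and sleep interval are all examples)
tries=0
until /usr/sbin/syncoid --recursive prod1:tank/vms tank/hotspare/prod1/vms; do
    tries=$((tries + 1))
    if [ "$tries" -ge 5 ]; then
        echo "syncoid still failing after $tries attempts" >&2
        exit 1
    fi
    sleep 300
done

Drop that (or just the bare syncoid command) into the crontab at whatever schedule suits you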
and Bob’s your uncle. (You can also use this with literally ANY command. Welcome to the Unix Philosophy, where we attempt to keep our tools as simple and widely useful as possible!)
You’ll probably want some kind of logging as well; that should be pretty trivial to add however you’d like it. Personally, I don’t care to monitor exit codes from syncoid, because what I monitor is the freshness of snapshots on the target. If the snapshots are there, the replication of them succeeded. If they’re not, it didn’t.
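A minimal sketch of that sort of freshness check, run from cron on the backup box (the dataset name and the 26-hour threshold are examples):

#!/bin/sh
# check-backup-freshness.sh - complain if the newest snapshot on the target is stale
target=tank/hotspare/prod1/vms
newest=$(zfs list -H -t snapshot -d 1 -o name -s creation "$target" | tail -n 1)
if [ -z "$newest" ]; then
    echo "no snapshots at all on $target" >&2
    exit 1
fi
# creation time as a Unix timestamp (-p), compared against now
created=$(zfs get -Hp -o value creation "$newest")
age=$(( $(date +%s) - created ))
if [ "$age" -gt $((26 * 3600)) ]; then
    echo "newest snapshot on $target is $((age / 3600)) hours old" >&2
    exit 1
fi

A non-zero exit with output on stderr is enough for cron’s own mail, or for whatever alerting you already have in place.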
But I’m going to use the retry command instead. I have max retries and exponential backoff in my systemd service, and the retry command lets me replicate that. The latest release even adds jitter, but it isn’t in the Debian repositories yet, so I’ll have to wait for that.
Bonus: the code looks like it only uses POSIX APIs, so it should compile on FreeBSD!
Yeah, it’s the kind of thing that’s easy not to know about if you just… already have the concept of quickly and easily scripting it. It’s sort of like a machinist not bothering to buy pre-made jigs because it’s so quick to make your own that it just doesn’t seem worth bothering.
Except in this case, it’s more like the machinist (me) not looking through their own junk drawer to find a really nice factory-made jig perfect for the job, so.
There is no need to have multiple timer files. Systemd supports running several commands in sequence in a single service file. Here is what I’ve configured on my offsite backup:
# /etc/systemd/system/syncoid.service
[Unit]
Description=Send ZFS snapshots
Requires=local-fs.target
After=local-fs.target
After=sanoid.service
[Service]
Type=oneshot
Environment=TZ=UTC
# Multiple ExecStart= lines are allowed with Type=oneshot and run in sequence
ExecStart=/usr/sbin/syncoid --recursive --exclude-datasets=docker --preserve-recordsize homeserver1:zroot ztank/backup/homeserver1/zroot
ExecStart=/usr/sbin/syncoid --recursive --exclude-datasets=docker --preserve-recordsize homeserver1:zrust ztank/backup/homeserver1/zrust
# healthcheck ping needs to be last so it only fires if every sync above succeeded
ExecStart=/usr/bin/curl -fsS -m 10 --retry 5 -o /dev/null https://hc-ping.com/UUID-REDACTED
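The timer that drives it is just one more small unit, something like this (the schedule and delay are examples, not necessarily what I run):

# /etc/systemd/system/syncoid.timer
[Unit]
Description=Schedule ZFS send

[Timer]
OnCalendar=daily
Persistent=true
RandomizedDelaySec=15m

[Install]
WantedBy=timers.target

Then systemctl enable --now syncoid.timer, and the single service covers every dataset.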
I’m a little sensitive to datasets that take more than a day to sync. I’d rather have independent timers and have my datasets sync independently instead of one waiting for another to finish. Plus, individual services work better with StartLimitBurst and StartLimitIntervalSec.
But your approach works for syncs that don’t take too long. I will take the intermediate step of moving all the small syncs to a single service. The vast majority of my syncs are going to complete in a few minutes, so that’s still a huge management win.
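For the slow ones that stay on their own timers, what I have in mind is a per-dataset unit roughly like this (names and numbers are illustrative, and Restart=on-failure with Type=oneshot needs a reasonably recent systemd):

# /etc/systemd/system/syncoid-pull-bigdataset.service (illustrative)
[Unit]
Description=Pull replication of one slow dataset
# stop retrying if it fails 3 times within 6 hours
StartLimitIntervalSec=6h
StartLimitBurst=3

[Service]
Type=oneshot
Restart=on-failure
RestartSec=10min
ExecStart=/usr/sbin/syncoid --recursive prod1:tank/bigdataset ztank/backup/bigdataset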
Makes sense. I am relying on daily syncs being fast because not much changes from day to day. The initial sync took close to 24 hours on the local LAN before I moved it offsite.
i switched from cron to a systemd timer. systemd does a lot more than cron out of the box, like running a task after it couldn’t run at the specified time (blocked by dependencies or offline) and it also prevents running a task twice simultaneously, which you’d otherwise use flock for. it also has built-in logging (no “tee” or “cat” needed), allows human-readable schedules like “weekly”, and easily hooks into your other systemd units like requiring ZFS to be active (prevents issues) or triggering a failure unit like my global ntfy/healthchecks.io units.
as for the actual task, i just use the recursive flag on my two pools and have “sync:no” attributes (might be named differently, idk right now) on all datasets i don’t want to sync.
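roughly, the pair looks like this (unit names, the schedule, and the failure hook are placeholders, and double-check the exact exclusion property in the syncoid docs, i think it’s syncoid:sync):

# /etc/systemd/system/syncoid-pools.timer
[Unit]
Description=Weekly pool replication

[Timer]
OnCalendar=weekly
# run after boot if the scheduled time was missed
Persistent=true

[Install]
WantedBy=timers.target

# /etc/systemd/system/syncoid-pools.service
[Unit]
Description=Replicate both pools
Requires=zfs.target
After=zfs.target
# hypothetical global ntfy/healthchecks hook
OnFailure=notify-failure@%n.service

[Service]
Type=oneshot
ExecStart=/usr/sbin/syncoid --recursive tank backuphost:backup/tank
ExecStart=/usr/sbin/syncoid --recursive rpool backuphost:backup/rpool

# datasets get skipped per dataset with a zfs property, e.g.:
#   zfs set syncoid:sync=false tank/scratch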
works nicely and i can recommend it, but of course it requires learning more systemd than just learning crontab.
Having only two jobs is the reason you’re happier with systemd timers. Too many jobs, making them a pain in the ass to see and manage, is why the first person wasn’t happy with systemd timers.