Syncoid with no common snapshots

I have sanoid set up on a main server and a backup server using basically the recommended settings. I have a “production” template on my main server, which keeps a bunch of hourlies and dailies, and then 6 monthlies. My backup retains much more.
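
For reference, the relevant parts of my config look roughly like this (dataset names changed, and the hourly/daily counts are just approximate):

    # main server sanoid.conf (dataset name is a placeholder)
    [tank/data]
            use_template = production
            recursive = yes

    [template_production]
            hourly = 36
            daily = 30
            monthly = 6
            autosnap = yes
            autoprune = yes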

I was having problems with syncoid, so it didn’t run for about a month, but sanoid kept running. I do not know why, but sanoid snapped 6 monthlies in the space of about 5 weeks. Does it snap a monthly when the server restarts or something?

The result is that when I went to run syncoid again, there were no common snapshots anymore. The latest dailies and hourlies on my production side were obviously too new to have ever been replicated to the backup, but the problem is that since it snapped 6 monthlies in close succession, sanoid pruned all the monthlies that would have been in common with my backup.

I guess the only thing to do is to destroy the dataset and all the snapshots on the backup and start over?

I guess it is another question, but how should I make sure that sanoid only snaps monthlies if the previous monthly is more than a month old?

It is probably too late for your dataset now without starting over.

If you use a cryptographically secure hash algorithm, you can use nopwrite to send/receive just incremental blocks; it is sort of like deduplication at the receive level. However, this requires a nondefault hash algorithm.
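
Something along these lines on the receiving dataset, as I understand it (dataset name is a placeholder; nopwrite then happens automatically for matching blocks):

    # nopwrite relies on a cryptographically strong checksum (and, as I
    # understand it, compression enabled) on the dataset being written to
    zfs set checksum=sha256 backuppool/data
    zfs set compression=lz4 backuppool/data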

Also, you can allow syncoid to create bookmarks, which allows for incremental send/receive without the sending side having the snapshot (if you think about it, the sending side does not need to have a copy of the common data; it only needs to know what is incremental after that snapshot; that is what a bookmark keeps track of).
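
Roughly, the manual version looks like this; if I recall correctly, syncoid can also do it for you with its --create-bookmark option (names below are placeholders):

    # create a bookmark from the last replicated snapshot; the snapshot itself
    # can then be destroyed without breaking future incrementals
    zfs bookmark tank/data@syncoid_last tank/data#syncoid_last
    zfs destroy tank/data@syncoid_last

    # later, send an incremental from the bookmark to a newer snapshot
    # (the receiving side still needs the snapshot the bookmark points at)
    zfs send -i 'tank/data#syncoid_last' tank/data@newer | \
        ssh backuphost zfs receive backuppool/data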

I do not know why, but sanoid snapped 6 monthlies in the space of about 5 weeks. Does it snap a monthly when the server restarts or something?

No, but if a system is under a significant amount of storage load, it can end up taking multiples of any given snapshot period. There’s a very nasty race condition implicit in “access the heavily loaded storage in order to find out whether we need to take a snapshot of the heavily loaded storage”, and since Sanoid runs from cron rather than as a daemon, it has limited methods for inter-process communication. So when presented with that nasty race condition, Sanoid opts for the safer gamble (take a snapshot, even though it might mean more than one is taken) rather than the IMO riskier one, which could end up with snapshots not being taken when they should be.
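
(For context, a typical cron setup is just a line like the following; the exact path and interval depend on your distro’s packaging.)

    # run sanoid's take/prune pass every few minutes; the "do we need a
    # snapshot?" check itself has to touch the loaded pool, which is where
    # the race comes from
    */15 * * * * root /usr/sbin/sanoid --cron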

since it snapped 6 monthlies in close succession, sanoid pruned all the monthlies that would have been in common with my backup.

Nope, this is an understandable error, but a conceptual error nonetheless. Sanoid only prunes a snapshot if BOTH of the following conditions are true:

  • the snapshot is older than its period’s retention policy dictates
  • there are more than n snapshots of that periodicity, where n is the retention count the policy dictates

So, if you’ve set monthlies=6 and have a weird thing happen and take 100 monthly snapshots in the space of ten minutes… you will have >100 monthly snapshots until six months later, when they will all be pruned at roughly the same time, due to being more than six months old (and to there being more than six total monthly snapshots on the system).

I guess the only thing to do is to destroy the dataset and all the snapshots on the backup and start over?

Afraid so. You can’t incrementally replicate without a common snapshot.
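
In practice that looks something like this (dataset and host names are placeholders):

    # on the backup host: remove the stale dataset and all of its snapshots
    zfs destroy -r backuppool/data

    # on the production host: re-run syncoid, which falls back to a full
    # replication because the target dataset no longer exists
    syncoid tank/data root@backuphost:backuppool/data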

I guess it is another question, but how should I make sure that sanoid only snaps monthlies if the previous monthly is more than a month old?

Again, that doesn’t have any real bearing here. But the better question to answer is “how do I make sure this doesn’t happen again?” and there are a couple of answers.

First: stop using --no-sync-snap, especially if you’re only replicating to a single target (and thus don’t need to worry about “foreign” sync snapshots getting replicated into a target where they’ll never be automatically destroyed). That’s the whole point of the default sync snapshot: half of it is “make sure this replication gets ALL of the data currently on that dataset,” and the other half is “by both creating and managing its own sync snapshots, syncoid won’t ever break the replication chain, because there will always be a common snapshot.”
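
Roughly speaking, the difference is just this (host and dataset names are placeholders):

    # default: syncoid creates its own sync snapshot before sending and prunes
    # its older ones afterward, so a common snapshot always survives
    syncoid tank/data root@backuphost:backuppool/data

    # with --no-sync-snap, syncoid only uses snapshots that already exist
    # (sanoid's), which is what lets the chain break if retention outruns you
    syncoid --no-sync-snap tank/data root@backuphost:backuppool/data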

Second: start doing automated monitoring of your backups. This is stupid easy; you literally don’t have to do anything but run “sanoid --monitor-snapshots” on the backup target. I wrap that up in a Nagios check; I’ve seen other people tie it into healthchecks.io. In addition to the informational text that command spits out (which gets hoovered up into Nagios or healthchecks.io notifications), it exits with exit code 0 (OK), 1 (WARN), or 2 (CRIT). Nagios uses this to issue alerts; put that together with a mobile client like aNag (Android) or easyNag (iOS) and you can literally have your phone just vibrate in your pocket any time you’re experiencing a backup issue, same day.
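
If you go the healthchecks.io route instead of Nagios, the wrapper can be as simple as something like this (the ping URL and the sanoid path are placeholders for your own setup):

    #!/bin/sh
    # run sanoid's monitor on the backup target and forward its exit status
    # (0=OK, 1=WARN, 2=CRIT) plus its output to a healthchecks.io check
    OUTPUT=$(/usr/sbin/sanoid --monitor-snapshots)
    STATUS=$?
    curl -fsS -m 10 --retry 3 --data-raw "$OUTPUT" \
        "https://hc-ping.com/YOUR-CHECK-UUID/$STATUS"
    exit $STATUS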

Note that Sanoid running on the backup server with a backup (or hotspare) policy template will already have built-in grace periods for monitoring. They’re user-alterable; you can see them right there in the default config. But essentially, the backup template assumes once-daily replication, and the hotspare assumes once-hourly replication, so if you’re running the backup template you’ll get a WARN if it’s been more than 48 hours since your most recent new Sanoid snapshots replicated in, and a CRIT if it’s been more than 60 hours.

If you’re using the hotspare template, you’ll get WARN after 4h and CRIT after 6h.
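
From memory, the relevant knobs in the stock config look something like this; check the sanoid.conf that shipped with your version for the authoritative names and values:

    [template_backup]
            autosnap = no
            autoprune = yes
            # monitoring grace periods: warn at 48h, crit at 60h since the
            # newest replicated-in snapshot
            hourly_warn = 2880
            hourly_crit = 3600
            daily_warn = 48
            daily_crit = 60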

Either way, once you’ve set up proper automated monitoring, you’ll be in good shape and won’t end up sandbagged by a problem you didn’t know about for months on end.


Merry Christmas, @mercenary_sysadmin!!

Okay, that is good to know. Thanks!

Hmmm, well, that is not what happened. On my “production” side, I had 6 monthlies, all taken within the last 5 weeks or so. Then on my backup, I had 6 monthlies dating back a little over 6 months. So it looks like it was pruning the backup correctly, since it left the most recent 6 monthlies even though one of them was more than 6 months old. But it was not doing that on my production side.

I am replicating to multiple targets, so I do not like this option. What is wrong with using bookmarks as suggested above? I added that to my syncoid jobs since it seems like it is better than sync snaps when replicating to multiple targets.
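
For the record, what I added is basically just the flag (host and dataset names changed):

    # each target job now leaves a bookmark for the newest replicated snapshot,
    # so incrementals stay possible even if that snapshot later gets pruned
    syncoid --create-bookmark tank/data root@offsite:backuppool/data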

I understand this, and will set it up. However, it wasn’t an issue of me not knowing that it wasn’t syncing. Instead, syncoid was triggering the very race condition problem that caused this, so I had purposely stopped running syncoid from cron. Then I had to go out of town for a couple of weeks, and when I came back, I saw this issue. So the problem was not that I didn’t know that no backups were being synced; it was that I assumed that, even without syncing, I would still have at least a few months of common snapshots when I got back. According to you, I should have. But for some reason I did not. Hopefully the bookmarks will solve this from now on.

I went ahead and deleted the backup filesystem and started again from the snapshots I had on production. I was not super worried, because this was my offsite backup and I had plenty of snapshots on my onsite backup. I just wanted to know why. Apparently it was not supposed to do that.

Like I said before, I still have the issue of syncoid in cron crashing my server, but I have not fully pinned that down, so I have not asked about it yet. I will try to track it down and then ask. The problem is that since it crashes my server, I am not really super excited about trying to make it happen again.