I do not know why, but sanoid snapped 6 monthlies in the space of about 5 weeks. Does it snap a monthly when the server restarts or something?
No, if a system is under a significant amount of storage load, it can end up taking multiples of any given snapshot period. There’s a very nasty race condition implicit in “access the heavily loaded storage in order to find out whether we need to take a snapshot of the heavily loaded storage”, and since Sanoid runs from cron rather than as a daemon, it has limited methods for inter-process communication. So when presented with that nasty race condition, Sanoid opts for the safer gamble (take a snapshot, even though it might mean more than one gets taken) rather than the IMO riskier one, which could end up with snapshots not being taken when they should be.
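For context, Sanoid’s snapshot pass is typically fired by a short cron entry rather than by a long-running daemon. A minimal sketch; the path, schedule, and TZ setting here are assumptions, and many distro packages use a systemd timer instead:

```
# Take and prune snapshots every minute; sanoid itself decides whether
# a new snapshot of any given period is actually due on each pass.
* * * * *   root   TZ=UTC /usr/sbin/sanoid --cron
```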
Since it snapped 6 monthlies in close succession, sanoid pruned all the monthlies that would have been in common with my backup.
Nope, this is an understandable error, but a conceptual error nonetheless. Sanoid only prunes a snapshot if BOTH of the following conditions are true:
- the snapshot is older than the retention period its policy dictates
- there are more than n snapshots of that periodicity, where n is the retention count the policy dictates
So, if you’ve set monthlies=6 and have a weird thing happen and take 100 monthly snapshots in the space of ten minutes… you will have >100 monthly snapshots until six months later, when they will all be pruned at the same time, being both more than six months old and well over the six-snapshot retention count.
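For reference, those retention counts come straight from the policy template in sanoid.conf. A rough sketch, with placeholder dataset and template names; recent sample configs spell the keys in the singular (monthly rather than monthlies), and the exact key names and defaults should be checked against the sanoid.defaults.conf shipped with your version:

```
[tank/data]
        use_template = production

[template_production]
        frequently = 0
        hourly = 36
        daily = 30
        monthly = 6
        yearly = 0
        autosnap = yes
        autoprune = yes
```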
I guess the only thing to do is to destroy the dataset and all the snapshots on the backup and start over?
Afraid so. You can’t incrementally replicate without a common snapshot.
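Starting over looks roughly like this; pool and dataset names are placeholders, and you should triple-check what -r is going to destroy before you run it:

```
# On the backup box: destroy the stale copy along with all of its
# snapshots and children. Destructive - verify the dataset name first.
zfs destroy -r backuppool/mydataset

# Re-run replication from the source. With no existing target dataset
# there is no common snapshot to find, so syncoid does a full send to
# re-seed the backup, after which incrementals work again.
syncoid pool/mydataset root@backupserver:backuppool/mydataset
```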
I guess it is another question, but how should I make sure that sanoid only snaps monthlies if the previous monthly is more than a month old?
Again, that doesn’t have any real bearing here. But the better question to answer is “how do I make sure this doesn’t happen again?” and there are a couple of answers.
First: stop using --no-sync-snap, especially if you’re only replicating to a single target (and thus don’t need to worry about “foreign” sync snapshots getting replicated into a target where they’ll never be automatically destroyed). The default sync snapshot serves two purposes: half of it is “make sure this replication gets ALL of the data currently on that dataset,” and the other half is “by both creating and managing its own sync snapshots, syncoid won’t ever break the replication chain, because there will always be a common snapshot.”
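Concretely, that just means running syncoid in its default mode and dropping the flag; the hostnames and dataset names below are placeholders:

```
# Default behavior: syncoid takes its own sync snapshot, replicates,
# and manages its old sync snapshots on both ends itself.
syncoid pool/data root@backupserver:backuppool/data

# The flag to avoid in a single-target setup:
# syncoid --no-sync-snap pool/data root@backupserver:backuppool/data
```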
Second: start doing automated monitoring of your backups. This is stupid easy; you literally don’t have to do anything but run “sanoid --monitor-snapshots” on the backup target. I wrap that up in a Nagios check; I’ve seen other people tie it into healthchecks.io. In addition to the informational text that command spits out (which gets hoovered up into Nagios or healthchecks.io notifications), it exits with exit code 0 (OK), 1 (WARN), or 2 (CRIT). Nagios uses this to issue alerts; put that together with a mobile client like aNag (Android) or easyNag (iOS) and you can literally have your phone just vibrate in your pocket any time you’re experiencing a backup issue, same day.
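If healthchecks.io is more your speed than Nagios, a minimal wrapper sketch looks like this; the sanoid path and the ping UUID are placeholders you’d adjust for your own setup:

```
#!/bin/sh
# Check snapshot freshness on the backup target and report the result.
# Exit codes, per above: 0 = OK, 1 = WARN, 2 = CRIT.
OUTPUT=$(/usr/sbin/sanoid --monitor-snapshots)
STATUS=$?

if [ "$STATUS" -eq 0 ]; then
    # Healthy: ping the check's normal URL, attaching sanoid's output.
    curl -fsS -m 10 --retry 3 --data-raw "$OUTPUT" https://hc-ping.com/YOUR-CHECK-UUID
else
    # WARN or CRIT: hit the /fail endpoint so healthchecks.io alerts you.
    curl -fsS -m 10 --retry 3 --data-raw "$OUTPUT" https://hc-ping.com/YOUR-CHECK-UUID/fail
fi
```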
Note that Sanoid running on the backup server with a backup (or hotspare) policy template will already have built-in grace periods for monitoring. They’re user-alterable; you can see them right there in the default config. But essentially, the backup template assumes once-daily replication and the hotspare template assumes once-hourly replication, so if you’re running the backup template you’ll get a WARN if it’s been more than 48 hours since your most recent new Sanoid snapshots replicated in, and a CRIT if it’s been more than 60.
If you’re using the hotspare template, you’ll get WARN after 4h and CRIT after 6h.
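For reference, those grace periods look roughly like this in the stock backup and hotspare templates. Key names and most values here are from memory and should be checked against the sanoid.conf / sanoid.defaults.conf shipped with your version, but the warn/crit thresholds match the numbers above:

```
[template_backup]
        autosnap = no
        autoprune = yes
        hourly = 30
        daily = 90
        monthly = 12
        ### once-daily replication assumed: WARN past 48h, CRIT past 60h
        daily_warn = 48
        daily_crit = 60

[template_hotspare]
        autosnap = no
        autoprune = yes
        hourly = 30
        daily = 90
        monthly = 3
        ### once-hourly replication assumed: WARN past 4h, CRIT past 6h
        hourly_warn = 4h
        hourly_crit = 6h
```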
Either way, once you’ve set up proper automated monitoring, you’ll be in good shape and won’t end up sandbagged by a problem you didn’t know about for months on end.