ZFS Bookmarks - I don't get it

I’ve read “Curious as to the utility of ZFS bookmarks” and some other stuff on the internet, and something isn’t making it through to my brain here.

So you have a source and a destination.

  • On Monday, on the source you make a pool called tank, and create a file called ONE
    You snapshot it, and bookmark it, and send the snapshot to the destination
  • On Tuesday, you repeat the process
  • On Wednesday, you create a file called TWO, snapshot and bookmark on the source only
  • On Thursday, you delete TWO and create THREE, snapshot and bookmark on the source only
  • On Friday, delete THREE, snapshot and bookmark on the source, then delete the Monday, Tuesday and Wednesday snapshots.
At the end of the week, that leaves us with:

Day   Src snap   Src bookmark   Dest snap   Dest bookmark
Mon   -          x              x           -
Tue   -          x              x           -
Wed   -          x              -           -
Thu   x          x              -           -
Fri   x          x              -           -
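
Concretely, the Monday round of that would be something like the following (dataset, snapshot, and host names are made up, and the first send is a full one):

  # Monday: snapshot the dataset, bookmark the snapshot, send it over
  zfs snapshot tank@mon
  zfs bookmark tank@mon tank#mon
  zfs send tank@mon | ssh dest zfs receive backup/tank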

So now it’s the weekend and we want to catch the destination up to the source.
They have no snapshot in common. Apparently, you do it with the bookmarks.

The big question … How the heck can this work? TWO and THREE no longer exist on the source. There’s no snapshot holding that data. I don’t get how they can be resurrected. Do we end up with the Mon, Tue, and Fri snapshots and just file ONE on the dest, with whatever other changes happened in the middle gone?

Smaller question - do I understand correctly that syncoid will use the bookmarks automatically if they exist and are helpful?

As I’ve written this out (thanks for being my rubber duckie) I realize I could just go try it and see, but I’d still like to understand the how a bit better.

The big question … How the heck can this work? TWO and THREE no longer exist on the source. There’s no snapshot holding that data. I don’t get how they can be resurrected.

They can’t.

ZFS uses transaction groups (txgs) when doing its COW magic. Think of it as a log of what happened: “at txg 12345, we replaced block 2000 with new data; that data is stored in block 3000”.

A snapshot records the txg at which it was created, and also makes sure that none of the blocks it references get overwritten. In the above example, it would make sure that block 2000 cannot be reallocated, even though the live filesystem no longer uses the data in there.

A bookmark, on the other hand, stores only the txg and a name; that’s why bookmarks are lightweight. They don’t prevent data from being overwritten, so they take up almost no space at all.
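
You can see this from the shell: a bookmark is listed with little more than its name and the txg it refers to (the dataset name below is just an example):

  # list snapshots and bookmarks together with their creation txg
  zfs list -r -t snapshot,bookmark -o name,createtxg zroot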

You can only create a bookmark from an existing snapshot, and you can only create an incremental send stream with a bookmark as the start and a snapshot as the end, not between two bookmarks. In other words, zfs send -i zroot#foo zroot@bar is valid; zfs send -i zroot#foo zroot#bar isn’t.
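
Put together, a minimal round trip looks something like this (dataset names are placeholders, and the incremental receive at the end only works because the destination already got @foo in the full send):

  # full send of @foo, then replace the snapshot with a bookmark
  zfs snapshot zroot/data@foo
  zfs send zroot/data@foo | ssh dest zfs receive backup/data
  zfs bookmark zroot/data@foo zroot/data#foo
  zfs destroy zroot/data@foo
  # later: incremental from the bookmark (start) to a new snapshot (end)
  zfs snapshot zroot/data@bar
  zfs send -i zroot/data#foo zroot/data@bar | ssh dest zfs receive backup/data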

Since the “target” of an incremental send is always a snapshot, we can be sure that all blocks referenced by that still exist.

So if snapshot foo has createtxg 12345 and snapshot bar has createtxg 23456, all that’s needed is that everything from txg 12345 to txg 23456, and the contents of all the blocks those txgs wrote to, is saved into the stream. We actually don’t need any of the blocks as they were when foo was created; we only need all the blocks in bar to still exist when we create the incremental stream.
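
If you want to look at those numbers yourself, createtxg is a readable property on both snapshots and bookmarks (names are again placeholders):

  # print the raw creation txg of the incremental source and target
  zfs get -Hp -o value createtxg zroot/data#foo
  zfs get -Hp -o value createtxg zroot/data@bar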

AIUI, if you were to hack zfs send so that it would (incorrectly) accept a bookmark as the target, it would technically create a stream, but that stream might contain garbage. Just imagine that in our example txg 13000 wrote something to block 9001; the stream would then include whatever data block 9001 currently holds. With a snapshot as the target, we can be sure that 9001 still contains the correct data; with a bookmark, a later txg might have reused 9001 and stored some other data in there.

tl;dr: bookmarks can only be used as a “source”, and they are lightweight because they don’t keep any blocks from being freed and overwritten.

Step one: take snapshot @0 on source
Step two: zfs send @0 to the target (full replication)
Step three: take snapshot @1 on source
Step four: zfs send -I @0 @1 to the target (incremental replication)
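
Spelled out as commands, the four steps are roughly this (pool and dataset names are made up, and piping through ssh is just one way to move the stream):

  # steps one and two: initial snapshot and full replication
  zfs snapshot mypool/mydata@0
  zfs send mypool/mydata@0 | ssh target zfs receive backuppool/mydata
  # steps three and four: next snapshot and incremental replication
  zfs snapshot mypool/mydata@1
  zfs send -I mypool/mydata@0 mypool/mydata@1 | ssh target zfs receive backuppool/mydata
  # (-I also carries any snapshots in between; -i would send just the single delta)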

Now let’s talk about that -I. What we’re telling ZFS is “the target already has @0, so just send the blocks present in @1 that weren’t present in @0.” And, crucially, the way it knows which blocks those are is that TXGs are sequential: it sends every block born after the txg of @0 and at or before the txg of @1.

Okay, so now we’ve successfully replicated @1 to the target. Let’s pick back up.

Step five: zfs bookmark mypool/mydata@1 mypool/mydata#1
Step six: destroy snapshot @1, take snapshot @2
Step seven: replicate incrementally from #1 to @2
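
And again as commands (same placeholder names as above):

  # step five: bookmark @1 so the snapshot itself can go away
  zfs bookmark mypool/mydata@1 mypool/mydata#1
  # step six: drop @1 and take @2
  zfs destroy mypool/mydata@1
  zfs snapshot mypool/mydata@2
  # step seven: incremental from the bookmark to the new snapshot
  zfs send -i mypool/mydata#1 mypool/mydata@2 | ssh target zfs receive backuppool/mydata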

Okay, so how did this work? Well, AFTER replicating @1 to the target, we created a local bookmark of it, #1. This bookmark just tells us what TXG @1 was associated with–but it does not preserve all the blocks of @1, as the actual snapshot @1 did.

But that’s okay, because as long as the target still has @1 fully intact, we don’t need the blocks of @1 intact on the source–we just need that txg id! So we’re still saying “send all the blocks contained in @2 that are newer than @1’s TXG.” We’re still depending on the target to have all those blocks intact. We just don’t need the actual blocks on the source–just the TXG associated with the base snapshot that the target will use to incrementally receive.

Deleting the Monday snapshot on the source is fine as long as its bookmark (and the target’s copy of that snapshot) survives; it’s only if you go on and delete the bookmark as well that you are explicitly taking an action which will break the synchronization mechanism.
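
For completeness, that explicit action is just a destroy of the bookmark (placeholder name again); once it is gone, the only way to resynchronize is a fresh full send:

  # throwing away the last common reference point on the source
  zfs destroy mypool/mydata#1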