Corruption in encrypted send bug finally closed!

This morning, Brian Behlendorf closed the long-standing bug reporting occasional corruption when replicating encrypted datasets.

This bug’s final dissection and fix were the result of a coordinated community effort, and I’m proud of our own community’s part in that.

Cheers everybody! :clinking_glasses::tada:

13 Likes

That is fantastic news! Hope it finds its way into a release so it gets proper widespread testing before the freeze of my daily driver distro :partying_face:

1 Like

Sweet! Heard about this on 2.5 Admins and have been catching up on the GitHub comments. Awesome work by all of those involved.

1 Like

This is great news. I hit this error within days of upgrading from FreeNAS to TrueNAS and eventually had to revert to FreeNAS to keep the system reliable. As FreeNAS aged I tried Ubuntu before landing on Debian to get encryption options besides broken ZFS native encryption.

I never imagined it would take four full years to diagnose and fix. Hats off to the people who had the knowledge and will to reproduce the issue, diagnose it, and fix it.

1 Like

And a shoutout to @HankB for the contribution :clinking_beer_mugs:

1 Like

I’ve been meaning to let Mr. Salter know about this. Months ago I pointed out that I mentioned syncoid in the issue and thought he should be aware (and not because I thought syncoid was the problem). I’m happy to see that he already knows about this.

syncoid provoked the bug in my home lab, so I set about producing scripts that anyone could use to provoke corruption, in the hope that someone with sufficient knowledge of ZFS internals could use them to work their way toward a fix. I’ve done enough coding to know that the first step in tackling a bug is being able to reliably reproduce it. I suspect that the lack of this piece was the reason this bug had remained unsolved for so long. And of course, syncoid was an important part of these scripts.

I’m thrilled that my efforts have led to a solution. I’m a huge fan of ZFS and it rankled me every time someone pointed to this issue (aside from the fact that I experienced it myself). I’m really happy that others have been able to use my stuff to fix this issue.

4 Likes

Yes they did, and I made certain to credit you by name when we covered the closing of this bug in the most recent 2.5 Admins episode, 2.5 Admins 248: NASty Pi.

I had a sneaking suspicion that my Dec 22 2024 comment might have led you to your Dec 26 2024 reproducer script. In that comment I mentioned that the one major difference that might exist between syncoid and other replication efforts was the way syncoid manually walks the dataset tree rather than using native recursion. I did mention that on the episode as well, but I gave you the well-deserved lion’s share of the credit because you are correct: nothing was really getting done until you successfully produced a reproducer that functioned in a reasonable amount of time.

Paul Dagnelie and others from Allan Jude’s team at Klara did, from my understanding, a ton of the actual in-kernel dev work, along with upstream devs including but not limited to George Amanakis. But the initial work done by non-kernel developers in our community here at Practical ZFS was crucial to getting this resolved, and I’m damned proud of us for it.

While we’re giving out credit, Richard Yao also created a cool automated check to discover similar bugs which might exist elsewhere in the codebase or be accidentally introduced in the future; and Klara clients fastmail.com, rsync.net, and nber.org donated some of their support budget with Klara to fund those efforts.

As I said on episode 248, this bug really took a community to fix, not just a few elite kernel developers. That makes this a really shining Cinderella story for open source software in general, not to mention the specific, annoying-as-hell, long-standing bug that’s finally resolved.

(plus a little friendly ribbing for the btrfs devs)

(Also, I don’t have to listen to any more shit from disgruntled btrfs devs who enjoyed scapegoating OpenZFS with that bug. I’m typing that with a friendly grin, mind you, btrfs devs… and I’m rooting for y’all, I seriously am. But I’m still typing it, grin and rooting or not. :slight_smile:)

2 Likes