This morning, Brian Behlendorf closed the long-standing bug reporting occasional corruption when replicating encrypted datasets using raw send.
This bug’s final dissection and fix were the result of a coordinated community effort, and I’m proud of our own community’s part in that.
Cheers everybody!
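For anyone who hasn't followed the saga: the affected workflow is raw replication of an encrypted dataset, where `zfs send -w` transmits the blocks still encrypted so the receiving side never needs the key. A minimal sketch of such a backup pipeline (the pool, dataset, and host names here are hypothetical):

```
# Snapshot an encrypted dataset, then replicate it raw (-w) so the
# blocks stay encrypted in transit and on the target pool.
zfs snapshot rpool/crypt/data@backup-2025-05-19
zfs send -w rpool/crypt/data@backup-2025-05-19 | \
    ssh backuphost zfs receive -u tank/backups/data
```

It was this kind of pipeline that would intermittently trigger the corruption reports described in the issue below.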
(Issue opened 8 May 2021, closed 19 May 2025. Labels: Type: Defect, Component: Encryption, Status: Triage Needed.)
### System information
Type | Version/Name
--- | ---
Distribution Name | Debian
Distribution Version | Buster
Linux Kernel | 5.10.0-0.bpo.5-amd64
Architecture | amd64
ZFS Version | 2.0.3-1~bpo10+1
SPL Version | 2.0.3-1~bpo10+1
### Describe the problem you're observing
Since upgrading to 2.0.x and enabling crypto, every week or so, I start to have issues with my zfs send/receive-based backups. Upon investigating, I will see output like this:
```
zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:03:37 with 0 errors on Mon May 3 16:58:33 2021
config:

        NAME         STATE     READ WRITE CKSUM
        rpool        ONLINE       0     0     0
          nvme0n1p7  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0xeb51>:<0x0>
```
Of note, the `<0xeb51>` is sometimes a snapshot name; if I `zfs destroy` the snapshot, it is replaced by this tag.
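Concretely, the destroy-then-scrub sequence referenced in the next paragraph looks roughly like this (a sketch; the snapshot name is hypothetical):

```
# Destroy the snapshot named in the error report, then scrub and
# re-check status to see whether the permanent-error entry clears.
zfs destroy rpool/crypt/somedataset@hourly-2021-05-07
zpool scrub rpool
zpool status -v rpool
```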
Bug #11688 implies that running `zfs destroy` on the snapshot and then a scrub will fix it. For me, it did not. If I run a scrub **without rebooting** after seeing this kind of `zpool status` output, I get the following in very short order, and the scrub (and eventually much of the system) hangs:
```
[393801.328126] VERIFY3(0 == remove_reference(hdr, NULL, tag)) failed (0 == 1)
[393801.328129] PANIC at arc.c:3790:arc_buf_destroy()
[393801.328130] Showing stack for process 363
[393801.328132] CPU: 2 PID: 363 Comm: z_rd_int Tainted: P U OE 5.10.0-0.bpo.5-amd64 #1 Debian 5.10.24-1~bpo10+1
[393801.328133] Hardware name: Dell Inc. XPS 15 7590/0VYV0G, BIOS 1.8.1 07/03/2020
[393801.328134] Call Trace:
[393801.328140] dump_stack+0x6d/0x88
[393801.328149] spl_panic+0xd3/0xfb [spl]
[393801.328153] ? __wake_up_common_lock+0x87/0xc0
[393801.328221] ? zei_add_range+0x130/0x130 [zfs]
[393801.328225] ? __cv_broadcast+0x26/0x30 [spl]
[393801.328275] ? zfs_zevent_post+0x238/0x2a0 [zfs]
[393801.328302] arc_buf_destroy+0xf3/0x100 [zfs]
[393801.328331] arc_read_done+0x24d/0x490 [zfs]
[393801.328388] zio_done+0x43d/0x1020 [zfs]
[393801.328445] ? zio_vdev_io_assess+0x4d/0x240 [zfs]
[393801.328502] zio_execute+0x90/0xf0 [zfs]
[393801.328508] taskq_thread+0x2e7/0x530 [spl]
[393801.328512] ? wake_up_q+0xa0/0xa0
[393801.328569] ? zio_taskq_member.isra.11.constprop.17+0x60/0x60 [zfs]
[393801.328574] ? taskq_thread_spawn+0x50/0x50 [spl]
[393801.328576] kthread+0x116/0x130
[393801.328578] ? kthread_park+0x80/0x80
[393801.328581] ret_from_fork+0x22/0x30
```
However, I want to stress that this backtrace is not the original **cause** of the problem, and it only appears if I do a scrub without first rebooting.
After that panic, the scrub stalled, and a second error appeared:
```
zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Sat May 8 08:11:07 2021
        152G scanned at 132M/s, 1.63M issued at 1.41K/s, 172G total
        0B repaired, 0.00% done, no estimated completion time
config:

        NAME         STATE     READ WRITE CKSUM
        rpool        ONLINE       0     0     0
          nvme0n1p7  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0xeb51>:<0x0>
        rpool/crypt/debian-1/home/jgoerzen/no-backup@[elided]-hourly-2021-05-07_02.17.01--2d:<0x0>
```
I have found that the solution to this issue is to reboot into single-user mode and run a scrub. Sometimes it takes several scrubs, maybe even with some reboots in between, but eventually it will clear up the issue. If I reboot before scrubbing, I do not get the panic or the hung scrub.
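That recovery procedure amounts to something like the following (a sketch, assuming the pool is named rpool; `zpool wait` is available in ZFS 2.x):

```
# After rebooting into single-user mode: scrub, block until the
# scrub finishes, then check whether the error list has cleared.
# Repeat (rebooting in between if needed) until it does.
zpool scrub rpool
zpool wait -t scrub rpool
zpool status -v rpool
```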
I run this same version of ZoL on two other machines, one of which runs this same kernel version. What is unique about this machine?
- It is a laptop
- It uses ZFS crypto (the others use LUKS)
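(A quick way to confirm which datasets use native encryption, sketched with an assumed pool name:)

```
# Datasets using native ZFS crypto report an algorithm such as
# aes-256-gcm for the encryption property; unencrypted ones report "off".
zfs get -r -t filesystem encryption,keyformat rpool
```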
I made a significant effort to rule out hardware issues, including running several memory tests and the built-in Dell diagnostics. I believe I have ruled that out.
### Describe how to reproduce the problem
I can't reproduce it at will. I have to wait for a spell.
### Include any warning/errors/backtraces from the system logs
See above
### Potentially related bugs
- I already mentioned #11688, which seems similar, but a scrub doesn't immediately resolve the issue here
- A quite similar backtrace, also involving `arc_buf_destroy`, is in #11443. The behavior described there has some parallels to what I observe, though I am uncertain from the discussion what that means for this issue.
- In #10697 there are some similar symptoms, but it looks like a different issue to me
That is fantastic news! I hope it finds its way into a release so it gets proper widespread testing before the freeze of my daily-driver distro.
Sweet! I heard about this on 2.5 Admins and have been catching up on the GitHub comments. Awesome work by everyone involved.
rdw (May 22, 2025):
This is great news. I hit this error within days of upgrading from FreeNAS to TrueNAS and eventually had to revert to FreeNAS to keep the system reliable. As FreeNAS aged, I tried Ubuntu before landing on Debian, in order to get encryption options other than the broken ZFS native encryption.
I never imagined it would take four full years to diagnose and fix. Hats off to the people who had the knowledge and will to reproduce, diagnose, and fix the issue.