This morning, Brian Behlendorf closed the long-standing bug reporting occasional corruption when replicating encrypted datasets using raw send.
This bug’s final dissection and fix were the result of a coordinated community effort, and I’m proud of our own community’s part in that.
Cheers everybody!
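For anyone who hasn't followed the saga: the affected workflow is raw replication of an encrypted dataset, where `zfs send -w` transmits the blocks still encrypted so the receiving side never needs the key. A minimal sketch of such a backup pipeline (the pool, dataset, and host names here are hypothetical):

```
# Snapshot an encrypted dataset, then replicate it raw (-w) so the
# blocks stay encrypted in transit and on the target pool.
zfs snapshot rpool/crypt/data@backup-2025-05-19
zfs send -w rpool/crypt/data@backup-2025-05-19 | \
    ssh backuphost zfs receive -u tank/backups/data
```

It was this kind of pipeline that would intermittently trigger the corruption reports described in the issue below.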
(Issue opened 8 May 2021, closed 19 May 2025. Labels: Type: Defect, Component: Encryption, Status: Triage Needed.)
### System information
Type | Version/Name
--- | ---
Distribution Name | Debian
Distribution Version | Buster
Linux Kernel | 5.10.0-0.bpo.5-amd64
Architecture | amd64
ZFS Version | 2.0.3-1~bpo10+1
SPL Version | 2.0.3-1~bpo10+1
### Describe the problem you're observing
Since upgrading to 2.0.x and enabling crypto, every week or so, I start to have issues with my zfs send/receive-based backups. Upon investigating, I will see output like this:
```
zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:03:37 with 0 errors on Mon May 3 16:58:33 2021
config:

        NAME         STATE     READ WRITE CKSUM
        rpool        ONLINE       0     0     0
          nvme0n1p7  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0xeb51>:<0x0>
```
Of note, the `<0xeb51>` is sometimes a snapshot name; if I `zfs destroy` the snapshot, it is replaced by this tag.
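Concretely, the destroy-then-scrub sequence referenced in the next paragraph looks roughly like this (a sketch; the snapshot name is hypothetical):

```
# Destroy the snapshot named in the error report, then scrub and
# re-check status to see whether the permanent-error entry clears.
zfs destroy rpool/crypt/somedataset@hourly-2021-05-07
zpool scrub rpool
zpool status -v rpool
```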
Bug #11688 implies that running `zfs destroy` on the snapshot and then a scrub will fix it. For me, it did not. If I run a scrub **without rebooting** after seeing this kind of `zpool status` output, I get the following in very short order, and the scrub (and eventually much of the system) hangs:
```
[393801.328126] VERIFY3(0 == remove_reference(hdr, NULL, tag)) failed (0 == 1)
[393801.328129] PANIC at arc.c:3790:arc_buf_destroy()
[393801.328130] Showing stack for process 363
[393801.328132] CPU: 2 PID: 363 Comm: z_rd_int Tainted: P U OE 5.10.0-0.bpo.5-amd64 #1 Debian 5.10.24-1~bpo10+1
[393801.328133] Hardware name: Dell Inc. XPS 15 7590/0VYV0G, BIOS 1.8.1 07/03/2020
[393801.328134] Call Trace:
[393801.328140] dump_stack+0x6d/0x88
[393801.328149] spl_panic+0xd3/0xfb [spl]
[393801.328153] ? __wake_up_common_lock+0x87/0xc0
[393801.328221] ? zei_add_range+0x130/0x130 [zfs]
[393801.328225] ? __cv_broadcast+0x26/0x30 [spl]
[393801.328275] ? zfs_zevent_post+0x238/0x2a0 [zfs]
[393801.328302] arc_buf_destroy+0xf3/0x100 [zfs]
[393801.328331] arc_read_done+0x24d/0x490 [zfs]
[393801.328388] zio_done+0x43d/0x1020 [zfs]
[393801.328445] ? zio_vdev_io_assess+0x4d/0x240 [zfs]
[393801.328502] zio_execute+0x90/0xf0 [zfs]
[393801.328508] taskq_thread+0x2e7/0x530 [spl]
[393801.328512] ? wake_up_q+0xa0/0xa0
[393801.328569] ? zio_taskq_member.isra.11.constprop.17+0x60/0x60 [zfs]
[393801.328574] ? taskq_thread_spawn+0x50/0x50 [spl]
[393801.328576] kthread+0x116/0x130
[393801.328578] ? kthread_park+0x80/0x80
[393801.328581] ret_from_fork+0x22/0x30
```
However, I want to stress that this backtrace is not the original **cause** of the problem, and it only appears if I do a scrub without first rebooting.
After that panic, the scrub stalled, and a second error appeared:
```
zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Sat May 8 08:11:07 2021
        152G scanned at 132M/s, 1.63M issued at 1.41K/s, 172G total
        0B repaired, 0.00% done, no estimated completion time
config:

        NAME         STATE     READ WRITE CKSUM
        rpool        ONLINE       0     0     0
          nvme0n1p7  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0xeb51>:<0x0>
        rpool/crypt/debian-1/home/jgoerzen/no-backup@[elided]-hourly-2021-05-07_02.17.01--2d:<0x0>
```
I have found that the solution to this issue is to reboot into single-user mode and run a scrub. Sometimes it takes several scrubs, maybe even with some reboots in between, but eventually it will clear up the issue. If I reboot before scrubbing, I do not get the panic or the hung scrub.
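That recovery procedure amounts to something like the following (a sketch, assuming the pool is named rpool; `zpool wait` is available in ZFS 2.x):

```
# After rebooting into single-user mode: scrub, block until the
# scrub finishes, then check whether the error list has cleared.
# Repeat (rebooting in between if needed) until it does.
zpool scrub rpool
zpool wait -t scrub rpool
zpool status -v rpool
```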
I run this same version of ZoL on two other machines, one of which runs this same kernel version. What is unique about this machine?
- It is a laptop
- It uses ZFS crypto (the others use LUKS)
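(A quick way to confirm which datasets use native encryption, sketched with an assumed pool name:)

```
# Datasets using native ZFS crypto report an algorithm such as
# aes-256-gcm for the encryption property; unencrypted ones report "off".
zfs get -r -t filesystem encryption,keyformat rpool
```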
I made a significant effort to rule out hardware issues, including running several memory tests and the built-in Dell diagnostics. I believe I have ruled that out.
### Describe how to reproduce the problem
I can't reproduce it at will. I have to wait for a spell.
### Include any warning/errors/backtraces from the system logs
See above
### Potentially related bugs
- I already mentioned #11688, which seems similar, but a scrub doesn't immediately resolve the issue here
- A quite similar backtrace, also involving `arc_buf_destroy`, is in #11443. The behavior described there has some parallels to what I observe, though I am uncertain from the discussion what that means for this issue.
- In #10697 there are some similar symptoms, but it looks like a different issue to me
That is fantastic news! I hope it finds its way into a release so it gets proper widespread testing before the freeze of my daily-driver distro.
Sweet! I heard about this on 2.5 Admins and have been catching up on the GitHub comments. Awesome work by everyone involved.
rdw (May 22, 2025):
This is great news. I hit this error within days of upgrading from FreeNAS to TrueNAS and eventually had to revert to FreeNAS to keep the system reliable. As FreeNAS aged, I tried Ubuntu before landing on Debian, in order to get encryption options other than the broken ZFS native encryption.
I never imagined it would take four full years to diagnose and fix. Hats off to the people who had the knowledge and will to reproduce, diagnose, and fix the issue.