Hello,
I use syncoid to raw send encrypted datasets to an off-site backup via a push. Like many others have reported, this causes my source server to kernel panic and leaves corruption in the local snapshots.
I have observed that if I switch to a “pull” mode, i.e. the off-site target initiates the backup, my source server does not crash (it still ends up with corrupted snapshots). However, I don’t like the idea of the off-site server being able to ssh into my source server. I played with setting up an ssh chroot environment and copying in the zfs and syncoid binaries and their shared libraries, but I was nervous about the implications of this. I do think it solved my security concerns.
If I were to do this, are there dangers in the host system updating a shared library while the ssh chroot environment has an older version?
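For context, the jail I was experimenting with was roughly along the lines of the sketch below. The zfs-pull user name and /srv/zfs-jail path are placeholders I made up, not a finished recipe.
```
# /etc/ssh/sshd_config on the source server -- rough sketch, placeholder names
Match User zfs-pull
    ChrootDirectory /srv/zfs-jail     # every path component must be root-owned and not writable by others
    AllowTcpForwarding no
    X11Forwarding no

# Inside /srv/zfs-jail I copied a shell, the zfs (and syncoid) binaries, and the
# shared libraries they link against (found with ldd), and created a /dev/zfs
# node (and possibly /proc) so the chrooted zfs command can reach the kernel module.
```
My guess, which is part of what I’m asking, is that the bigger skew risk is the jailed zfs userland getting out of step with the host’s kernel module after an upgrade rather than the shared libraries per se, so everything in the jail would need refreshing whenever the host’s zfs packages are updated.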
HankB
September 25, 2024, 11:26pm
This is not normal. In your shoes, I would be working on solving this problem. I’m also curious who the “many others” are, as I’ve not heard of this.
As far as security goes, I cannot comment on your chroot strategy. For remote backups I’ve switched to “pull” so it is not possible (that I’m aware of) to manipulate the backups from the host being backed up.
Initially this was using SSH with passwords disabled and openings in my firewall restricted to the remote host IP address. I’m now using Tailscale between two hosts running Debian 12.
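Roughly, that lockdown amounts to something like the sketch below; the address and the nftables table/chain names are placeholders rather than my actual config.
```
# /etc/ssh/sshd_config -- key-only logins
PasswordAuthentication no
KbdInteractiveAuthentication no

# nftables -- only the remote backup host may reach sshd
# (assumes an existing "inet filter" table with an "input" chain)
nft add rule inet filter input ip saddr 203.0.113.10 tcp dport 22 accept
nft add rule inet filter input tcp dport 22 drop
```
With Tailscale in the mix, sshd can instead be limited to the tailnet addresses so nothing is exposed on the public interface at all.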
I did experience hangs a year or two ago that seemed to be related to ZFS, but upgrading to Bookworm resolved that. The hangs interfered with anything that required disk I/O but did not result in a full-blown kernel panic. Over a period of a year or so I experienced this hang several times.
I’d love to solve the root cause but AFAIK this is a known issue with encrypted raw sends. Here are a few of the open issues:
opened 07:45PM - 10 May 22 UTC
Type: Defect
### System information
Type | Version/Name
--- | ---
Distribution Name | Debian
Distribution Version | Bullseye
Kernel Version | 5.10.109+truenas
Architecture | amd64
OpenZFS Version | zfs-2.1.2-95_g1d2cdd23b zfs-kmod-2.1.2-95_g1d2cdd23b
### Describe the problem you're observing
During an incremental receive, ZFS caused a panic and a system hangup.
### Describe how to reproduce the problem
It happens randomly.
### Include any warning/errors/backtraces from the system logs
```
[456678.240841] VERIFY3(0 == dmu_object_set_blocksize(rwa->os, drro->drr_object, drro->drr_blksz, drro->drr_indblkshift, tx)) failed (0 == 95)
[456678.243815] PANIC at dmu_recv.c:1776:receive_object()
[456678.245141] Showing stack for process 2936808
[456678.246532] CPU: 10 PID: 2936808 Comm: receive_writer Tainted: P OE 5.10.109+truenas #1
[456678.247840] Hardware name: Supermicro X9QR7-TF+/X9QRi-F+/X9QR7-TF+/X9QRi-F+, BIOS 3.0b 05/20/2015
[456678.249138] Call Trace:
[456678.250421] dump_stack+0x6b/0x83
[456678.251676] spl_panic+0xd4/0xfc [spl]
[456678.253038] ? arc_buf_access+0x14c/0x250 [zfs]
[456678.254276] ? dnode_hold_impl+0x4e9/0xef0 [zfs]
[456678.255493] ? dnode_set_blksz+0x13b/0x300 [zfs]
[456678.256677] ? dnode_rele_and_unlock+0x5c/0xc0 [zfs]
[456678.257846] receive_object+0xc2c/0xca0 [zfs]
[456678.258984] ? dmu_object_next+0xd6/0x120 [zfs]
[456678.260098] ? receive_writer_thread+0xbd/0xad0 [zfs]
[456678.261160] ? kfree+0x40c/0x480
[456678.262202] ? _cond_resched+0x16/0x40
[456678.263244] receive_writer_thread+0x1cc/0xad0 [zfs]
[456678.264280] ? thread_generic_wrapper+0x62/0x80 [spl]
[456678.265252] ? kfree+0x40c/0x480
[456678.266242] ? receive_process_write_record+0x190/0x190 [zfs]
[456678.267177] ? thread_generic_wrapper+0x6f/0x80 [spl]
[456678.268092] thread_generic_wrapper+0x6f/0x80 [spl]
[456678.268988] ? __thread_exit+0x20/0x20 [spl]
[456678.269864] kthread+0x11b/0x140
[456678.270706] ? __kthread_bind_mask+0x60/0x60
[456678.271538] ret_from_fork+0x22/0x30
```
opened 07:45PM - 02 Mar 21 UTC
Type: Defect
Component: Send/Recv
Component: Encryption
Status: Triage Needed
### System information
Type | Version/Name
--- | ---
Distribution Name | Debian
Distribution Version | 10 buster
Linux Kernel | 4.19.0-14 (SMP Debian 4.19.171-2)
Architecture | amd64
ZFS Version | 2.0.3-1~bpo10+1
SPL Version | 2.0.3-1~bpo10+1
### Describe the problem you're observing
When I start sending raw ZFS snapshots to a different system, my Linux system (4.19.0-14-amd64) starts to hang completely. I can ping it, and I can run a very few commands (such as dmesg), but most commands hang (incl. zfs, zpool, htop, ps, ...). The entire system hangs completely.
Dmesg shows the following entries at the time of the occurrence:
```
[ 2293.134071] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 2293.149707] PGD 0 P4D 0
[ 2293.154752] Oops: 0000 [#1] SMP PTI
[ 2293.161701] CPU: 1 PID: 12576 Comm: receive_writer Tainted: P OE 4.19.0-14-amd64 #1 Debian 4.19.171-2
[ 2293.182517] Hardware name: Supermicro X10SLL-F/X10SLL-F, BIOS 3.0a 12/21/2015
[ 2293.196819] RIP: 0010:abd_verify+0x5/0x60 [zfs]
[ 2293.205865] Code: 0f 1f 44 00 00 0f 1f 44 00 00 8b 07 c1 e8 05 83 e0 01 c3 66 90 0f 1f 44 00 00 8b 07 c1 e8 06 83 e0 01 c3 66 90 0f 1f 44 00 00 <8b> 07 a8 01 74 01 c3 a8 40 74 43 41 54 4c 8d 67 68 55 53 48 8b 47
[ 2293.243325] RSP: 0018:ffffb12e4b6d7a28 EFLAGS: 00010246
[ 2293.253741] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 2293.267974] RDX: 0000000000004000 RSI: 0000000000004000 RDI: 0000000000000000
[ 2293.282205] RBP: 0000000000004000 R08: ffff935ec10b70b0 R09: 0000000000000000
[ 2293.296434] R10: 0000000000007130 R11: ffff935d75f984e0 R12: 0000000000004000
[ 2293.310664] R13: 0000000000000000 R14: ffffffffc0fea550 R15: 0000000000000020
[ 2293.324900] FS: 0000000000000000(0000) GS:ffff935ecfb00000(0000) knlGS:0000000000000000
[ 2293.341053] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2293.352510] CR2: 0000000000000000 CR3: 000000001340a001 CR4: 00000000000606e0
[ 2293.366743] Call Trace:
[ 2293.371704] abd_borrow_buf+0x12/0x40 [zfs]
[ 2293.380104] abd_borrow_buf_copy+0x28/0x70 [zfs]
[ 2293.389377] zio_crypt_copy_dnode_bonus+0x36/0x130 [zfs]
[ 2293.400041] arc_buf_fill+0x3ff/0xb60 [zfs]
[ 2293.408449] ? zfs_btree_add_idx+0xd0/0x200 [zfs]
[ 2293.417889] arc_untransform+0x1c/0x70 [zfs]
[ 2293.426461] dbuf_read_verify_dnode_crypt+0xec/0x160 [zfs]
[ 2293.437466] dbuf_read_impl.constprop.29+0x4ad/0x6b0 [zfs]
[ 2293.448423] ? kmem_cache_alloc+0x167/0x1d0
[ 2293.456776] ? __cv_init+0x3d/0x60 [spl]
[ 2293.464671] ? dbuf_cons+0xa7/0xc0 [zfs]
[ 2293.472497] ? spl_kmem_cache_alloc+0x108/0x7a0 [spl]
[ 2293.482583] ? _cond_resched+0x15/0x30
[ 2293.490071] ? _cond_resched+0x15/0x30
[ 2293.497542] ? mutex_lock+0xe/0x30
[ 2293.504402] ? aggsum_add+0x17a/0x190 [zfs]
[ 2293.512810] dbuf_read+0x1b2/0x520 [zfs]
[ 2293.520672] ? dnode_hold_impl+0x350/0xc20 [zfs]
[ 2293.529904] dmu_bonus_hold_by_dnode+0x126/0x1a0 [zfs]
[ 2293.540186] receive_object+0x403/0xc70 [zfs]
[ 2293.548906] ? receive_freeobjects.isra.10+0x9d/0x120 [zfs]
[ 2293.560049] receive_writer_thread+0x279/0xa00 [zfs]
[ 2293.569962] ? set_curr_task_fair+0x26/0x50
[ 2293.578319] ? receive_process_write_record+0x190/0x190 [zfs]
[ 2293.589793] ? __thread_exit+0x20/0x20 [spl]
[ 2293.598317] ? thread_generic_wrapper+0x6f/0x80 [spl]
[ 2293.608410] ? receive_process_write_record+0x190/0x190 [zfs]
[ 2293.619882] thread_generic_wrapper+0x6f/0x80 [spl]
[ 2293.629609] kthread+0x112/0x130
[ 2293.636053] ? kthread_bind+0x30/0x30
[ 2293.643351] ret_from_fork+0x35/0x40
[ 2293.650473] Modules linked in: ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter veth pci_stub vboxpci(OE) vboxnetadp(OE) vboxnetflt(OE) nf_tables nfnetlink vboxdrv(OE) bridge binfmt_misc zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) intel_rapl x86_pkg_temp_thermal intel_powerclamp ipmi_ssif coretemp kvm_intel kvm irqbypass crct10dif_pclmul ib_iser joydev crc32_pclmul rdma_cm ghash_clmulni_intel iw_cm intel_cstate ib_cm intel_uncore ib_core intel_rapl_perf configfs ipmi_si sg ipmi_devintf iTCO_wdt iTCO_vendor_support pcc_cpufreq intel_pch_thermal iscsi_tcp ipmi_msghandler libiscsi_tcp libiscsi evdev scsi_transport_iscsi pcspkr tun nfsd auth_rpcgss nfs_acl lockd grace sunrpc lm85 dme1737 hwmon_vid iptable_nat ipt_MASQUERADE nf_nat_ipv4 nf_nat nf_conntrack
[ 2293.793008] nf_defrag_ipv6 nf_defrag_ipv4 fuse loop 8021q garp stp mrp llc ecryptfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 raid10 uas usb_storage raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid0 multipath linear hid_generic raid1 usbhid hid md_mod sd_mod ast ahci ttm libahci libata drm_kms_helper drm crc32c_intel igb i2c_i801 dca i2c_algo_bit scsi_mod lpc_ich mfd_core e1000e xhci_pci ehci_pci xhci_hcd ehci_hcd usbcore usb_common thermal fan video button
[ 2293.895677] CR2: 0000000000000000
[ 2293.902280] ---[ end trace 164c64ca87be80af ]---
[ 2294.020926] RIP: 0010:abd_verify+0x5/0x60 [zfs]
[ 2294.029975] Code: 0f 1f 44 00 00 0f 1f 44 00 00 8b 07 c1 e8 05 83 e0 01 c3 66 90 0f 1f 44 00 00 8b 07 c1 e8 06 83 e0 01 c3 66 90 0f 1f 44 00 00 <8b> 07 a8 01 74 01 c3 a8 40 74 43 41 54 4c 8d 67 68 55 53 48 8b 47
[ 2294.067433] RSP: 0018:ffffb12e4b6d7a28 EFLAGS: 00010246
[ 2294.077850] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 2294.092082] RDX: 0000000000004000 RSI: 0000000000004000 RDI: 0000000000000000
[ 2294.106312] RBP: 0000000000004000 R08: ffff935ec10b70b0 R09: 0000000000000000
[ 2294.120542] R10: 0000000000007130 R11: ffff935d75f984e0 R12: 0000000000004000
[ 2294.134774] R13: 0000000000000000 R14: ffffffffc0fea550 R15: 0000000000000020
[ 2294.149006] FS: 0000000000000000(0000) GS:ffff935ecfb00000(0000) knlGS:0000000000000000
[ 2294.165144] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2294.176600] CR2: 0000000000000000 CR3: 000000001340a001 CR4: 00000000000606e0
```
Interestingly, the transfer continues happily but just everything else in the system hangs.
The only way to recover is resetting the machine (since not even reboot works).
### Describe how to reproduce the problem
It's a tough one. It seems to me that the issue might be load-related in some sense, since it only occurs if I have two zfs sends (via syncoid) running in parallel that involve encrypted datasets.
#### Transfer 1
The first one sends datasets from an unencrypted dataset into an encrypted one (I am migrating to encryption).
I use syncoid and use the command:
`syncoid -r --skip-parent --no-sync-snap zpradix1imain/sys/vz zpradix1imain/sys/vz_enc`
This translates into
`zfs send -I 'zpradix1imain/sys/vz/main'@'zfs-auto-snap_hourly-2021-03-02-1917' 'zpradix1imain/sys/vz/main'@'zfs-auto-snap_frequent-2021-03-02-1932' | mbuffer -q -s 128k -m 16M 2>/dev/null | pv -s 16392592 | zfs receive -s -F 'zpradix1imain/sys/vz_enc/main'`
#### Transfer 2
I transfer data from an encrypted dataset raw to a secondary server.
The syncoid command is:
`syncoid -r --skip-parent --no-sync-snap --sendoptions=w --exclude=zfs-auto-snap_hourly --exclude=zfs-auto-snap_frequent zpradix1imain/data root@192.168.200.12:zpzetta/radix/data`
This translates into:
`zfs send -w 'zpradix1imain/data/home'@'vicari-prev' | pv -s 179222507064 | lzop | mbuffer -q -s 128k -m 16M 2>/dev/null | ssh ...`
In summary:
- Both Transfer 1 and Transfer 2 have to be running in parallel
- The issue only appears a few minutes/seconds after I started the transfer
opened 01:30PM - 08 May 21 UTC
Type: Defect
Component: Encryption
Status: Triage Needed
### System information
Type | Version/Name
--- | ---
Distribution Name | Debian
Distribution Version | Buster
Linux Kernel | 5.10.0-0.bpo.5-amd64
Architecture | amd64
ZFS Version | 2.0.3-1~bpo10+1
SPL Version | 2.0.3-1~bpo10+1
### Describe the problem you're observing
Since upgrading to 2.0.x and enabling crypto, every week or so, I start to have issues with my zfs send/receive-based backups. Upon investigating, I will see output like this:
```
zpool status -v
pool: rpool
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub repaired 0B in 00:03:37 with 0 errors on Mon May 3 16:58:33 2021
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
nvme0n1p7 ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
<0xeb51>:<0x0>
```
Of note, the `<0xeb51>` is sometimes a snapshot name; if I `zfs destroy` the snapshot, it is replaced by this tag.
Bug #11688 implies that zfs destroy on the snapshot and then a scrub will fix it. For me, it did not. If I run a scrub **without rebooting** after seeing this kind of `zpool status` output, I get the following in very short order, and the scrub (and eventually much of the system) hangs:
```
[393801.328126] VERIFY3(0 == remove_reference(hdr, NULL, tag)) failed (0 == 1)
[393801.328129] PANIC at arc.c:3790:arc_buf_destroy()
[393801.328130] Showing stack for process 363
[393801.328132] CPU: 2 PID: 363 Comm: z_rd_int Tainted: P U OE 5.10.0-0.bpo.5-amd64 #1 Debian 5.10.24-1~bpo10+1
[393801.328133] Hardware name: Dell Inc. XPS 15 7590/0VYV0G, BIOS 1.8.1 07/03/2020
[393801.328134] Call Trace:
[393801.328140] dump_stack+0x6d/0x88
[393801.328149] spl_panic+0xd3/0xfb [spl]
[393801.328153] ? __wake_up_common_lock+0x87/0xc0
[393801.328221] ? zei_add_range+0x130/0x130 [zfs]
[393801.328225] ? __cv_broadcast+0x26/0x30 [spl]
[393801.328275] ? zfs_zevent_post+0x238/0x2a0 [zfs]
[393801.328302] arc_buf_destroy+0xf3/0x100 [zfs]
[393801.328331] arc_read_done+0x24d/0x490 [zfs]
[393801.328388] zio_done+0x43d/0x1020 [zfs]
[393801.328445] ? zio_vdev_io_assess+0x4d/0x240 [zfs]
[393801.328502] zio_execute+0x90/0xf0 [zfs]
[393801.328508] taskq_thread+0x2e7/0x530 [spl]
[393801.328512] ? wake_up_q+0xa0/0xa0
[393801.328569] ? zio_taskq_member.isra.11.constprop.17+0x60/0x60 [zfs]
[393801.328574] ? taskq_thread_spawn+0x50/0x50 [spl]
[393801.328576] kthread+0x116/0x130
[393801.328578] ? kthread_park+0x80/0x80
[393801.328581] ret_from_fork+0x22/0x30
```
However I want to stress that this backtrace is not the original **cause** of the problem, and it only appears if I do a scrub without first rebooting.
After that panic, the scrub stalled -- and a second error appeared:
```
zpool status -v
pool: rpool
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub in progress since Sat May 8 08:11:07 2021
152G scanned at 132M/s, 1.63M issued at 1.41K/s, 172G total
0B repaired, 0.00% done, no estimated completion time
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
nvme0n1p7 ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
<0xeb51>:<0x0>
rpool/crypt/debian-1/home/jgoerzen/no-backup@[elided]-hourly-2021-05-07_02.17.01--2d:<0x0>
```
I have found the solution to this issue is to reboot into single-user mode and run a scrub. Sometimes it takes several scrubs, maybe even with some reboots in between, but eventually it will clear up the issue. If I reboot before scrubbing, I do not get the panic or the hung scrub.
I run this same version of ZoL on two other machines, one of which runs this same kernel version. What is unique about this machine?
- It is a laptop
- It uses ZFS crypto (the others use LUKS)
I made a significant effort to rule out hardware issues, including running several memory tests and the built-in Dell diagnostics. I believe I have ruled that out.
### Describe how to reproduce the problem
I can't at will. I have to wait for a spell.
### Include any warning/errors/backtraces from the system logs
See above
### Potentially related bugs
- I already mentioned #11688 which seems similar, but a scrub doesn't immediately resolve the issue here
- A quite similar backtrace also involving `arc_buf_destroy` is in #11443. The behavior described there has some parallels to what I observe. I am uncertain from the discussion what that means for this.
- In #10697 there are some similar symptoms, but it looks like a different issue to me
I’ve fully replaced my disks with no change in the behavior. The corruption is only ever in snapshots, and I have both local and offsite backups, so I haven’t been too worried. I’m open to trying fixes, but these issues don’t seem to be limited to my system.
My main concern with “pull” is that if my remote backup system is compromised, it has a tunnel into my home server. With “push” there would be no keys for the offsite system to reach back into my LAN.
I was curious about the dangers (if any) of managing the same zfs pool from both a chroot and the native environment, since a chrooted ssh jail seemed like a good way to wall off the remote.
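For what it’s worth, the pull key itself can also be pinned down in authorized_keys so that even a compromised backup box can only use it from its own address and can’t open tunnels back into the LAN. The address and key below are placeholders:
```
# ~zfs-pull/.ssh/authorized_keys on the source -- placeholder address and key
from="203.0.113.10",no-port-forwarding,no-agent-forwarding,no-X11-forwarding ssh-ed25519 AAAA... offsite-pull
```
That doesn’t stop the key from running zfs commands, of course, so I see it as a complement to the jail rather than a replacement.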
HankB
September 26, 2024, 12:55am
Thanks for the links. I actually commented on that last one.
I don’t use raw sends but I had seen the “permanent errors” in the snapshots. I never experienced a kernel panic but some of the dumps look familiar.
Full pool backups were the only operation that provoked the errors, so I disabled those. A couple of days ago there was a suggestion that 2.2.5 or 2.2.6 had the encryption-related issues resolved. I’ve turned the full backups back on and so far there are no permanent errors. Still too soon to declare victory, though.
The only time I’ve experienced corruption (not related to a failing disk) is when using ZFS on a USB-attached disk. The other thing I experienced with these USB disks was that various ZFS operations would just hang forever. The machine was still online and functioning, though; just the ZFS commands were stuck, so it sounds a little different from your experience.
For doing backups across the internet, I’ve used Zerotier with great success. It’s quite similar to Tailscale, as mentioned by @HankB. This way I can SSH between hosts without having their SSH port exposed to the public internet.
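If it helps, getting a host onto a ZeroTier network is basically just installing the package and joining; the network ID below is a placeholder, and the new member still has to be authorized on the network’s controller.
```
# install script from zerotier.com, then join the private network
curl -s https://install.zerotier.com | sudo bash
sudo zerotier-cli join 0123456789abcdef   # placeholder 16-character network ID
```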