How can I speed up zfs raw send/recv of encrypted datasets?

I’m backing up my zfs datasets to a remote machine with an HDD based zpool (2-HDD mirror). My goal would be to use raw sends, as I use zfs encryption, and I really like the idea of the backup target machine not needing to know the decryption password.
Sadly, such receives are extremely slow, presumably because the receiving side issues mostly random writes and HDDs have low IOPS.
Using non-raw sends/receives is way faster, at times saturating the gigabit link, whereas the raw send/recv runs on the order of 1MB/s.
Is there anything I can do to accelerate raw transfers, apart from switching to SSDs?
Or is there even anything that can be changed in the zfs raw-send/recv implementation that would allow a more sequential workload on the receiving end?
As it currently stands, it is basically unusable. It would be faster to just use restic to back up the complete block device than to send a snapshot.

I need to get used to this Discourse-style discussion, with comments and answers… Comments have a character limit, so this doesn’t fit. And I can’t seem to edit my original post, which is what I would have done on Stack Overflow.

Anyways, here are the details you asked for:
I’m using syncoid. mbuffer and lzop are installed on both sides and are used by syncoid. Source dataset is on NVMe.
All reported transfer rates are what pv displays, that is before lzop compression. That’s why the non-raw sends are sometimes faster than the 1Gbps link between the machines.

Raw send

Command is: syncoid --no-sync-snap --sendoptions=w --recvoptions=u source_dataset target_host:target_dataset.
which, for the incremental part that is slow, expands to

 zfs send -w  -I 'zroot/crypt/arch'@'after-transfer' 'zroot/crypt/arch'@'autosnap_2023-07-14_05:35:59_daily' | pv -p -t -e -r -b -s 93610541312 | lzop  | mbuffer  -q -s 128k -m 16M 2>/dev/null | ssh     -S /tmp/syncoid-lianli-1689359252 lianli ' mbuffer  -q -s 128k -m 16M 2>/dev/null | lzop -dfc | sudo zfs receive -u  -s -F '"'"'big/t460s_arch_tmp'"'"''

The “sending oldest full snapshot” phase runs at a decent speed. Not line speed, but 62MiB/s on average over 17GiB.

The “updating new target filesystem with incremental” phase runs at a fluctuating speed, from ~400KiB/s up to a few MiB/s, with very short bursts reaching 75MiB/s. iostat on the target system shows 100% utilization on both disks in the mirror, which tells me this is IOPS related.

I don’t have the patience to let it finish, with 87GB to transfer at those speeds. I cancelled at 1.34GiB transferred after 3min 10sec (an average of 7MiB/s). Yes, I let an earlier run go for much longer than that; it did not get any faster.
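To rule out the send side and the network, one diagnostic I’d try (my own sketch, not something syncoid does) is to discard the same incremental raw stream locally on the source:

```shell
# Generate the same slow incremental raw stream, but throw it away locally.
# If pv shows high throughput here, the bottleneck is the receiving pool,
# not zfs send, ssh, or the network.
zfs send -w -I 'zroot/crypt/arch'@'after-transfer' \
    'zroot/crypt/arch'@'autosnap_2023-07-14_05:35:59_daily' \
  | pv > /dev/null
```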

Non-Raw send

Command is: syncoid --no-sync-snap --recvoptions=u source_dataset target_host:target_dataset.
Oh by the way, compression (lz4) is also involved, as with a non-raw send the first snapshot size is 31GB, not 17GB.
“Sending oldest full snapshot” ran at an average rate of 114MiB/s (pre-compression, but roughly twice the speed for twice the size of the raw compressed snapshot).
“Updating new target filesystem with incremental” runs at speeds between 25MiB/s and >200MiB/s, sometimes dipping to ~5MiB/s. It is also mostly limited by IOPS on the target, as iostat utilization is again close to 100% on both disks in the receiving mirror.
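For reference, the utilization numbers above come from watching extended iostat output on the target (assuming the sysstat package is installed there):

```shell
# Per-device stats every 5 seconds on the receiving machine.
# %util near 100 combined with low write throughput (wMB/s) and small
# average request sizes points at a seek-bound (IOPS-limited) workload
# rather than a bandwidth limit.
iostat -x 5
```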

perf top

This is during the incremental raw send:

Samples: 3M of event 'cycles', 4000 Hz, Event count (approx.): 9009590329 lost: 0/0 drop: 0/0
Overhead  Shared Object                                   Symbol
  11.90%  [kernel]                                        [k] delay_tsc
   6.21%  [kernel]                                        [k] LZ4_compressCtx
   4.39%  [kernel]                                        [k] iowrite8
   4.08%  [kernel]                                        [k] ioread8
   2.74%  [kernel]                                        [k] native_queued_spin_lock_slowpath.part.0
   2.51%  [kernel]                                        [k] SHA512TransformBlocks
   1.56%  [kernel]                                        [k] asm_exc_nmi
   1.42%  [kernel]                                        [k] memcpy_erms
   1.32%  [kernel]                                        [k] rtl8169_interrupt
   0.78%  [kernel]                                        [k] mutex_lock
   0.75%  [kernel]                                        [k] memset_erms
   0.73%  [kernel]                                        [k] menu_select
   0.69%  [kernel]                                        [k] _raw_spin_lock
   0.62%                                [.] lzo_adler32
   0.62%  [kernel]                                        [k] fletcher_4_sse2_native
   0.60%  [kernel]                                        [k] psi_group_change
   0.60%  [kernel]                                        [k] LZ4_compress64kCtx
   0.55%  [kernel]                                        [k] copy_user_enhanced_fast_string
   0.52%  [kernel]                                        [k] native_write_msr
   0.51%  [kernel]                                        [k] SHA256TransformBlocks
   0.50%  [kernel]                                        [k] kmem_cache_free
   0.47%  [kernel]                                        [k] kmem_cache_alloc
   0.45%  [kernel]                                        [k] mutex_unlock
   0.40%  [kernel]                                        [k] avl_walk
   0.39%  perf                                            [.] dso__find_symbol
   0.39%  perf                                            [.] rb_next
   0.38%  [kernel]                                        [k] __kmalloc_node

So I guess the delay_tsc hints at the CPU waiting for I/O to complete. I’m not sure what LZ4_compressCtx is doing there, as I would assume no compression needs to take place when receiving a raw snapshot that is already lz4-compressed.
If you can get more interesting information out of this perf run, let me know!
But to me it still looks like either too much random writing, or too much reading in between the writes.
Also I can hear the disks, they are definitely busy 100% of the time, and always seeking.
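Is it worth experimenting with the receive-side module parameters OpenZFS exposes on Linux (zfs_recv_queue_length and zfs_recv_write_batch_size)? I can’t say whether they help with a raw stream, but larger write batches could in principle reduce seeking on the target. A sketch, run as root on the receiving machine:

```shell
# Show current receive-side tunables (OpenZFS 2.x on Linux).
cat /sys/module/zfs/parameters/zfs_recv_queue_length
cat /sys/module/zfs/parameters/zfs_recv_write_batch_size

# Experiment: enlarge the receive queue and write batch.
# The values below are arbitrary test values, not recommendations.
echo $((64 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_recv_queue_length
echo $((8 * 1024 * 1024))  > /sys/module/zfs/parameters/zfs_recv_write_batch_size
```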

Have you tried without lzop? For raw encrypted data, I can’t imagine it brings much benefit anyway. Encrypted data is essentially incompressible (not entirely, but it does not perform well under compression).
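With syncoid, that would be its --compress flag (assuming a reasonably recent syncoid version that supports it):

```shell
# Same replication as before, but without lzop on the wire; raw encrypted
# blocks look like random data and will not compress, so this just saves CPU.
syncoid --no-sync-snap --compress=none \
  --sendoptions=w --recvoptions=u \
  source_dataset target_host:target_dataset
```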

Also, to be honest, I would advise against using ZFS native encryption right now (unless you really cannot trust the recipient). There are multiple ongoing bugs (data will sometimes be written unencrypted, snapshots can become corrupted, etc.) that put your data at risk (@rincebrain cataloged these on other platforms as well). ZFS over LUKS is the much more standard and battle-tested way of encrypting at rest. You lose the advantage of sending the data encrypted, but the risk of corruption is not worth it, in my opinion (especially if you are in control of the recipient device).