One final note, about the Redditor who complained about constant disk thrashing with recordsize=1M:
4x4TB HDDs configured in a RAIDZ1
Welp. That’s a single poorly-sized rust vdev. A 4-wide RAIDz1 has three data disks, and you can’t divide 1MiB by three and come up with a whole number of sectors. If you want to torrent to RAIDz (not mirrors), you’ll want even LARGER recordsizes (so each DISK gets 1MiB at a time, ideally), and you’ll want vdev widths that divide the recordsize evenly, so that you’re not wasting IOPS and bandwidth alike on padding.
If that were a 3x4TB HDD RAIDz1, recordsize=1M would mean each drive got a 512KiB write, which is still pretty decent (although recordsize=2M would be better, to bring us back to the 1MiB writes that I get on mirror pools with recordsize=1M).
A 6-wide Z2 with recordsize=4M would also work well, and for the same reason. (Although no single RAIDz vdev is going to compete well with the same number of drives in multiple mirror vdevs.)
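A quick back-of-envelope sketch of the stripe math above (my own illustration, not ZFS code); it just divides the recordsize across the vdev’s data disks and ignores parity rotation, padding, and metadata:

```python
# Back-of-envelope for the stripe math above (my illustration, not ZFS code).
# An n-wide RAIDzP vdev stripes each record across (n - p) data disks.

def per_disk_write_kib(recordsize_kib: float, width: int, parity: int) -> float:
    """KiB of data each data disk receives per record (ignoring parity/padding)."""
    return recordsize_kib / (width - parity)

print(per_disk_write_kib(1024, 4, 1))  # 4-wide Z1: 341.33... KiB -- not whole 4KiB sectors
print(per_disk_write_kib(1024, 3, 1))  # 3-wide Z1: 512.0 KiB per disk, clean
print(per_disk_write_kib(4096, 6, 2))  # 6-wide Z2, rs=4M: 1024.0 KiB (1MiB) per disk
```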
I suspected that the cache should be able to buffer the reading of 16 KB chunks, but since I’m not very familiar with the details of how ARC works, I wasn’t sure. I have no doubt that recordsize=16K would significantly reduce the performance of sequential reads due to fragmentation.
The reported “read amplification” problems are therefore very likely due to something else: possibly a misconfigured RAIDz as you hinted, or maybe even bugs in the BitTorrent client.
I might run a few tests comparing different parameters, it’ll be good practice for me to tinker with ZFS.
Thank you for the very in-depth write-up on read amplification. It has been incredibly helpful.
Would it be possible to also touch on write amplification in similar cases? For example, on a dataset with recordsize=1M, if a torrent were downloading incredibly slowly (let’s say at an odd rate like 37 KB/s), would every 37 KB write trigger a 1M RMW?
My immediate guess would be that this wouldn’t happen, since datasets seem to be able to write in dynamically sized blocks between the ashift (presumably 12, i.e. 4K here) and the recordsize (1M). But I’m not sure if my assumptions are correct in this case. I assume that something like this would also be affected if the download directory were on an NFS share, though maybe setting the share to async at the NFS level with sync at the ZFS level would fix that.
It also seems like this might be a per client tuning thing, as the client’s write coalescing and caching probably comes into play. Would that be the best place to begin fixing such rmw write amplification? Would it also be a bigger issue for SSD mirrors, as opposed to raidz pools of spinning rust?
Yes and no, because undersized blocks can only be created to store files or metadata objects smaller than a single full sized block.
If you save a 16KiB file in a dataset with rs=1M, you get a single undersized block 4 sectors wide (before parity or redundancy, and assuming ashift=12, i.e. a 4K sector size). If you append 16KiB to a 1MiB file in the same dataset, it requires two 1MiB blocks.
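Here’s that sizing as rough arithmetic (assumptions: ashift=12 i.e. 4KiB sectors and recordsize=1M; this is just the math from the paragraph above, not ZFS internals):

```python
import math

SECTOR = 4 * 1024          # ashift=12 -> 4KiB sectors (assumed)
RECORDSIZE = 1024 * 1024   # recordsize=1M

def undersized_block_sectors(file_bytes: int) -> int:
    """Sectors used by a file smaller than recordsize: one undersized block,
    rounded up to whole sectors (before parity or redundancy)."""
    assert file_bytes < RECORDSIZE
    return math.ceil(file_bytes / SECTOR)

def blocks_after_append(file_bytes: int, append_bytes: int) -> int:
    """Full recordsize blocks needed once a file has grown past recordsize."""
    return math.ceil((file_bytes + append_bytes) / RECORDSIZE)

print(undersized_block_sectors(16 * 1024))          # 16KiB file -> 4 sectors
print(blocks_after_append(1024 * 1024, 16 * 1024))  # 1MiB + 16KiB append -> 2 blocks
```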
As long as a file in that dataset remains open, if you append 16KiB of data, it just stays buffered until either you accumulate the rest of that 1MiB, or until the file is closed, or until the application that opened the file forces a sync.
If you close the file or force a sync, the 16KiB gets written out into a new 1MiB block with only 16KiB of data. (The wasted remainder of the block is called “slack” space.) The next time you write new data to that file, the slack block is read in (if not already in cache, which it will be if this is an ongoing operation), the new data is appended, and the result is written out to a new 1MiB block replacing the old one. Unless a snapshot was taken after the first version of the block was written and before the new one could be, the earlier version is immediately unlinked (so no wasted space).
I’m not entirely certain whether 16KiB of dirty data would make it to disk in a slack block on an open file with no forced sync once it’s txg_timeout old (default 5 sec on modern OpenZFS, 120 sec by default on much older versions, adjustable either way), but this really isn’t an issue in real practice, since it means you’re limited to at most one extra slack block written every txg_timeout seconds.
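If dirty data does get flushed every txg_timeout (which, as noted, I’m not certain of), the worst case is easy to bound. A toy model, using a hypothetical steady 37 KB/s writer borrowed from the question above:

```python
# Toy worst-case model (NOT actual ZFS behavior): assume dirty data is flushed
# as a slack block once per txg_timeout while the file stays open.
RECORDSIZE = 1024 * 1024   # recordsize=1M
RATE = 37 * 1024           # hypothetical 37 KB/s writer, per the question above
TXG_TIMEOUT = 5            # seconds, modern OpenZFS default

# Bytes accumulated per txg, and slack-block rewrites before a full block fills
per_txg = RATE * TXG_TIMEOUT
rewrites = RECORDSIZE // per_txg + 1
print(rewrites)  # 6 slack-block rewrites per 1MiB of data, worst case
```

Even in this worst case that’s a handful of extra writes per MiB, and with compression enabled (see below) each of those only occupies the sectors the real data needs.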
Final note: with any kind of compression at all enabled (including ZLE, which compresses nothing but slack space), even those “extra” potential “1M” block writes only occupy the sectors necessary to store the data. So you aren’t writing a new 1MiB out every time the final block of the file gets rewritten, only as many sectors as the actual data part of the slack block requires.
The only part of this I’m not certain of is whether the 16K of dirty data that’s older than txg_timeout will automatically be flushed. It might not be, in which case that data remains dirty (not flushed) until file close, a forced sync, or enough data to fill a block joins it. Either way, you’re still better off (MUCH better off) than you would be with 64x as many blocks per file, which means you’re ABSOLUTELY eating 64x the IOPS.
Remember, the torrent is written random-access, not sequentially. So with 64x as many blocks, written out of sequence, you end up with roughly 64x the fragmentation and the necessary IOPS, even with readahead buffering. (Sequentially written files on that same 16K dataset will rarely be heavily fragmented unless the pool itself is very full and therefore has only heavily fragmented free space to write to.) This is why copying a torrented file, even to the same dataset with the same small recordsize, tends to effectively camouflage the underlying issue: the copy is written sequentially, so it won’t show the original’s fragmentation.
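The 64x figure is just the recordsize ratio; back-of-envelope (my arithmetic, not a benchmark):

```python
# The "64x" above is just the recordsize ratio (arithmetic, not a benchmark).
BIG_KIB = 1024    # recordsize=1M
SMALL_KIB = 16    # recordsize=16K

ratio = BIG_KIB // SMALL_KIB
print(ratio)  # 64

# Example: a hypothetical 4GiB torrent
file_kib = 4 * 1024 * 1024
print(file_kib // BIG_KIB)    # 4096 blocks at recordsize=1M
print(file_kib // SMALL_KIB)  # 262144 blocks at recordsize=16K
```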