I don’t know this gentleman, but it makes a lot of sense to plan and understand a system. That is why I appreciate your answers here so much. Just sounds like a lot of feelings…
I understand your example (at least I hope I do), but are there not two things to consider here? 1) the optimal layout, 2) the total amount of data disks vs. parity disks. If I were to go with 2 6-wide RAIDz2 VDEVs, I would have bought 12 disks and “lost” 4 for parity. So there are 8 data disks total. If I went with a single 12 wide RAIDz2, I would only spend 2 disks on parity and still have 10 disks for data.
So overall, would the 12 wide RAIDz2 still have a worse efficiency, even though I would be “tanking” some non-optimal configurations with two additional data disks?
Also, how much of a concern is the poor efficiency of wide VDEVs if I have a larger block size?
And is blocksize the same as recordsize, or is blocksize the number of data chunks split from the equation recordsize / number of data disks?
If you’ve got two 6-wide Z2 vdevs, you’ve got eight drives’ worth of data, with no padding necessary, and (assuming recordsize is at least 16K) nearly all of your data will be able to stretch across the entire stripe. Your metadata blocks will only have 33% storage efficiency, as will any single-sector files. Your two- and three-sector files will also take a small hit to the naively expected 67% SE, since they will be at 50% (two data sectors + two parity sectors) and 60% (three data sectors + two parity sectors), respectively. But for most datasets, the vast majority will be a nice clean 67% SE, no fuss no muss.
If you do a twelve-wide Z2, though, every full-stripe write will require some padding, and every file smaller than 40KiB will be stored in a narrower stripe than the naive expectation. Ignoring the (relatively) small files for now and looking at the padding, you’ll wind up with inefficiency roughly like this:
128KiB block / 10 data drives == 12.8KiB per drive
12.8KiB / 4KiB/sector == 3.2 sectors per drive
You can’t store data in “0.2 sectors”, so that means you’re really storing data in 4 sectors per drive. 3.2 sectors / 4 sectors == 0.8 == 80%, so you’re only 80% efficient there on top of the 10/12 == 83.3% SE you were normally expecting. Put those together, and you’re at 67% SE… the exact same 67% SE you had with two six-wide Z2, but now with worse performance and more uncertainty about what you’ve actually got available to you.
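The arithmetic above is easy to play with in a few lines of Python. This is a deliberately simplified sketch of the rounding logic described here, not OpenZFS’s real allocator (which also pads allocations out to multiples of parity + 1 sectors), so treat the numbers as approximations:

```python
# Simplified model of the math above: per-drive data rounds up to whole
# 4KiB sectors, and the result is scaled by the data/total drive ratio.
# (OpenZFS's real allocator also pads allocations to multiples of
# parity + 1 sectors, so treat this as an approximation.)
import math

SECTOR = 4096  # 4KiB sectors (ashift=12)

def approx_se(width, parity, recordsize):
    data_drives = width - parity
    sectors_needed = recordsize / data_drives / SECTOR  # e.g. 3.2 for 12-wide
    sectors_used = math.ceil(sectors_needed)            # can't use 0.2 of a sector
    return (sectors_needed / sectors_used) * (data_drives / width)

print(f"{approx_se(6, 2, 128 * 1024):.1%}")   # 6-wide Z2:  66.7%
print(f"{approx_se(12, 2, 128 * 1024):.1%}")  # 12-wide Z2: 66.7%
```

Both layouts land on the same ~67%, which is exactly the point being made above.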
Why is the performance worse? Well, you went from two vdevs to one, which means you cut your IOPS in half. On top of that, you’re writing less data per individual drive for each block (12.8KiB rather than 32KiB per drive for a 128KiB block), which means each individual drive suffers considerably worse performance and gets fragmented more quickly.
Can you survive all that? Sure. Would you have been better off with the pair of six-wide? Absolutely; they’re simpler, significantly higher performance, less confusing, AND more resistant to failure (since you can lose not only any two disks, but up to four disks if you’re lucky about which ones you lose).
Also, how much of a concern is the poor efficiency of wide VDEVs if I have a larger block size?
The larger the blocksize, the smaller the per-drive performance issues. But you can’t get away from the fact that your IOPS are worse: a larger blocksize improves the throughput of each individual drive, but doesn’t do anything about the vdev-level IOPS you lost by going from two vdevs to one.
And is blocksize the same as recordsize
OpenZFS devs don’t ever talk about records, and mind-bogglingly enough, tend to get a little confused if YOU talk about records. To an OpenZFS dev (and you’ll see this if you ever look at the code), everything is stored in blocks. “Recordsize” is the tunable parameter that allows you to set the size of a block in a dataset.
That’s not the only kind of block–if you use ZVOLs, you’ll set blocksize in them using the volblocksize property. Volblocksize is essentially just like recordsize, but it’s for zvols not filesystem datasets, and unlike recordsize it’s immutable. You have to set volblocksize at zvol creation time, and once the zvol has been created, you can’t modify the volblocksize.
the number of data chunks split from the equation recordsize / number of data disks?
Unfortunately, OpenZFS devs don’t actually have terminology for what you’re reaching after here. In mdraid, what you’re describing is “chunk size”. OpenZFS doesn’t really concern itself directly with chunks the way mdraid does, because OpenZFS doesn’t write partially-empty stripes.
Mdraid–or other forms of conventional RAID–will store, for example, a 4KiB file on a 6-wide RAID6 just like it would store a 128KiB file… split into pieces across the entire stripe width. If, later, you want to store a second 4KiB file, the RAID controller will slide that file into the empty space around the first 4KiB file in the same stripe, recalculate the parity, and you actually haven’t used any additional room when you store the second file (but you were INCREDIBLY wasteful when you stored the first one, since it ate an entire stripe worth of storage space).
OpenZFS, on the other hand, has dynamic stripe width. So it isn’t as concerned with chunk size, because it will simply store smaller files in a narrower stripe (as described above, talking about eg 4KiB files on a Z2). Since it never has to rewrite a stripe with additional data crammed in (I hypothesize) its devs never really had to be so concerned with the specific amount of data per disk that they actually set a term for it.
So, yeah, technically there is no such thing as “chunk size” on RAIDz. Where on an mdraid6 you might specifically set chunk size to 256KiB in order to write up to 256KiB to each disk in a single stripe (opening the door for reading / modifying / recalculating parity / rewriting even when multiple files are stored on the same DISK in the same stripe!), in ZFS you simply split any given block up as evenly as possible amongst all available drives, UNLESS the block is too small to need the entire stripe, in which case it’s written across fewer drives, one sector apiece.
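To illustrate the dynamic-stripe-width idea, here’s a toy Python sketch (the function name and the single-row simplification are mine, not OpenZFS terminology): it just shows how many data drives a block of a given size actually touches on a 6-wide Z2 with 4KiB sectors.

```python
# Toy illustration (my own simplification, not OpenZFS code): how many data
# drives a single block touches on a 6-wide Z2 with 4KiB sectors. Large
# blocks wrap across multiple rows on the same data drives.
SECTOR = 4096

def stripe_shape(block_bytes, width, parity):
    data_sectors = max(1, -(-block_bytes // SECTOR))  # ceiling division
    data_drives = min(data_sectors, width - parity)   # small blocks: narrower stripe
    return data_drives, parity

for size in (4 * 1024, 16 * 1024, 128 * 1024):
    d, p = stripe_shape(size, 6, 2)
    print(f"{size // 1024:3d}KiB block -> {d} data drive(s) + {p} parity drive(s)")
```

A 4KiB block gets one data sector plus two parity sectors (the 33% SE case from earlier), while anything 16KiB and up spreads across all four data drives.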
Man, you are a legend! It is really complicated, but after reading it 5 times (and understanding a little more each time) I think I understand what you are telling me here.
I think you mean here (three data sectors + two parity sectors) - please confirm.
I understand the performance side; two vdevs allow double the write performance (IOPS and sequential), right? The data still needs to be read from one VDEV, so there is no read performance benefit. I would add more mirrors for that. I hope that was correct, but please correct me if I am wrong.
But what do you mean by the other part? That I have a higher chance of not losing data because I have more parity disks?
Read or write IOPS?
Good point! I never thought of that!
Because the padding is not needed for most of the blocks and only for the last non-optimal sector?
Makes sense, because I miss the second VDEV.
So to sum up, a 12-wide RAIDz2 has the same SE as two 6-wide RAIDz2, but the 2x6 combination has better performance (both per individual disk and because it’s two VDEVs) AND better failure guarantees (4 parity disks). I just see more raw capacity on the 12-wide, while effectively they are the same?
So how would you structure a 15 slot storage server? I have 2 NVME disks for SLOG device. 2x6 wide RAIDz2, 1 spare and two additional mirrored SATA SSDs as metadata VDEV?
What is the best layout for RAIDz3? I assume 5 wide, 7 wide, 11 wide or 19 wide?
For any RAIDz, the optimal widths are 2+p, 4+p, and 8+p, where p is the parity level 1, 2, or 3. So, yes, ideal widths for RAIDz3 are 5, 7, and 11.
19 wide is absolutely batshit, yes. In order to even take advantage of a stripe that wide, you’d need an absolutely MASSIVE recordsize, which would in turn present some pretty large latency issues on anything but massive files which needed to be written to and read from in their entirety, not in smaller pieces. And the actual returns you get on storage efficiency even if you do all that right are just vanishingly small.
Even without accounting for all the potential ways to lose efficiency, the naive SE expectation for 11-wide Z3 is 8/11 == 73%. For 19-wide Z3 it’s 16/19 == 84%. It might be worthwhile if you could actually reap that entire 11%, but even then, only marginally, because you’d still be saddled with increasingly terrible performance AND lower failure resistance. And for all the reasons we’ve already talked about, you’re not actually going to get that full additional 11% efficiency, because it won’t help you any with undersized blocks that don’t occupy the full stripe width.
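Quick sanity check on those naive SE numbers, assuming full-stripe writes and no padding:

```python
# Naive (full-stripe, no-padding) storage efficiency for the Z3 widths above:
def naive_se(width, parity):
    return (width - parity) / width

for width in (5, 7, 11, 19):
    print(f"{width:2d}-wide Z3: {naive_se(width, 3):.0%}")
```

That prints 40%, 57%, 73%, and 84% respectively; the jump from 11-wide to 19-wide really is only about 11 points, before padding and small-block losses eat into it.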
Personally, I’d do two-wide mirrors in 15 bays; seven of them, plus a hotspare if I was really itching to fill that fifteenth bay. But I value performance, fast resilvering, and ease of capacity upgrades well beyond the additional capacity you get with RAIDz2 (or, for that matter, the additional parity. Though the additional parity is the most tempting part).
You could opt instead for 5x3-wide Z1. That gets you most of the performance of mirrors, with even better storage efficiency than six-wide Z2, at the cost of a little less fault resistance than mirrors.
Or, the safe route: two 6-wide Z2, with or without a spare in the last bay.
I can’t really pick one of those for you directly; you have to answer for yourself which properties you value most. I do think those are the only three configs worth serious consideration for what I hear from your goals and 15 total bays, though.
If you decide to go mirrors, you might also consider NOT filling every single bay right up front. Four mirrors will get you more throughput than you know what to do with, and if that’s enough capacity to last you for a couple of years, you’ll have larger drives for less money per TiB available a few years later, and can just add more mirrors. Eventually, you start replacing the oldest mirrors–still getting capacity upgrades with every two drives you replace.
Thank you so much! You have helped me a lot to understand the topic. I think I will stay on my non-optimal VDEV until it is 70% full, then move my data to a new 6-wide RAIDz2, and repurpose my old one as a second 6-wide RAIDz2.
I hope you feel refreshed after a few months of rest. I have two follow-up questions.
Does BitTorrent piece size have any effect on ZFS performance? I know only the torrent creator can set it, but if I were in a situation where I could theoretically change it, is there a good answer in terms of performance and ZFS?
Let’s say you want to go big: is there a feasible layout to get 10Gbps (10, not 1) throughput with spinning disks without sacrificing too much usable space? Not that I plan to do this, but I want to understand ZFS performance better. My beginner’s mind thinks that if I had a big server with 60 enterprise drives that each deliver around 150 random IOPS (realistic?) and I split them into 10 x 6-wide RAIDz2 vdevs, I would have 150 x 10 IOPS. Assuming the data is equally requested and I set a recordsize of 4M (meaning each data disk gets 1M), would all the disks together be able to provide 150 IOPS x 10 x 1M = 1500 MiB/s? I know I am way off here, but I would like to know how the math works. If 1500 MiB/s is the right approximate answer, then we would be in the 10Gbps range.
Correct, you can’t set the piece size as a BitTorrent consumer. But if you could, you’d get the best performance out of a larger (>1MiB) piece size, matched to a large (1M) recordsize. An even larger recordsize might get you somewhat higher performance in some cases, but in my experience you get very little gain from blocksize or recordsize larger than 1M, whereas the potential harm due to latency spikes continues to rise. I do not generally recommend recordsize larger than 1MiB, especially without careful testing against the specific workload in play.
You’re on essentially the right track, just keep in mind that you’re talking rules of thumb, not hard numbers. You may not get exactly what you expect, but the scaling will generally follow the train of thought you laid out here. The biggest issue is that you may see significantly LOWER IOPS out of a raidz vdev than you see out of a single disk of the same class. What this means for your particular workload is something you’re not going to find out for sure without testing… And keep in mind, recordsize=1M on a six-wide Z2 doesn’t mean writing 1M to each individual drive, it means writing 256KiB to each individual drive, which is not optimal for rust drive throughput.
If you’re serious about getting extremely high performance, you may need to go with mirrors. You can also try recordsize=4M, if your version of OpenZFS is modern enough to support it. That will make the amount of data written to each individual drive 1MiB at a time, which is optimal for rust… But depending on your workload, again, you may find that you’ve traded higher potential top-end throughput for nasty latency spikes at that higher recordsize, which may or may not also result in worse throughput.
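For what it’s worth, the back-of-envelope math here can be sketched like this (the 150 IOPS figure is an assumption, not a measurement, and a real raidz vdev may deliver fewer IOPS than a single drive, as noted above):

```python
# Rule-of-thumb math only: real vdev IOPS vary drastically by workload.
def per_drive_write(recordsize, width, parity):
    """Bytes written to each data drive for one full block."""
    return recordsize // (width - parity)

def rough_throughput_mib(vdevs, iops_per_vdev, block_mib):
    """Optimistic ceiling: every op moves one full block, scaled across vdevs."""
    return vdevs * iops_per_vdev * block_mib

print(per_drive_write(4 * 1024 * 1024, 6, 2) // 1024)  # 1024 KiB/drive at rs=4M
print(per_drive_write(1 * 1024 * 1024, 6, 2) // 1024)  # 256 KiB/drive at rs=1M
print(rough_throughput_mib(10, 150, 1))                # 1500 MiB/s with 1MiB ops
```

Again, that 1500 MiB/s is a scaling ceiling, not a promise; testing against the actual workload is the only way to know what you’ll really get.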
Essentially, I can give you recommendations and guidelines, but you want to operate at a scale that’s going to require some testing and potential adjustments, if you are genuinely serious about the need to saturate or near saturate 10Gbps or higher interfaces.
A (next to) last warning: at 10Gbps or more, the storage stack is not your only bottleneck. You also need a fast processor with a ton of PCIe lanes, and a workload that can manage multi threaded network connectivity (eg pNFS); most processors aren’t capable of breaking 5-6Gbps on a single CPU thread, in my experience, whether the storage stack itself is even involved.
You may also need to mess around with jumbo frames, which is another can of worms that can be a huge pain all on its own. Fair warning.
Why is it a good idea to match BitTorrent piece size and ZFS record size? I was thinking that BitTorrent’s piece size is just a balance between metadata overhead and sharing efficiency. Or is there more to the piece size that affects how ZFS stores the data?
I wanted to go with rs=4M so that each disk gets a 1M chunk, so the thing you pointed out was just a misunderstanding on my side. Great to hear you say that my rule-of-thumb calculation is the right approach. What random IOPS (which are essential for torrents) can I expect from enterprise hard drives in a 6-wide RAIDz2? Say the Seagate Exos series.
For seeding, near-matching piece and block size is potentially valuable because it reduces read amplification (having to read a bunch of unnecessary data in order to get the data you need for a much smaller request).
Regardless of read amplification concerns, you want a large recordsize on stuff you’re acquiring via torrent, whether you later intend to seed or not. This is because the mechanics of the BitTorrent protocol tend to result in maximal fragmentation, so increased blocksize means random I/O at a larger blocksize when reading, and large block I/O offers significantly higher throughput than small block I/O does.
This is really a little too vague. Raidz performance varies pretty drastically according to workload. All I can really tell you is to expect similar to moderately lower IOPS than a single disk of the vdev would offer, and that you generally expect around 150-200 IOPS from most decent modern rust.
If your workload is essentially exclusively torrent acquisition and seeding, without further information I would cautiously expect one 6w Z2 vdev on rust at RS=4M to offer around 400-600 MiB/sec throughput, total (meaning read and write combined). Assuming a sufficiently parallel workload–which seeding certainly ought to be–I’d cautiously expect roughly linear scaling with additional matching vdevs.
This is all, of course, assuming you don’t hit a specific bottleneck somewhere, like (but not limited to) the issue with the maximum network throughput per individual CPU thread that I mentioned briefly above.
Pardon the late reply, but while reading through I decided to run an experiment with a torrent staging filesystem set to recordsize=1M and a destination filesystem set to recordsize=4M. I copied the ISO rather than re-running the torrent; I could run that experiment, too.
This is running on TrueNAS SCALE 25.04.2.1 inside of a Proxmox VE 9 VM. My pool consists of a pair of 6-wide RAID-Z2 vdevs with an Optane 1600X SLOG, and I used DVD 1 of Debian 13 as my source.
The VM was rebooted to purge L2ARC before I tested this.
edit: I also ran a test in which the staging directory used recordsize=4M as well.
edit2: reorganized and posted a few more datapoints
Directly torrented file on a recordsize=1M filesystem:
What is the difference between staging and destination here? Is staging a directly torrented file, and destination a copied file? If so, was staging a directly torrented file in both runs, or only in the 1M run?
Are you running nested ZFS, or is either proxmox or TrueNAS running with a different back end file system? How much RAM in the host, and how much in the TrueNAS VM?
I’ve added more datapoints and rearranged things above to hopefully make them more clear, and reduced the pv progress bar width. Staging indeed meant the directly torrented file, and destination that file copied to another ZFS filesystem on the same pool.
The Proxmox VE host is currently using XFS on LVM as its backing filesystem, and this hosts the TrueNAS VM boot virtual disk on a SATA SSD. It has 64GB of RAM and a Xeon E-2226G 6-core CPU.
I am passing through an LSI 9300-8e HBA into TrueNAS. It is currently allocated 4 cores and 36GB of RAM. The disks are on an external 12-bay Areca JBOD fed by one SFF-8644 cable; I assumed that 4x 12Gbps SAS lanes and the JBOD’s onboard SAS expander should not be the limiting factor. The HBA has 8 PCIe lanes and is installed in a 16-lane PCIe slot.
In the fullness of benchmarking disclosure, I should note that this pool is very much not well-balanced, but there should be more than enough space available for a piddly 3.8GB ISO to be well-distributed in either case:
The HBA is not a significant bottleneck for you, but if you hadn’t used a solid HBA instead of just motherboard SATA, that would have been quite a significant bottleneck.
Most motherboard SATA–even on pricy server boards like Supermicro, Tyan, etc–bottleneck hard at around 700MiB/sec.
Let me preface this message by thanking @mercenary_sysadmin for all the detailed answers provided in this thread. These high-quality explanations are invaluable for gaining better insight into ZFS, and they help me a lot in preparing my installation properly.
However, I would like to clarify one point regarding the acceptable value of recordsize for Bittorrent seeding.
I’ve read several users complaining or warning about “read amplification” problems when combining ZFS/btrfs with long-time seeding. Here are a few discussions:
According to the responses, this behavior is attributed to a recordsize of 1M being considered too large for use cases such as torrenting, since the protocol implies many small and random read-write operations. OpenZFS also recommends a 16K recordsize.
This surprises me because it doesn’t seem to align with what is advised here. I also found this other message from @mercenary_sysadmin on Reddit, which considers recordsize=1M acceptable and recordsize=16K outdated.
I’d like to clarify the following statements (from the Reddit thread):
In theory, they were imagining streamlining the workload of the torrent software itself, by trying to match recordsize to the individual pieces of the torrent. This would theoretically optimize seeding, though at the significant detriment of file playback.
But that doesn’t make sense either, because piece sizes themselves are almost always larger than 1MiB for anything big enough to bother torrenting. You want 1000-1500 pieces in a torrent; that means that even for a measly 1GiB torrent you’re looking at individual pieces of 1MiB or slightly larger.
And from the current thread:
For seeding, near-matching piece and block size is potentially valuable because it reduces read amplification (having to read a bunch of unnecessary data in order to get the data you need for a much smaller request).
These messages discuss the value of recordsize in relation to the torrent piece size, but they do not talk about the “block” (aka “chunk”) size of the pieces, which is defined as “the smallest unit of data that is requested and transferred over the network”. Indeed, each torrent piece is subdivided into blocks, which are almost always 16KB. From what I understand, peers request blocks, not pieces.
That means at the data level, the Bittorrent client reads blocks of 16KB from the disk, regardless of the torrent piece size. This can be confirmed by looking at the implementation of the Transmission application: the read_block() function (which calls tr_ioRead()) is used to fill a buffer of 1024 * 16 bytes.
Therefore, this got me confused about how RAIDz2 performance truly relates to a (mis)match between recordsize and torrent piece size.
I could summarize my concerns with these questions:
Does the recordsize=1M recommendation from this thread account for the 16KB torrent sub-piece size?
If so, how does it prevent “read amplification” (reading 1M for 16KB requested)?
Could that be that OpenZFS advises recordsize=16K because it matches the torrent blocks, not the pieces?
It really doesn’t matter whether a bittorrent client reads or writes to the disk in “sub-pieces,” because even if it’s splitting all of a torrent piece into 16K chunks and then sending to / reading from storage, it’s necessarily doing that for all sub-pieces of a piece, every time it does it.
Because torrent clients only trade torrents by the piece–not by the sub-piece–there is never a time when you have read or write amplification caused by reading or writing the full piece, rather than individual sub-pieces.
(Note: when I say they trade by the piece, not by the “sub-piece” or “chunk”, what I mean is that even if the individual requests are for chunks, you cannot ONLY request a chunk from a peer or seed: even if you’re asking for one chunk at a time, you can see just by watching the client activity that this mandates getting the REST of the chunks of that piece from the same peer or seed, unless it times out. And if it times out, it discards what it got from that peer or seed and asks for the whole piece all over again.)
In fact, this would explain a lot of why torrents get SO badly fragmented, even on fairly non-full pools, with small recordsizes–if the bittorrent client were actually telling the storage system “write this entire piece for me,” even with recordsize=16K, you’d generally speaking get each piece almost entirely contiguous–because those writes would be issued either in parallel, or contiguously in very rapid succession. But if the client is issuing 64 separate writes of 16KiB apiece, ZFS is a lot more likely to fragment them… if 16KiB is enough to fill a block, that is.
On the other hand, setting recordsize=1M forces your bittorrent client not to be an idiot and fragment the jesus out of your torrent pieces into “sub-pieces” on the disk–because even though the client might say “here’s 16K of data to put into this file,” the storage system itself says “that’s nice, now give me 63 more just like it and we’ll write a nice clean contiguous 1MiB to disk.”
Similarly, you can’t get read amplification, because again–regardless of the “sub piece” business, torrent clients trade entire pieces, not “sub-pieces.” So if you needed 16K “sub-piece one” out of a 1MiB torrent piece, you’re absolutely going to need the other 63 “sub-pieces” after it.
Now, even though your torrent client might very well issue 64 requests for 64 “sub-pieces”, since your recordsize=1M, what actually happens is the very first request for sub-piece one actually pulls in all 64… and the next 63 requests get filled from cache.
You may recognize this as “very, very little difference from simply requesting all 1MiB at once,” because the additional 63 IOPS never hit the drive at all.
This would be amplification if you rarely needed many of the other 63 sub-pieces. But since you always need the other 63, it is not read amplification, it is read IOPS reduction.
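That cache effect is easy to demonstrate with a toy simulation (purely illustrative; ARC behavior is far more sophisticated than a set of record numbers): count how many reads actually reach the disk when a client requests a 1MiB piece in sequential 16KiB chunks.

```python
# Toy model only: ARC caching reduced to "which records are already cached".
def disk_reads(request_kib, piece_kib, recordsize_kib):
    """Count reads that actually hit the disk for sequential requests."""
    cached = set()
    reads = 0
    for offset in range(0, piece_kib, request_kib):
        record = offset // recordsize_kib  # which ZFS record holds this offset
        if record not in cached:
            cached.add(record)
            reads += 1
    return reads

print(disk_reads(16, 1024, 1024))  # rs=1M:  1 disk read serves all 64 requests
print(disk_reads(16, 1024, 16))    # rs=16K: 64 separate disk reads
```

With 1M records, the 64 requests collapse into a single disk read; with 16K records, every request is its own read, and on a fragmented pool, likely its own seek.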
You don’t have to believe me. Just try it both ways, preferably on a spinning rust vdev. The difference in performance is not subtle. With recordsize=16K, I used to have trouble even managing to get some of my torrented “viewable ISOs” to play without stuttering, getting under 5MiB/sec sustained reads on the torrented files.
After I did some thinking and changed to recordsize=1M, torrenting the exact same file resulted in a copy that could be read at 200MiB/sec, on the same (small rust) pool.
This was not just a thought exercise for me; it’s lived experience.