I don’t know this gentleman, but it makes a lot of sense to plan and understand a system. That is why I appreciate your answers here so much. Just sounds like a lot of feelings…
I understand your example (at least I hope I do), but are there not two things to consider here? 1) the optimal layout, 2) the total amount of data disks vs. parity disks. If I were to go with 2 6-wide RAIDz2 VDEVs, I would have bought 12 disks and “lost” 4 for parity. So there are 8 data disks total. If I went with a single 12 wide RAIDz2, I would only spend 2 disks on parity and still have 10 disks for data.
So overall, would the 12 wide RAIDz2 still have a worse efficiency, even though I would be “tanking” some non-optimal configurations with two additional data disks?
Also, how much of a concern is the poor efficiency of wide VDEVs if I have a larger block size?
And is blocksize the same as recordsize, or is blocksize the number of data chunks split from the equation recordsize / number of data disks?
If you’ve got two 6-wide Z2 vdevs, you’ve got eight drives’ worth of data, with no padding necessary, and (assuming recordsize is at least 16K) nearly all of your data will be able to stretch across the entire stripe. Your metadata blocks will only have 33% storage efficiency, as will any single-sector files. Your two and three sector files will also take a small hit to the naively expected 67% SE, since they will be at 50% (two data sectors + two parity sectors) and 60% (three data sectors + two parity sectors), respectively. But for most datasets, the vast majority will be a nice clean 67% SE, no fuss no muss.
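If it helps to see where those percentages come from, here’s a quick sketch of the same sector counting. It’s purely my own illustration, using the simplified data-plus-parity model from this post (4KiB sectors, 4 data + 2 parity columns) and ignoring any additional allocation padding ZFS may add on top:

```python
# Sector-counting sketch behind the SE percentages above.
# Assumptions: 4KiB sectors, 6-wide RAIDZ2 (4 data + 2 parity columns),
# and the simplified data-plus-parity model used in this post
# (ZFS's additional allocation padding is ignored).

DATA_COLS = 4   # data drives per full stripe on a 6-wide RAIDZ2
PARITY = 2      # parity sectors per stripe row on RAIDZ2

def rough_se(data_sectors: int) -> float:
    """Storage efficiency for a block occupying `data_sectors` 4KiB sectors."""
    rows = -(-data_sectors // DATA_COLS)      # ceiling division: stripe rows needed
    total = data_sectors + rows * PARITY      # data sectors + parity sectors written
    return data_sectors / total

for n in (1, 2, 3, 4, 32):                    # 4KiB..16KiB files, plus a 128KiB block
    print(f"{n:>2} data sectors -> {rough_se(n):.0%} storage efficiency")
# 1 -> 33%, 2 -> 50%, 3 -> 60%, 4 -> 67%, 32 -> 67%
```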
If you do a twelve-wide Z2, though, every full-stripe write will require some padding, and every file smaller than 40KiB will be stored in a narrower stripe than the naive expectation. Ignoring the (relatively) small files for now and looking at the padding, you’ll wind up with inefficiency roughly like this:
128KiB block / 10 data drives == 12.8KiB per drive
12.8KiB / 4KiB/sector == 3.2 sectors per drive
You can’t store data in “0.2 sectors”, so that means you’re really storing data in 4 sectors per drive. 3.2 sectors / 4 sectors == 0.8 == 80%, so you’re only 80% efficient there on top of the 10/12 == 83.3% SE you were normally expecting. Put those together, and you’re at 67% SE… the exact same 67% SE you had with two six-wide Z2, but now with worse performance and more uncertainty about what you’ve actually got available to you.
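Here’s that same back-of-the-envelope math as a tiny script, in case you want to poke at other widths or blocksizes. It uses the same rough per-drive rounding model as above, nothing more:

```python
# Back-of-the-envelope version of the 12-wide Z2 padding math above.
# Assumptions: 4KiB sectors, 128KiB recordsize, 12-wide RAIDZ2 (10 data + 2 parity),
# using the same rough per-drive rounding model described in this post.
import math

SECTOR = 4 * 1024                     # 4KiB sector size
BLOCK = 128 * 1024                    # 128KiB block
DATA_DRIVES, TOTAL_DRIVES = 10, 12    # 12-wide RAIDZ2

sectors_needed = BLOCK / DATA_DRIVES / SECTOR        # 3.2 sectors per drive
sectors_used = math.ceil(sectors_needed)             # rounded up to 4 whole sectors
padding_eff = sectors_needed / sectors_used          # 3.2 / 4  == 80%
naive_eff = DATA_DRIVES / TOTAL_DRIVES               # 10 / 12  == 83.3%

print(f"padding efficiency:  {padding_eff:.1%}")     # 80.0%
print(f"naive efficiency:    {naive_eff:.1%}")       # 83.3%
print(f"combined efficiency: {padding_eff * naive_eff:.1%}")  # 66.7%
```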
Why is the performance worse? Well, you went from two vdevs to one, which means you cut your IOPS in half. On top of that, you’re writing less data per individual drive for each block, which means each individual drive suffers considerably worse throughput and gets fragmented more quickly.
Can you survive all that? Sure. Would you have been better off with the pair of six-wide? Absolutely; they’re simpler, significantly higher performance, less confusing, AND more resistant to failure (since you can lose not only any two disks, but up to four disks if you’re lucky about which ones you lose).
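To put a number on the “lucky about which ones you lose” part, here’s a small combinatorial sketch (purely illustrative) counting how often a pool of two 6-wide Z2 vdevs survives losing k of its 12 disks:

```python
# Counting how often a pool of two 6-wide RAIDZ2 vdevs survives losing k disks.
# Purely illustrative: disks 0-5 are vdev 0, disks 6-11 are vdev 1, and the pool
# survives as long as neither vdev loses more than 2 disks.
from itertools import combinations

VDEV_OF = {disk: disk // 6 for disk in range(12)}

def survives(lost_disks) -> bool:
    per_vdev = [0, 0]
    for disk in lost_disks:
        per_vdev[VDEV_OF[disk]] += 1
    return max(per_vdev) <= 2            # RAIDZ2 tolerates two losses per vdev

for k in (2, 3, 4, 5):
    combos = list(combinations(range(12), k))
    ok = sum(survives(c) for c in combos)
    print(f"lose {k} disks: {ok}/{len(combos)} combinations survive ({ok / len(combos):.0%})")
# lose 2: always survives; lose 3 or 4: depends which disks; lose 5: never.
```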
Also, how much of a concern is the poor efficiency of wide VDEVs if I have a larger block size?
The larger the blocksize, the smaller the performance issues. But you can’t get away from the fact that your IOPS are worse. The larger blocksize improves the throughput per individual drive, but doesn’t do anything about the vdev-level IOPS you gave up by going from two vdevs to one.
And is blocksize the same as recordsize
OpenZFS devs don’t ever talk about records, and mind-bogglingly enough, tend to get a little confused if YOU talk about records. To an OpenZFS dev (and you’ll see this if you ever look at the code), everything is stored in blocks. “Recordsize” is the tunable parameter that allows you to set the size of a block in a dataset.
That’s not the only kind of block–if you use ZVOLs, you’ll set blocksize in them using the volblocksize property. Volblocksize is essentially just like recordsize, but it’s for zvols not filesystem datasets, and unlike recordsize it’s immutable. You have to set volblocksize at zvol creation time, and once the zvol has been created, you can’t modify the volblocksize.
…the number of data chunks split from the equation recordsize / number of data disks?
Unfortunately, OpenZFS devs don’t actually have terminology for what you’re reaching after here. In mdraid, what you’re describing is “chunk size”. OpenZFS doesn’t really concern itself directly with chunks the way mdraid does, because OpenZFS doesn’t write partially-empty stripes.
Mdraid–or other forms of conventional RAID–will store, for example, a 4KiB file on a 6-wide RAID6 just like it would store a 128KiB file… split into pieces across the entire stripe width. If, later, you want to store a second 4KiB file, the RAID controller will slide that file into the empty space around the first 4KiB file in the same stripe, recalculate the parity, and you actually haven’t used any additional room when you store the second file (but you were INCREDIBLY wasteful when you stored the first one, since it ate an entire stripe worth of storage space).
OpenZFS, on the other hand, has dynamic stripe width. So it isn’t as concerned with chunk size, because it will simply store smaller files in a narrower stripe (as described above, talking about e.g. 4KiB files on a Z2). Since it never has to rewrite a stripe with additional data crammed in, (I hypothesize) its devs never really had to be concerned enough with the specific amount of data per disk to actually coin a term for it.
So, yeah, technically there is no such thing as “chunk size” on RAIDz. Where on an mdraid6 you might specifically set chunk size to 256KiB in order to write up to 256KiB to each disk in a single stripe (opening the door for reading / modifying / recalculating parity / rewriting even when multiple files are stored on the same DISK in the same stripe!), in ZFS you simply split any given block up as evenly as possible amongst all available drives, UNLESS the block is too small to need the entire stripe width, in which case it’s written one sector per drive across only as many drives as it needs.
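If it helps, here’s a sketch of that dynamic stripe width idea on a 6-wide Z2, using the same simplified model as earlier in the thread (4KiB sectors, allocation padding ignored). It just shows how many data drives a block of a given size touches and how much lands on each:

```python
# Sketch of dynamic stripe width on a 6-wide RAIDZ2, using the same simplified
# model as earlier in the thread (4KiB sectors, 4 data + 2 parity columns,
# allocation padding ignored). Shows how many data drives one block touches
# and how many sectors land on each; there is no fixed "chunk size" anywhere.
import math

SECTOR_KIB = 4
DATA_COLS = 4    # data columns on a 6-wide RAIDZ2

def stripe_shape(block_kib: int):
    data_sectors = math.ceil(block_kib / SECTOR_KIB)
    drives_used = min(data_sectors, DATA_COLS)        # small blocks get a narrower stripe
    sectors_per_drive = math.ceil(data_sectors / drives_used)
    return drives_used, sectors_per_drive

for size_kib in (4, 8, 16, 128):
    drives, sectors = stripe_shape(size_kib)
    print(f"{size_kib:>3} KiB block -> {drives} data drive(s), {sectors} sector(s) each")
# 4 KiB -> 1 drive; 8 KiB -> 2 drives; 16 KiB and up -> all 4 data drives.
```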
Man, you are a legend! It is really complicated, but after reading it 5 times (and understanding a little more each time) I think I understand what you are telling me here.
I think you mean here (three data sectors + two parity sectors) - please confirm.
I understand the performance side; two vdevs allow double the write performance (IOPS and sequential), right? The data still needs to be read from one VDEV, so there is no read performance benefit. I would add more mirrors for that. I hope that was correct, but please correct me if I am wrong.
But what do you mean by the other part? That I have a higher chance of not losing data because I have more parity disks?
Read or write IOPS?
Good point! I never thought of that!
Because the padding is not needed for most of the blocks and only for the last non-optimal sector?
Makes sense, because I’m missing the second VDEV.
So to sum up, a 12-wide RAIDz2 has the same SE as 2 6-wide RAIDz2s, but the 2x6 combination has better performance both per disk and because it’s two VDEVs, AND better failure guarantees (4 parity disks). I just see more raw capacity on the 12-wide, while effectively they are the same?
So how would you structure a 15-slot storage server? I have 2 NVMe disks for a SLOG device. 2x 6-wide RAIDz2, 1 spare, and two additional mirrored SATA SSDs as a metadata VDEV?
What is the best layout for RAIDz3? I assume 5 wide, 7 wide, 11 wide or 19 wide?
For any RAIDz, the optimal widths are 2+p, 4+p, and 8+p, where p is the parity level 1, 2, or 3. So, yes, ideal widths for RAIDz3 are 5, 7, and 11.
19 wide is absolutely batshit, yes. In order to even take advantage of a stripe that wide, you’d need an absolutely MASSIVE recordsize, which would in turn present some pretty large latency issues on anything but massive files which needed to be written to and read from in their entirety, not in smaller pieces. And the actual returns you get on storage efficiency even if you do all that right are just vanishingly small.
Even before taking into account all the potential ways to lose some of it to inefficiency, the naive SE expectation for 11-wide Z3 is 8/11 == 73%. For 19-wide Z3 it’s 16/19 == 84%. It might be worthwhile if you could actually reap that entire 11%, but even then only marginally, because you’d still be saddled with increasingly terrible performance AND lower failure resistance. And for all the reasons we’ve already talked about, you’re not actually going to get that full additional 11% of efficiency, because it’s not going to help you any with undersized blocks that don’t occupy the full stripe width.
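For reference, here are the naive SE numbers for all the widths we’ve been throwing around, straight from the width = 2^n + p rule. This deliberately ignores every padding and small-block effect discussed above:

```python
# Naive storage efficiency for the "2 + p, 4 + p, 8 + p" widths (plus 16 + p),
# i.e. SE = data / (data + parity). This deliberately ignores all of the
# padding and small-block effects covered earlier in the thread.
for parity, name in ((1, "RAIDZ1"), (2, "RAIDZ2"), (3, "RAIDZ3")):
    for data in (2, 4, 8, 16):
        width = data + parity
        print(f"{name} {width:>2}-wide: {data}/{width} == {data / width:.0%}")
# For RAIDZ3: 5-wide == 40%, 7-wide == 57%, 11-wide == 73%, 19-wide == 84%.
```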
Personally, I’d do two-wide mirrors in 15 bays: seven of them, plus a hotspare if I was really itching to fill that fifteenth bay. But I value performance, fast resilvering, and ease of capacity upgrades well beyond the additional capacity you get with RAIDz2 (or, for that matter, the additional parity, though the additional parity is the most tempting part).
You could opt instead for 5x3-wide Z1. That gets you most of the performance of mirrors, with even better storage efficiency than six-wide Z2, at the cost of a little less fault resistance than mirrors.
Or, the safe route: two 6-wide Z2, with or without a spare in the last bay.
I can’t really pick one of those for you directly; you have to answer for yourself which properties you value most. I do think those are the only three configs worth serious consideration for what I hear from your goals and 15 total bays, though.
If you decide to go mirrors, you might also consider NOT filling every single bay right up front. Four mirrors will get you more throughput than you know what to do with, and if that’s enough capacity to last you for a couple of years, you’ll have larger drives for less money per TiB available a few years later, and can just add more mirrors. Eventually, you start replacing the oldest mirrors–still getting capacity upgrades with every two drives you replace.
Thank you so much! You have helped me a lot to understand the topic. I think I will go with my non-optimal VDEV until it is 70% full, then move my data to a new 6-wide RAIDz2, and then repurpose my old disks as a second 6-wide RAIDz2.
I hope you feel refreshed after a few months of rest. I have two follow-up questions.
Does BitTorrent piece size have any effect on ZFS performance? I know only the torrent creator can set it, but if I were in a situation where I could theoretically change it, is there a good answer in terms of performance and ZFS?
Let’s say you want to go big: is there a feasible layout to get 10Gbps (10, not 1) throughput with spinning disks without sacrificing too much usable space? Not that I plan to do this, but I want to understand ZFS performance better. My beginner’s mind thinks that if I had a big server with 60 enterprise drives that each do roughly 150 random IOPS (realistic?) and I split them into 10 x 6-wide RAIDz2 vdevs, I would have 150 x 10 IOPS. Assuming the data is equally requested and I set a recordsize of 4M (meaning each data disk has 1M), would all the disks be able to provide 150 IOPS x 10 x 1M = 1500M? I know I am way off here, but I would like to know how the math works. If 1500M is the right approximate answer, then we would be in the 10Gbps range.
Correct, you can’t set the piece size as a BitTorrent consumer. But if you could, you’d get the best performance out of a larger (>1MiB) piece size, matched to a large (1M) recordsize. An even larger recordsize might get you somewhat higher performance in some cases, but in my experience you get very little gain from blocksize or recordsize larger than 1M, whereas the potential harm due to latency spikes continues to rise. I do not generally recommend recordsize larger than 1MiB, especially without careful testing against the specific workload in play.
You’re on essentially the right track, just keep in mind that you’re talking rules of thumb, not hard numbers. You may not get exactly what you expect, but the scaling will generally follow the train of thought you laid out here. The biggest issue is that you may see significantly LOWER IOPS out of a raidz vdev than you see out of a single disk of the same class. What this means for your particular workload is something you’re not going to find out for sure without testing… And keep in mind, recordsize=1M on a six-wide Z2 doesn’t mean writing 1M to each individual drive; it means writing 256KiB to each individual drive, which is not optimal for rust drive throughput.
If you’re serious about getting extremely high performance, you may need to go with mirrors. You can also try recordsize=4M, if your version of OpenZFS is modern enough to support it. That puts the amount of data written to each individual drive at 1MiB at a time, which is optimal for rust… But depending on your workload, again, you may find that you’ve traded the higher potential top-end throughput of the larger recordsize for nasty latency spikes, which may or may not also result in worse throughput overall.
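To make the scaling logic concrete, here’s a rough sketch of the rule-of-thumb math for your hypothetical 10 x 6-wide Z2 pool. The 150 IOPS figure and the assumption that a vdev delivers roughly single-drive IOPS are illustrative guesses, not measurements, and real numbers can land well below this:

```python
# Rule-of-thumb throughput math for the hypothetical 60-drive pool discussed
# above (10 x 6-wide RAIDZ2). All figures are illustrative assumptions, not
# measurements: ~150 random IOPS per vdev (it can easily be lower than a
# single drive's), and a perfectly parallel workload with no other bottleneck.

IOPS_PER_VDEV = 150
VDEVS = 10
DATA_DRIVES_PER_VDEV = 4            # 6-wide RAIDZ2

for recordsize_kib in (1024, 4096):                          # 1M vs 4M recordsize
    per_drive_kib = recordsize_kib / DATA_DRIVES_PER_VDEV    # write size per drive
    pool_mib_s = IOPS_PER_VDEV * (recordsize_kib / 1024) * VDEVS
    print(f"recordsize {recordsize_kib // 1024}M: {per_drive_kib:.0f} KiB per drive per block, "
          f"~{pool_mib_s:.0f} MiB/s rough pool ceiling")
# recordsize 1M: 256 KiB per drive,  ~1500 MiB/s ceiling
# recordsize 4M: 1024 KiB per drive, ~6000 MiB/s ceiling (before CPU/network limits)
```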
Essentially, I can give you recommendations and guidelines, but you want to operate at a scale that’s going to require some testing and potential adjustments, if you are genuinely serious about the need to saturate or near saturate 10Gbps or higher interfaces.
A (next to) last warning: at 10Gbps or more, the storage stack is not your only bottleneck. You also need a fast processor with a ton of PCIe lanes, and a workload that can manage multi-threaded network connectivity (e.g. pNFS); most processors aren’t capable of breaking 5-6Gbps on a single CPU thread, in my experience, whether or not the storage stack itself is even involved.
You may also need to mess around with jumbo frames, which is another can of worms that can be a huge pain all on its own. Fair warning.
Why is it a good idea to match BitTorrent piece size and ZFS record size? I was thinking that BitTorrent’s piece size is just a balance between metadata overhead and sharing efficiency. Or is there more to the piece size that affects how ZFS stores the data?
I wanted to go with rs=4M so that each disk gets a 1M chunk, so the thing you pointed out was just a misunderstanding between us. Great to hear you say that my rule-of-thumb calculation is the right approach. What random IOPS (which are essential for torrents) can I expect from enterprise hard drives in a 6-wide RAIDz2? Say the Seagate Exos series.
For seeding, near-matching piece and block size is potentially valuable because it reduces read amplification (having to read a bunch of unnecessary data in order to get the data you need for a much smaller request).
Regardless of read amplification concerns, you want a large recordsize on stuff you’re acquiring via torrent, whether you later intend to seed or not. This is because the mechanics of the BitTorrent protocol tend to result in maximal fragmentation, so an increased blocksize means your random reads happen at a larger blocksize, and large-block I/O offers significantly higher throughput than small-block I/O does.
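As a rough worst-case illustration of that read amplification point (ignoring ARC caching and request coalescing, and with made-up request sizes):

```python
# Worst-case read amplification when a torrent peer requests a piece smaller
# than the dataset's block size (ZFS reads whole blocks). Ignores ARC caching
# and request coalescing; the request sizes below are just examples.

def read_amplification(request_kib: int, recordsize_kib: int) -> float:
    """Rough worst-case bytes read from disk per byte actually requested."""
    return max(recordsize_kib / request_kib, 1.0)

RECORDSIZE_KIB = 1024                               # recordsize=1M
for piece_kib in (256, 1024, 4096):
    ra = read_amplification(piece_kib, RECORDSIZE_KIB)
    print(f"{piece_kib:>4} KiB request vs 1M blocks -> {ra:.0f}x read amplification")
# 256 KiB -> 4x; 1 MiB (matched) -> 1x; 4 MiB -> 1x (spans several whole blocks).
```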
This is really a little too vague. RAIDz performance varies pretty drastically according to workload. All I can really tell you is to expect IOPS similar to or moderately lower than a single disk of the vdev would offer, and that you can generally expect around 150-200 IOPS from most decent modern rust.
If your workload is essentially exclusively torrent acquisition and seeding, without further information I would cautiously expect one 6w Z2 vdev on rust at RS=4M to offer around 400-600 MiB/sec throughput, total (meaning read and write combined). Assuming a sufficiently parallel workload–which seeding certainly ought to be–I’d cautiously expect roughly linear scaling with additional matching vdevs.
This is all, of course, assuming you don’t hit a specific bottleneck somewhere, like (but not limited to) the issue with the maximum network throughput per individual CPU thread that I mentioned briefly above.