ZFS Performance Tuning for BitTorrent

I don’t know this gentleman, but it makes a lot of sense to plan and understand a system. That is why I appreciate your answers here so much. Just sounds like a lot of feelings…

I understand your example (at least I hope I do), but are there not two things to consider here? 1) the optimal layout, 2) the total amount of data disks vs. parity disks. If I were to go with 2 6-wide RAIDz2 VDEVs, I would have bought 12 disks and “lost” 4 for parity. So there are 8 data disks total. If I went with a single 12 wide RAIDz2, I would only spend 2 disks on parity and still have 10 disks for data.

So overall, would the 12 wide RAIDz2 still have a worse efficiency, even though I would be “tanking” some non-optimal configurations with two additional data disks?

Also, how much of a concern is the poor efficiency of wide VDEVs if I have a larger block size?

And is blocksize the same as recordsize, or is blocksize the number of data chunks split from the equation recordsize / number of data disks?

Thanks.

It’s complicated.

If you’ve got two 6-wide Z2 vdevs, you’ve got eight drives’ worth of data, with no padding necessary, and (assuming recordsize is at least 16K) nearly all of your data will be able to stretch across the entire stripe. Your metadata blocks will only have 33% storage efficiency, as will any single-sector files. Your two- and three-sector files will also take a small hit to the naively expected 67% SE, since they will be at 50% (two data sectors + two parity sectors) and 60% (three data sectors + two parity sectors), respectively. But for most datasets, the vast majority will be a nice clean 67% SE, no fuss no muss.
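If it helps to see where those small-block figures come from, here’s a quick sketch of the simplified arithmetic I’m using above (data sectors divided by data-plus-parity sectors, assuming 4KiB sectors / ashift=12, and ignoring allocation padding and compression):

```python
# Simplified storage-efficiency model behind the numbers above:
# SE = data sectors / (data sectors + parity sectors), RAIDz2, 4 KiB sectors.
# This ignores allocation padding and compression; it's just the arithmetic
# behind the 33% / 50% / 60% / 67% figures for very small blocks.
PARITY = 2           # RAIDz2
SECTOR = 4096        # assuming ashift=12

for size in (4096, 8192, 12288, 16384):   # 1-, 2-, 3-, and 4-sector blocks
    data_sectors = -(-size // SECTOR)     # ceiling division
    se = data_sectors / (data_sectors + PARITY)
    print(f"{size // 1024:>2} KiB block: {data_sectors} data + {PARITY} parity -> {se:.0%} SE")
```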

If you do a twelve-wide Z2, though, every full-stripe write will require some padding, and every file smaller than 40KiB will be stored in a narrower stripe than the naive expectation. Ignoring the (relatively) small files for now and looking at the padding, you’ll wind up with inefficiency roughly like this:

128KiB block / 10 data drives == 12.8KiB per drive
12.8KiB / 4KiB/sector == 3.2 sectors per drive

You can’t store data in “0.2 sectors”, so that means you’re really storing data in 4 sectors per drive. 3.2 sectors / 4 sectors == 0.8 == 80%, so you’re only 80% efficient there on top of the 10/12 == 83.3% SE you were normally expecting. Put those together, and you’re at 67% SE… the exact same 67% SE you had with two six-wide Z2, but now with worse performance and more uncertainty about what you’ve actually got available to you.
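If you want to plug in your own numbers, here’s that same back-of-the-envelope math in a short Python sketch. It uses the per-drive rounding model from the example above (4KiB sectors / ashift=12 assumed), not a full model of RAIDz allocation:

```python
import math

# The same back-of-the-envelope math as above, using the per-drive rounding
# model: split a full block across the data drives, then round each drive's
# share up to whole 4 KiB sectors.
SECTOR = 4096

def full_stripe_se(width, parity, block_bytes):
    data_drives = width - parity
    per_drive = block_bytes / data_drives                 # e.g. 12.8 KiB
    sectors_used = math.ceil(per_drive / SECTOR)          # e.g. 4 sectors
    padding_eff = per_drive / (sectors_used * SECTOR)     # e.g. 3.2 / 4 == 0.8
    naive_se = data_drives / width                        # e.g. 10 / 12
    return padding_eff * naive_se

print(f"12-wide Z2, 128 KiB blocks: {full_stripe_se(12, 2, 128 * 1024):.1%}")   # ~66.7%
print(f" 6-wide Z2, 128 KiB blocks: {full_stripe_se(6, 2, 128 * 1024):.1%}")    # ~66.7%
```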

Why is the performance worse? Well, you went from two vdevs to one, which means you cut your IOPS in half. On top of that, you’re writing less data per individual drive for each block, which means each individual drive suffers considerably worse performance and gets fragmented more quickly.

Can you survive all that? Sure. Would you have been better off with the pair of six-wide? Absolutely; they’re simpler, significantly higher performance, less confusing, AND more resistant to failure (since you can lose not only any two disks, but up to four disks if you’re lucky about which ones you lose).

Also, how much of a concern is the poor efficiency of wide VDEVs if I have a larger block size?

The larger the blocksize, the smaller the padding losses and the per-drive performance issues. But you can’t get away from the fact that your IOPS are worse: a larger blocksize improves the performance of each individual drive, but does nothing about having one vdev’s worth of IOPS instead of two.
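To put rough numbers on the padding side of that, here’s the same simplified per-drive rounding model from the 12-wide example earlier, run at a few hypothetical recordsizes (padding only; small files and compression ignored; ashift=12 assumed):

```python
import math

# How the effective full-stripe SE of a 12-wide RAIDz2 changes with
# recordsize, in the same simplified per-drive rounding model as above.
SECTOR, WIDTH, PARITY = 4096, 12, 2
DATA_DRIVES = WIDTH - PARITY

for rs_kib in (128, 256, 512, 1024):
    per_drive = rs_kib * 1024 / DATA_DRIVES
    sectors = math.ceil(per_drive / SECTOR)
    se = (per_drive / (sectors * SECTOR)) * (DATA_DRIVES / WIDTH)
    print(f"recordsize={rs_kib:>4}K -> ~{se:.1%} effective SE")
```

The padding loss shrinks toward the naive 83.3% as recordsize grows, but nothing in there changes the fact that you’ve got one vdev’s worth of IOPS instead of two.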

And is blocksize the same as recordsize

OpenZFS devs don’t ever talk about records, and mind-bogglingly enough, tend to get a little confused if YOU talk about records. To an OpenZFS dev (and you’ll see this if you ever look at the code), everything is stored in blocks. “Recordsize” is the tunable parameter that allows you to set the size of a block in a dataset.

That’s not the only kind of block–if you use zvols, you’ll set blocksize in them using the volblocksize property. Volblocksize is essentially just like recordsize, but it’s for zvols, not filesystem datasets, and unlike recordsize it’s immutable. You have to set volblocksize at zvol creation time, and once the zvol has been created, you can’t modify the volblocksize.

or is blocksize the number of data chunks split from the equation recordsize / number of data disks?

Unfortunately, OpenZFS devs don’t actually have terminology for what you’re reaching after here. In mdraid, what you’re describing is “chunk size”. OpenZFS doesn’t really concern itself directly with chunks the way mdraid does, because OpenZFS doesn’t write partially-empty stripes.

Mdraid–or other forms of conventional RAID–will store, for example, a 4KiB file on a 6-wide RAID6 just like it would store a 128KiB file… split into pieces across the entire stripe width. If, later, you want to store a second 4KiB file, the RAID controller will slide that file into the empty space around the first 4KiB file in the same stripe, recalculate the parity, and you actually haven’t used any additional room when you store the second file (but you were INCREDIBLY wasteful when you stored the first one, since it ate an entire stripe worth of storage space).

OpenZFS, on the other hand, has dynamic stripe width. So it isn’t as concerned with chunk size, because it will simply store smaller files in a narrower stripe (as described above, talking about eg 4KiB files on a Z2). Since it never has to rewrite a stripe with additional data crammed in (I hypothesize) its devs never really had to be so concerned with the specific amount of data per disk that they actually set a term for it.

So, yeah, technically there is no such thing as “chunk size” on RAIDz. Where on an mdraid6 you might specifically set chunk size to 256KiB in order to write up to 256KiB to each disk in a single stripe (opening the door for reading / modifying / recalculating parity / rewriting even when multiple files are stored on the same DISK in the same stripe!), in ZFS you simply split any given block up as evenly as possible amongst all available drives, UNLESS the file is too small to need the entire stripe, in which case it’s written one sector per drive across only as many drives as it needs.
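If it helps to visualize that dynamic stripe width, here’s a rough sketch (same simplified model as before, hypothetical block sizes, 4KiB sectors / ashift=12 assumed) of how many drives various block sizes would actually touch on a 12-wide Z2:

```python
import math

# Rough sketch of dynamic stripe width on a 12-wide RAIDz2: a block only
# occupies as many drives as it needs (data sectors + parity), up to the
# full data width.
SECTOR, WIDTH, PARITY = 4096, 12, 2
DATA_DRIVES = WIDTH - PARITY

for size_kib in (4, 16, 40, 128):
    data_sectors = math.ceil(size_kib * 1024 / SECTOR)
    drives_touched = min(data_sectors, DATA_DRIVES) + PARITY
    print(f"{size_kib:>3} KiB block: {data_sectors:>2} data sectors "
          f"-> spans {drives_touched} of {WIDTH} drives")
```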

Have I mentioned that this stuff is complicated? 🙂


Man, you are a legend! It is really complicated, but after reading it 5 times (and understanding a little more each time) I think I understand what you are telling me here.

Just to confirm: the three-sector case is 60% because it’s three data sectors plus two parity sectors, right?

I understand the performance side; two vdevs allow double the write performance (IOPS and sequential), right? The data still needs to be read from one VDEV, so there is no read performance benefit. I would add more mirrors for that. I hope that was correct, but please correct me if I am wrong.

But what do you mean by the other part? That I have a higher chance of not losing data because I have more parity disks?

Read or write IOPS?

Good point! I never thought of that!

Because the padding is not needed for most of the blocks and only for the last non-optimal sector?

Makes sense, because I miss the second VDEV.

So to sum up: a 12-wide RAIDz2 has the same SE as two 6-wide RAIDz2 vdevs, but the 2x6 combination has better performance for the individual disks, better performance overall because it’s two VDEVs, AND better failure guarantees (4 parity disks). I just see more raw capacity on the 12-wide, while effectively they are the same?

So how would you structure a 15-slot storage server? I have 2 NVMe disks for a SLOG device. 2x6-wide RAIDz2, 1 spare, and two additional mirrored SATA SSDs as a metadata VDEV?

What is the best layout for RAIDz3? I assume 5 wide, 7 wide, 11 wide or 19 wide?

Have you ever seen a 19 wide? Is that too crazy?

For any RAIDz, the optimal widths are 2+p, 4+p, and 8+p, where p is the parity level 1, 2, or 3. So, yes, ideal widths for RAIDz3 are 5, 7, and 11.
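The reason those widths come out clean is that the usual power-of-two blocksizes divide evenly across 2, 4, or 8 data disks. A quick check, assuming 4KiB sectors and the default 128KiB recordsize:

```python
# Why 2+p, 4+p, and 8+p come out clean: a default 128 KiB block is 32
# four-KiB sectors (assuming ashift=12), which divides evenly across 2,
# 4, or 8 data disks but needs rounding (and therefore padding) at most
# other widths.
DATA_SECTORS = 128 * 1024 // 4096   # 32

for data_disks in range(2, 13):
    result = "even split" if DATA_SECTORS % data_disks == 0 else "needs rounding/padding"
    print(f"{data_disks:>2} data disks: {result}")
```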

19 wide is absolutely batshit, yes. In order to even take advantage of a stripe that wide, you’d need an absolutely MASSIVE recordsize, which would in turn present some pretty large latency issues on anything but massive files which needed to be written to and read from in their entirety, not in smaller pieces. And the actual returns you get on storage efficiency even if you do all that right are just vanishingly small.

Even without taking into account all the potential ways to lose it to inefficiency, the naive SE expectation for 11-wide Z3 is 8/11 == 73%. For 19-wide Z3 it’s 16/19 == 84%. It might be worthwhile if you could actually reap that entire 11%, but even then, only marginally because you’d still be saddled with increasingly terrible performance AND lower failure resistance. And for all the reasons we’ve already talked about, you’re not actually going to get that full expectation of an additional 11% efficiency, because it’s not going to help you any with undersized blocks that don’t occupy the full stripe width.
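To make that small-block point concrete, here’s a sketch in the same simplified model (hypothetical block sizes, 4KiB sectors / ashift=12 assumed; the 1024 KiB case stands in for a big full-stripe block): anything too small to span the wider stripe gets treated identically on an 11-wide and a 19-wide Z3, so the extra naive SE only shows up on big blocks.

```python
import math

# 11-wide vs 19-wide RAIDz3 in the same simplified model: blocks too small
# to span the full data width get an identical narrow stripe either way,
# so the wider vdev's extra naive SE only shows up on large blocks.
SECTOR, PARITY = 4096, 3

def se(width, block_bytes):
    data_disks = width - PARITY
    data_sectors = math.ceil(block_bytes / SECTOR)
    if data_sectors >= data_disks:                    # full-stripe write
        per_drive = block_bytes / data_disks
        used = math.ceil(per_drive / SECTOR) * SECTOR
        return (per_drive / used) * (data_disks / width)
    return data_sectors / (data_sectors + PARITY)     # narrow stripe

for kib in (4, 16, 32, 1024):
    print(f"{kib:>4} KiB block: 11-wide {se(11, kib * 1024):.0%} "
          f"vs 19-wide {se(19, kib * 1024):.0%}")
```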


What about the other questions and how would you layout 15 slots? Thanks!

Too many questions, I’m exhausted. 🙂

Personally, I’d do two-wide mirrors in 15 bays; seven of them, plus a hotspare if I was really itching to fill that fifteenth bay. But I value performance, fast resilvering, and ease of capacity upgrades well above the additional capacity you get with RAIDz2 (or, for that matter, the additional parity, though the additional parity is the most tempting part).

You could opt instead for 5x3-wide Z1. That gets you most of the performance of mirrors, with even better storage efficiency than six-wide Z2, at the cost of a little less fault resistance than mirrors.

Or, the safe route: two 6-wide Z2, with or without a spare in the last bay.

I can’t really pick one of those for you directly; you have to answer for yourself which properties you value most. I do think those are the only three configs worth serious consideration for what I hear from your goals and 15 total bays, though.

If you decide to go mirrors, you might also consider NOT filling every single bay right up front. Four mirrors will get you more throughput than you know what to do with, and if that’s enough capacity to last you for a couple of years, you’ll have larger drives for less money per TiB available a few years later, and can just add more mirrors. Eventually, you start replacing the oldest mirrors–still getting capacity upgrades with every two drives you replace.


Thank you so much! You have helped me a lot to understand the topic. I think I will stay on my non-optimal VDEV until it is 70% full, then move my data to a new 6-wide RAIDz2, and then repurpose the old disks as a second 6-wide RAIDz2.

And all this just for the Linux ISOs.
