How to lay out a 60-drive RAIDz3 spinning rust pool for maximum available space

Looking for thoughts on the best ZFS layout for a 45Drives XL60 to maximize the available storage space. This will be using 20+ TB rust drives and is one piece of a long-term file backup strategy for retention purposes. Once files are copied to it, the chance of them being accessed is very minimal, and they will never be modified.

One suggestion is to just go super wide and set up four 15-drive RAIDz3 vdevs in the pool with no spares, which should give the largest usable capacity. Since super-wide vdevs are usually not recommended due to the long rebuild times, do we go with six 10-wide RAIDz3 vdevs instead? This decreases capacity but should increase drive-failure tolerance, decrease rebuild time, and increase performance. For this use case, the increase in performance from the extra vdevs is really not needed.
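As a back-of-the-envelope comparison of those two options (a sketch only: it counts data drives and raw capacity, ignoring RAIDz padding, metadata, slop space, and free-space headroom; the 20 TB figure is just the drive size mentioned above):

```python
# Back-of-the-envelope data-drive comparison for the two candidate layouts.
# This deliberately ignores RAIDz allocation padding, metadata overhead,
# and TB-vs-TiB conversion -- it only shows the relative capacity gap.

DRIVE_TB = 20  # assumed drive size

layouts = {
    "4 x 15-wide RAIDz3": (4, 15, 3),
    "6 x 10-wide RAIDz3": (6, 10, 3),
}

for name, (vdevs, width, parity) in layouts.items():
    data_drives = vdevs * (width - parity)
    print(f"{name}: {vdevs * width} drives total, "
          f"{data_drives} data drives, ~{data_drives * DRIVE_TB} TB raw data capacity")
```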

I have no experience yet using DRAID, but would this be a better solution for a system with this many drives? How well is DRAID supported and is it considered mature enough for production use yet? What kind of DRAID layouts would be suggested for this use case?

Five 11-wide Z3 + (optional) hotspares in the remaining bays. This will not be an effective space maximizer unless you set recordsize to AT LEAST 1M and almost all of the files you store on it are AT LEAST 1M or larger themselves, due to the way OpenZFS splits each block (aka the “record” in “recordsize”) into pieces to stripe across the vdev.

You don’t want 12-wide through 18-wide because, after subtracting the three parity drives, that leaves 9 through 15 data drives, and you can’t split a power-of-two block evenly into 9, 10, 11, 12, 13, 14, or 15 pieces, so you have to waste space on padding in every block.

You don’t want 19-wide vdevs because although you can split a block evenly into 16 pieces, even a 1M block ends up split into only 64K per drive, which is much less than ideal for performance–especially with the very large files that are about all it makes any kind of sense to store on such a wide Z3.
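To put rough numbers on that, here’s a quick sketch of the splitting model described above. It’s a simplification (4K sectors assumed, and it ignores the extra per-block allocation rounding and per-row parity the real RAIDz allocator does), so treat the output as illustrative only:

```python
# Rough model of how a 1M record gets striped across a RAIDz3 vdev:
# the data is split across (width - 3) data drives in whole 4K sectors,
# and anything that doesn't divide evenly is padded up. This is the
# simplified picture described above, not an exact reproduction of the
# OpenZFS allocator.

SECTOR = 4096          # assumes ashift=12 (4K sectors)
RECORDSIZE = 1 << 20   # 1M recordsize
PARITY = 3             # RAIDz3

for width in range(10, 20):
    data_drives = width - PARITY
    # per-drive chunk, rounded up to whole sectors
    sectors_per_drive = -(-RECORDSIZE // (data_drives * SECTOR))
    allocated = sectors_per_drive * SECTOR * data_drives
    padding_pct = 100 * (allocated - RECORDSIZE) / RECORDSIZE
    print(f"{width:>2}-wide Z3: {data_drives:>2} data drives, "
          f"{sectors_per_drive * SECTOR // 1024:>4}K per drive, "
          f"~{padding_pct:4.1f}% padding per 1M record")
```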

At 60 bays, you are in potential DRAID territory. But only barely, and I strongly advise against trying to wrap your head around DRAID in production before you thoroughly understand classic RAIDz.


I requested a full file size distribution from the app team, but from what I can determine so far the smallest file to be stored would be 50 MB.

The other option floated by someone was to use a Ceph cluster of multiple smaller nodes instead of large 60+ drive servers. This seemed like overkill and overly complicated for what essentially will be a static archive with little access.

Yeah, good choice to eschew extra complexity.

This is really interesting! I’m only experienced with volumes up to 12 drives. Is there a paper or talk you can drop a link for?

As much as I tend to question the validity of Z3 at all, this is definitely a case that calls for wide RAIDz vdevs.

Personally, I’d probably go with five 10-wide Z2 vdevs and a few hotspares, instead of Z3. Z2 already gives you two guaranteed survivable failures per vdev, and the hotspares can automatically kick in to rescue any vdev in the pool–whereas the third parity drive in a Z3 vdev is utterly useless if you lose three or four drives out of a different vdev.

(Why three drives, when we’re talking Z3 not Z2? Even if you don’t lose the entire vdev, as soon as you lose the last level of parity, the entire vdev is “uncovered.” An uncovered vdev not only cannot repair any new corruption that occurs while it’s uncovered, but any existing corruption that hasn’t yet been discovered by a read or a scrub instantly becomes permanently irreparable.)

Given that this is, if described accurately, a specialty server well-suited to wide RAIDz and very large recordsize, I’m much happier about both the width of the vdevs and their recordsizes, and in fact would recommend trying recordsize=4M. The Z3 is also less obnoxious than usual, because in this case dropping the parity level wouldn’t actually buy you an additional vdev (which is what would normally get you considerably more real-world fault tolerance, as well as additional performance); a twenty-bay box, by contrast, can have either a single 19-wide Z3 or two 10-wide Z2.

Depending on the performance level desired, this would also be a good case for the special vdev, which I do not often recommend. The special will store all your metadata, which otherwise would need to go on four-wide undersize stripes in your 11-wide or 19-wide vdevs. This improves your experienced storage efficiency, and may provide some significant improvements to experienced latency on some operations also.

The important things to remember about the special, should you decide to deploy it:

  • losing the special loses the entire pool. No coming back. No kidding. Be aware.
  • due to the above fact, a special should ALWAYS offer just as much fault tolerance as the main vdevs… which in the case of Z3 main vdevs, means you need a four-way mirrored special.
  • Although many large systems are using the special and their admins swear by it, my own testing on smaller (eight- to twelve-bay) systems has been unable to reproduce significant performance benefits, and has uncovered some bugs that led to entire pool loss in my test environments.
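If you do end up going this route, the overall shape of such a pool might look something like the sketch below. To be clear, this is illustrative only: the pool name and device paths are hypothetical placeholders, recordsize=4M assumes an OpenZFS build whose zfs_max_recordsize permits blocks that large, and you’d adjust the spare/special arrangement to your actual hardware.

```python
# Sketch: assemble a zpool create command for five 11-wide RAIDz3 vdevs,
# a 4-way mirrored special vdev, and one hot spare (55 + 4 + 1 = 60 devices).
# Device names below are hypothetical placeholders -- on a real system,
# use stable /dev/disk/by-id/ paths for your actual drives.

hdds = [f"/dev/disk/by-id/HDD{i:02d}" for i in range(55)]   # placeholder rust drives
ssds = [f"/dev/disk/by-id/SSD{i}" for i in range(4)]        # placeholder special-vdev devices
spare = "/dev/disk/by-id/HDD55"                             # placeholder hot spare

cmd = ["zpool", "create", "-o", "ashift=12", "-O", "recordsize=4M", "tank"]

# five 11-wide RAIDz3 data vdevs
for v in range(5):
    cmd += ["raidz3"] + hdds[v * 11:(v + 1) * 11]

# 4-way mirrored special vdev (matching the 3-failure tolerance of the Z3 vdevs)
cmd += ["special", "mirror"] + ssds

# one hot spare in a leftover bay
cmd += ["spare", spare]

print(" ".join(cmd))
```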

Good luck–and we’d love to hear a report back with your results!

Not off the top of my head. Not to be dismissive–seriously!–but it essentially boils down to “understand the way RAIDz works, and think it through.” If you really grok RAIDz functionality, without needing to constantly refer to explainers, the implications are pretty clear–and you can see them borne out, for example, in my own admittedly only eight-bay tests at Ars Technica.

See what happens to 1MiB read throughput, as you add drives to a RAIDz vdev (as opposed to what happens when you add vdevs to a pool)? That trend line only continues to get worse!

This isn’t a ZFS specific problem, either. I’m going to shift from 8-process 1MiB reads to 8-process 1MiB writes, now, because RAID6 does well at reads while struggling with writes, as opposed to RAIDz doing well with writes but struggling with reads… but keep in mind, either way, significantly parallel 1MiB random async I/O with a decent queue depth (numjobs=8, iodepth=8 in this case) is pretty much the EASIEST possible storage workload on a random-access device!
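If anyone wants to run something in the same ballpark themselves, that workload looks roughly like the following as an fio run (a sketch only: the target path, file size, and runtime are made up, and the flags are my reconstruction of the parameters described above rather than the original job file):

```python
import subprocess

# Approximation of the "numjobs=8, iodepth=8, 1MiB random async write" workload
# described above. Target path and sizes are hypothetical -- adjust to taste.
fio_cmd = [
    "fio",
    "--name=parallel-1M-randwrite",
    "--filename=/tank/fio.test",   # hypothetical test file on the pool
    "--rw=randwrite",              # random writes (swap to randread for the read test)
    "--bs=1M",                     # 1MiB blocks
    "--ioengine=libaio",           # async I/O engine
    "--iodepth=8",                 # queue depth per job
    "--numjobs=8",                 # eight parallel processes
    "--size=8G",                   # per-job working set (made up)
    "--runtime=60", "--time_based",
    "--group_reporting",
]
subprocess.run(fio_cmd, check=True)
```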

Now, look at what happens to RAID6 when we’re testing numjobs=8, iodepth=8 on 4K sync writes, instead of 4K reads…

… and while we’re at it, don’t just look at the shape of the graph… look at the raw numbers. 0.6 MiB/sec for a three-drive RAID5, under 0.4 MiB/sec for an eight-wide RAID6. Yikes.

If you’re not depressed enough yet, I invite you to look at what happens to single process operations as a diagonal parity RAID array gets wider… and again, don’t just look at the shape of the charts (which is bad enough), look at those numbers on the Y axis!

TL;DR: contrary to many, many, many storage admins’ strongly-held but naive expectations, diagonal parity RAID is not for performance–it’s for uptime and capacity! If you want performance, you need more vdevs if using ZFS, or a dual-layer RAID algorithm (RAID10, RAID50, RAID60) if using conventional RAID.
