I’m planning a semi-major layout change for a new build. Let’s say I have 20 hard drive bays, I want to fill all of them, and I have a ton of options to choose from.
We have options like:
4x5 Z1
5x4 Z2
2x8 Z2
2x10 Z1
2x10 Z2
etc.
I understand that the more I split it up, the more redundancy I get, plus easier pool expansion later on, at the cost of usable space. But are there other technical reasons NOT to make something crazy like a 2x10-wide Z2 pool? What kinds of overhead start to come into play at wider vdev widths?
Generally speaking, performance scales with vdev count, not with individual drive count. So, long story short, the more vdevs, the higher the performance.
If you want higher storage efficiency (more storage for the same number of drives), you want wider vdevs, preferably at an optimal width, where the number of drives in each vdev is a power of two after deducting parity: so 3-wide Z1, and 4-, 6-, or 10-wide Z2. But expect your performance to decrease along with it. Also, with wider vdevs, don’t necessarily expect to get as much storage out of the deal as it looks like on paper. For example, if you store a 4KiB file on a Z2 vdev, it’ll occupy 12KiB, a horrifyingly poor 33% storage efficiency, no matter how wide the vdev is, because undersize blocks go on undersize stripes.
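To make the undersize-stripe effect concrete, here’s a rough sketch of how raidz allocates sectors for a single block (a simplified model of OpenZFS’s raidz allocation math; it assumes ashift=12, no compression, and ignores gang blocks, so treat the exact numbers as illustrative):

```python
SECTOR = 4096  # bytes per sector with ashift=12

def raidz_asize(psize_sectors: int, width: int, parity: int) -> int:
    """Sectors allocated for one block on a raidz vdev (simplified).

    Each row of (width - parity) data sectors carries `parity` parity
    sectors, and raidz rounds the total allocation up to a multiple of
    (parity + 1) sectors.
    """
    data_width = width - parity
    full_rows, rem = divmod(psize_sectors, data_width)
    asize = psize_sectors + full_rows * parity + (parity if rem else 0)
    asize += -asize % (parity + 1)  # padding to a multiple of parity + 1
    return asize

# A 4 KiB block on a 10-wide Z2: 1 data + 2 parity sectors = 12 KiB (33%)
print(raidz_asize(1, 10, 2) * SECTOR // 1024, "KiB")
# A 128 KiB block on the same vdev: 42 sectors = 168 KiB (~76%, not the 80% on paper)
print(raidz_asize(32, 10, 2) * SECTOR // 1024, "KiB")
```

Even a full 128KiB record lands at about 76% efficiency on a 10-wide Z2 rather than the nominal 80%, because of the round-up padding.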
With 20 bays to work with, the most commonly recommended/recommendable layouts are:
10x 2-wide mirrors
6x 3-wide Z1 (with two bays open for auxiliary vdevs, spares, whatever)
5x 4-wide Z2
3x 6-wide Z2 (with two bays open)
2x 10-wide Z2
The above are ranked in order of decreasing performance.
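If it helps to see the efficiency trade-off in numbers, here’s a quick sketch tallying nominal usable capacity for those layouts. The 18 TB drive size is just an illustrative assumption, and this is raw stripe math that ignores padding overhead and reserved space:

```python
DRIVE_TB = 18  # hypothetical drive size, for illustration only

# (name, vdev count, vdev width, parity drives per vdev; a mirror "loses" 1)
layouts = [
    ("10x 2-wide mirrors", 10, 2, 1),
    ("6x 3-wide Z1",        6, 3, 1),
    ("5x 4-wide Z2",        5, 4, 2),
    ("3x 6-wide Z2",        3, 6, 2),
    ("2x 10-wide Z2",       2, 10, 2),
]

for name, vdevs, width, parity in layouts:
    data = vdevs * (width - parity)   # drives contributing usable space
    total = vdevs * width             # drives consumed by the layout
    print(f"{name}: {total} drives, ~{data * DRIVE_TB} TB usable "
          f"({data / total:.0%} efficiency)")
```

Note how the 2x 10-wide Z2 tops the usable-space list (16 data drives) while sitting at the bottom of the performance ranking, with only two vdevs’ worth of IOPS.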
I do not recommend ever going wider than three disks on a Z1 vdev; that’s simply not sufficient redundancy for the number of points of failure present.
@mercenary_sysadmin Thanks for that. You summarized in a few paragraphs something that I still didn’t completely understand the first go around after watching about half a dozen YouTube videos.
I went with the all-mirrors setup, despite the 50 percent storage efficiency. That’s still more than enough space for my needs.
One thing I didn’t consider when I first started is how long it takes, given the size of modern HDDs and their relatively slow speeds, to resilver a vdev when a disk needs to be replaced (that is, to sync everything up to the new disk so the vdev is back to the full level of redundancy that was chosen).
For example, on a traditional ext4 setup, I once had to replace a drive in a single mirror of 2x 14 TB disks. It took 24 hours to rebuild, and during that time the entire mirror was vulnerable to failure: I’d have lost everything if the second disk failed before the rebuild was done.
In a weird way, I’m kind of glad that disk failed when it did. It’d have been a lot less pleasant to learn that lesson with an oversized Z1 or something.
Mirrors resilver faster than any other type of ZFS vdev because they just need to copy all the data from one single disk to the new disk, which decreases the time the data in the mirror is vulnerable. Resilvering itself is an intense operation, so it arguably can put some disks at higher risk of failing – which you don’t want to have happen during resilvering. So, quicker is better.
(RAIDZ1/2/3 resilvers have to do a lot more parity calculation, so they can take much longer and are usually more compute-intensive and harder on the disks.)
An advantage of ZFS is that because it manages both the filesystem and the redundancy, it knows exactly which blocks need to be copied. Time saved depends on how full the pool is. Additionally, part of what its scrubs accomplish is to increase your confidence that something unreadable on a disk is not secretly waiting to impede your ability to restore a pool’s redundancy. Not that you needed persuading! But hopefully your resilvering experience is better with ZFS.
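For a rough feel of the numbers: a mirror resilver is essentially one copy of the used data at the disks’ sustained rate, so you can back-of-envelope it. The 160 MB/s figure below is just an illustrative sustained HDD rate, not a measurement:

```python
def mirror_resilver_hours(used_tb: float, mb_per_s: float) -> float:
    """Rough lower bound on mirror resilver time: one sequential-ish
    copy of the used data at the disks' sustained transfer rate.
    Ignores seek overhead from fragmentation and concurrent pool load."""
    return used_tb * 1e12 / (mb_per_s * 1e6) / 3600

# A full 14 TB disk at a sustained 160 MB/s: about a day
print(round(mirror_resilver_hours(14, 160), 1))  # ~24.3 hours
# A half-full ZFS mirror only copies what's used: roughly half that
print(round(mirror_resilver_hours(7, 160), 1))
```

This lines up with the 24-hour full-disk rebuild mentioned above, and shows why ZFS copying only the used blocks matters most on emptier pools.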
Anyway, when pondering pool geometry, if I can’t have a mirror pool, I tend to re-read ZFS RAIDZ stripe width, or: How I Learned to Stop Worrying and Love RAIDZ, remembering that it’s from 2014, though. Like Jim, I would not use raidz1 with more than three disks at today’s sizes, let alone five (maybe not even yesterday’s sizes; remembering that Xserve RAID, shudder), even with 512-byte sectors, tiny recordsizes, and no compression.
While I’m remembering things, I’ll also say I personally avoid making widths a multiple of the raidz level plus one (so not, for example, a 6-wide Z2), having once spent too much time trying to figure out why I seemed to have a hot spot on every fourth disk. That said, if that were actually worth worrying about, I imagine I would have seen someone else mention it, and I never have.
“…An advantage of ZFS is that because it manages both the filesystem and the redundancy, it knows exactly which blocks need to be copied. Time saved depends on how full the pool is…”
Meaning that on a traditional RAID mirror, the filesystem layered on top doesn’t matter, and the rebuild ran at a good 160 MB/s regardless. I just replaced a 16 TB HDD in a (working) 4x6 raidz2 pool that was 32% used, and it took 32 hours. So the advantage of “only what’s used gets resilvered” really only pays off when the pool is fairly empty; otherwise, don’t be surprised. Mirrored vdevs do better at resilvering, and much better in daily usage, than raidz ones.
Found this post when I googled for information about resilvering a pool made of several raidz1 vdevs, and why all the drives in the pool seemed to be scanning instead of just the 2 of 3 healthy drives in the raidz1 vdev where 1 of 3 drives was replaced.
I thought my pool of raidz1 vdevs was like a stripe of single disks, so why does ZFS need to scan the other disks in the stripe to fix the degraded raidz1? Can someone explain that ZFS logic?
I’m not sure I understand the question. I see two degraded vdevs in that pool, which means it needs to light up six drives (four read, two write) during this resilver.
What makes you think it’s lighting up the other three vdevs as part of the resilvering process? I only see significant activity on the two vdevs that are degraded:
Out of 1.44K pool read operations, 750 happen on raidz1-2 and 711 happen on raidz1-4. Out of 500 pool write operations, 191 happen on raidz1-2 and 242 happen on raidz1-4.
And since raidz1-2 and raidz1-4 are your two degraded vdevs, this lines up with expectations: the only significant read or write operations going on during the moment in time you captured are happening on the degraded vdevs being resilvered.
You understood my question correctly, and it seems you are right. Thanks.
Both of my vdevs will be resilvered soon.
scan: resilver in progress since Fri Oct 18 00:06:21 2024
61.7T / 61.8T scanned, 19.9T / 21.5T issued at 434M/s
6.65T resilvered, 92.51% done, 01:04:49 to go
This is true when you are considering IOPS, but not when you are considering throughput, i.e. GB/s.
IOPS is dependent upon the number of vdevs, but throughput (and especially write throughput) is dependent upon the number of data drives, excluding redundancy.
And generally speaking, IOPS is important when you are doing small I/Os i.e. reading and writing 4KB blocks because you are doing random database accesses or using virtual disks, but for sequential files throughput is the more important metric.
Furthermore, if you are doing 4KB reads and writes, then even more importantly than IOPS you need to avoid read amplification and even more importantly write amplification (where you want to read 4KB but end up reading 48KB, or you want to write 4KB and end up reading 48KB and then writing 48KB). And this means using mirrors rather than RAIDZ.
And of course if you are doing small random 4KB writes, your use case probably also needs synchronous writes - so you either need your data on SSD or you need an SSD SLOG.
By comparison, if you are doing sequential reads and writes, your files are typically going to be big enough to span multiple 4KB x RAIDZ width, and so you can read or write much more data in a single IOP which is much more efficient use of the disk bandwidth. And of course you also get sequential pre-fetch.
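The amplification arithmetic above can be sketched out. This is a simplified model of one random sub-record write under copy-on-write (the whole record is read, modified, and rewritten); it ignores ARC cache hits, compression, and metadata traffic, so the figures are illustrative rather than measured:

```python
def rmw_traffic_kb(io_kb: float, recordsize_kb: float):
    """Approximate (read KB, write KB) for one random write on a ZFS
    dataset. Copy-on-write checksums whole records, so a write smaller
    than the record triggers a full-record read plus a full-record write."""
    if io_kb >= recordsize_kb:
        return 0.0, io_kb  # aligned full-record overwrite: no read needed
    return recordsize_kb, recordsize_kb

# A 4 KiB write into a 48 KiB record: 48 KiB read, then 48 KiB written,
# i.e. 12x read amplification and 12x write amplification
print(rmw_traffic_kb(4, 48))
# Matching recordsize to the I/O size avoids the read entirely
print(rmw_traffic_kb(4, 4))
```

This is why small-random-I/O workloads tend to want mirrors plus a small recordsize/volblocksize, while big sequential workloads can happily span wide RAIDZ stripes.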
Then you need to consider space efficiency - RAIDZ is more space efficient and thus more cost effective than mirrors, and a 6-wide RAIDZ2 is better redundancy than 2x 3-wide RAIDZ1, and 12-wide RAIDZ2 has both better redundancy and greater storage efficiency than 4x 3-wide RAIDZ1.
Thus your pool design is NOT as simple as more vDevs = better. For most common use cases, you need to start from your use case, and follow some simple rules of thumb to create your pool design.
Whilst this is a true statement, it is IMO extremely misleading. If you want double redundancy, then a 4KB file will always occupy 12KB; on a 3-way mirror, this is still true. However, let’s compare the storage actually used for a 48KB file on various pool layouts with double redundancy, each of which has the same number of data disks. I picked 12 data drives because 12 has many factors and so allows many examples:
36x disks in mirror triples - 4KB data = 12KB storage, 48KB data = 144KB
24x disks in 6x 4-wide RAIDZ2 - 4KB data = 12KB storage, 48KB data = 96KB
20x disks in 4x 5-wide RAIDZ2 - 4KB data = 12KB storage, 48KB data = 80KB
18x disks in 3x 6-wide RAIDZ2 - 4KB data = 12KB storage, 48KB data = 72KB
16x disks in 2x 8-wide RAIDZ2 - 4KB data = 12KB storage, 48KB data = 64KB
14x disks in 1x 14-wide RAIDZ2 - 4KB data = 12KB storage, 48KB data = 56KB
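The list above follows from simple stripe arithmetic, which a short sketch can reproduce. Caveat on my part: this is the idealized ratio only; real raidz also pads each allocation up to a multiple of (parity + 1) sectors, which nudges some of these layouts (e.g. the 5-wide and 8-wide Z2) a bit higher than the on-paper figure:

```python
def ideal_raidz_kb(data_kb: float, width: int, parity: int) -> float:
    """Idealized raidz storage for data_kb: every (width - parity) data
    sectors carry `parity` parity sectors. Ignores allocation padding,
    compression, and metadata."""
    return data_kb * width / (width - parity)

print(ideal_raidz_kb(48, 4, 2))   # 4-wide Z2  -> 96.0 KB
print(ideal_raidz_kb(48, 6, 2))   # 6-wide Z2  -> 72.0 KB
print(ideal_raidz_kb(48, 14, 2))  # 14-wide Z2 -> 56.0 KB
print(48 * 3)                     # 3-way mirror -> 144 KB
```

The wider the stripe, the closer the multiplier gets to 1, which is the whole storage-efficiency argument for wide RAIDZ2 in one line.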
If storage efficiency and cost are not important, then even for larger sequential files mirrors will probably give you the best throughput: 3-way mirrors get you 3x the read throughput of RAIDZ and equal write throughput, but at a very high cost per TB. This is because a read from a mirror only needs to touch one disk (unless there is a checksum failure), so reads can be spread among the disks in a mirror vdev, whereas a read on a RAIDZ vdev requires data to be read from all the data drives in the stripe.
But once you use RAIDZ, throughput is based on the number of data drives excluding redundancy, so for all the above layouts (which have the same number of data drives and the same usable space), throughput will be roughly the same.
And unless you’ve got an extremely specialized and tightly controlled workload–such as a limited process data ingest–that can be handled sequentially with large block operations only, you’re going to hit IOPS bottlenecks before throughput bottlenecks pretty much every time.
The only real exception–again, apart from very specialized workloads–is when you’ve got extremely fast drives and a much slower controller in front of them (single-lane HBA, or–worse–slow NIC).
With respect, this is wharrgarbl. You cannot casually assert an explicit performance relationship between different topologies without so much as hinting at the workload in question, or even the number or width of the vdevs involved.