I’m planning a semi-major layout change for a new build. Let’s say I have 20 hard drive bays, I want to fill all of them, and I have a ton of options to choose from.
We have options like:
4x5 Z1
5x4 Z2
2x8 Z2
2x10 Z1
2x10 Z2
etc.
I understand that the more I split it up, the more redundancy I get, plus easier pool expansion later on, at the cost of usable space. But are there other technical reasons NOT to make something crazy like a 2x10-wide Z2 pool? What kinds of overhead start to come into play at wider vdev widths?
Generally speaking, performance scales with vdev count, not with individual drive count. So, long story short, the more vdevs, the higher the performance.
If you want higher storage efficiency (more storage for the same number of drives), you want wider vdevs, preferably at an optimal width, where the number of drives in each vdev is a power of two after deducting parity: so 3-wide Z1, and 4-, 6-, or 10-wide Z2. But expect your performance to decrease along with it. Also, with wider vdevs, don’t necessarily expect to get as much storage out of the deal as it looks like on paper. For example, if you store a 4KiB file on a Z2 vdev, it’ll occupy 12KiB, a horrifyingly poor 33% storage efficiency, no matter how wide the vdev is, because undersize blocks go on undersize stripes.
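To make the undersize-stripe effect concrete, here’s a rough sketch of how raidz allocates sectors for a single block (a simplified model of OpenZFS’s raidz allocation math; it assumes ashift=12, no compression, and ignores gang blocks, so treat the exact numbers as illustrative):

```python
SECTOR = 4096  # bytes per sector with ashift=12

def raidz_asize(psize_sectors: int, width: int, parity: int) -> int:
    """Sectors allocated for one block on a raidz vdev (simplified).

    Each row of (width - parity) data sectors carries `parity` parity
    sectors, and raidz rounds the total allocation up to a multiple of
    (parity + 1) sectors.
    """
    data_width = width - parity
    full_rows, rem = divmod(psize_sectors, data_width)
    asize = psize_sectors + full_rows * parity + (parity if rem else 0)
    asize += -asize % (parity + 1)  # padding to a multiple of parity + 1
    return asize

# A 4 KiB block on a 10-wide Z2: 1 data + 2 parity sectors = 12 KiB (33%)
print(raidz_asize(1, 10, 2) * SECTOR // 1024, "KiB")
# A 128 KiB block on the same vdev: 42 sectors = 168 KiB (~76%, not the 80% on paper)
print(raidz_asize(32, 10, 2) * SECTOR // 1024, "KiB")
```

Even a full 128KiB record lands at about 76% efficiency on a 10-wide Z2 rather than the nominal 80%, because of the round-up padding.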
With 20 bays to work with, the most commonly recommended/recommendable layouts are:
10x 2-wide mirrors
6x 3-wide Z1 (with two bays open for auxiliary vdevs, spares, whatever)
5x 4-wide Z2
3x 6-wide Z2 (with two bays open)
2x 10-wide Z2
The above are ranked in order of decreasing performance.
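If it helps to see the efficiency trade-off in numbers, here’s a quick sketch tallying nominal usable capacity for those layouts. The 18 TB drive size is just an illustrative assumption, and this is raw stripe math that ignores padding overhead and reserved space:

```python
DRIVE_TB = 18  # hypothetical drive size, for illustration only

# (name, vdev count, vdev width, parity drives per vdev; a mirror "loses" 1)
layouts = [
    ("10x 2-wide mirrors", 10, 2, 1),
    ("6x 3-wide Z1",        6, 3, 1),
    ("5x 4-wide Z2",        5, 4, 2),
    ("3x 6-wide Z2",        3, 6, 2),
    ("2x 10-wide Z2",       2, 10, 2),
]

for name, vdevs, width, parity in layouts:
    data = vdevs * (width - parity)   # drives contributing usable space
    total = vdevs * width             # drives consumed by the layout
    print(f"{name}: {total} drives, ~{data * DRIVE_TB} TB usable "
          f"({data / total:.0%} efficiency)")
```

Note how the 2x 10-wide Z2 tops the usable-space list (16 data drives) while sitting at the bottom of the performance ranking, with only two vdevs’ worth of IOPS.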
I do not recommend ever going wider than three disks on a Z1 vdev; that’s simply not sufficient redundancy for the number of points of failure present.
@mercenary_sysadmin Thanks for that. You summarized in a few paragraphs something that I still didn’t completely understand the first go around after watching about half a dozen YouTube videos.
I went with the all-mirrors setup, despite the 50 percent storage efficiency. That’s still more than enough space for my needs.
One thing I didn’t consider when I first started is how long it takes, given the size of modern HDDs and their relatively slow speeds, to resilver a vdev when a disk needs to be replaced (that is, to sync everything up to the new disk so the vdev is back to the full level of redundancy that was chosen).
For example, on a traditional ext4 setup, I once had to replace a drive in a single mirror of 2x 14 TB disks. It took 24 hours to rebuild, and during that time the entire mirror was vulnerable to failure: I’d have lost everything if the second disk failed before the rebuild was done.
In a weird way, I’m kind of glad that disk failed when it did. It’d have been a lot less pleasant to learn that lesson with an oversized Z1 or something.
Mirrors resilver faster than any other type of ZFS vdev because they just need to copy all the data from one single disk to the new disk, which decreases the time the data in the mirror is vulnerable. Resilvering itself is an intense operation, so it arguably can put some disks at higher risk of failing – which you don’t want to have happen during resilvering. So, quicker is better.
(RAIDZ1/2/3 resilvers have to do a lot more parity calculation, so they can take much longer and are usually more compute-intensive and harder on the disks.)
An advantage of ZFS is that because it manages both the filesystem and the redundancy, it knows exactly which blocks need to be copied. Time saved depends on how full the pool is. Additionally, part of what its scrubs accomplish is to increase your confidence that something unreadable on a disk is not secretly waiting to impede your ability to restore a pool’s redundancy. Not that you needed persuading! But hopefully your resilvering experience is better with ZFS.
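For a rough feel of the numbers: a mirror resilver is essentially one copy of the used data at the disks’ sustained rate, so you can back-of-envelope it. The 160 MB/s figure below is just an illustrative sustained HDD rate, not a measurement:

```python
def mirror_resilver_hours(used_tb: float, mb_per_s: float) -> float:
    """Rough lower bound on mirror resilver time: one sequential-ish
    copy of the used data at the disks' sustained transfer rate.
    Ignores seek overhead from fragmentation and concurrent pool load."""
    return used_tb * 1e12 / (mb_per_s * 1e6) / 3600

# A full 14 TB disk at a sustained 160 MB/s: about a day
print(round(mirror_resilver_hours(14, 160), 1))  # ~24.3 hours
# A half-full ZFS mirror only copies what's used: roughly half that
print(round(mirror_resilver_hours(7, 160), 1))
```

This lines up with the 24-hour full-disk rebuild mentioned above, and shows why ZFS copying only the used blocks matters most on emptier pools.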
Anyway, when pondering pool geometry, if I can’t have a mirror pool, I tend to re-read ZFS RAIDZ stripe width, or: How I Learned to Stop Worrying and Love RAIDZ, remembering that it’s from 2014, though. Like Jim, I would not use raidz1 with more than three disks at today’s sizes, let alone five (maybe not even yesterday’s sizes; remembering that Xserve RAID, shudder), even with 512-byte sectors, tiny recordsizes, and no compression.
While I’m remembering things, I’ll also say I personally avoid making widths a multiple of the raidz level plus one (so not, for example, a 6-wide Z2), having once spent too much time trying to figure out why I seemed to have a hot spot on every fourth disk. That said, if that were actually worth worrying about, I imagine I would have seen someone else mention it, and I never have.
“…An advantage of ZFS is that because it manages both the filesystem and the redundancy, it knows exactly which blocks need to be copied. Time saved depends on how full the pool is…”
Meaning that on a traditional RAID mirror, the filesystem layered on top doesn’t matter, and the rebuild ran at a good 160 MB/s regardless. I just replaced a 16 TB HDD in a (working) 4x6 raidz2 pool that was 32% used, and it took 32 hours. So the advantage of “only what’s used gets resilvered” really only pays off when the pool is fairly empty; otherwise, don’t be surprised. Mirrored vdevs do better at resilvering, and much better in daily usage, than raidz ones.
Found this post when I googled for information about resilvering a pool made of several raidz1 vdevs, and why all the drives in the pool seemed to be scanning instead of just the 2 of 3 healthy drives in the raidz1 vdev where 1 of 3 drives was replaced.
I thought my pool of raidz1 vdevs was like a stripe of single disks, so why does ZFS need to scan the other disks in the stripe to fix the degraded raidz1? Can someone explain that ZFS logic?
I’m not sure I understand the question. I see two degraded vdevs in that pool, which means it needs to light up six drives (four read, two write) during this resilver.
What makes you think it’s lighting up the other three vdevs as part of the resilvering process? I only see significant activity on the two vdevs that are degraded:
Out of 1.44K pool read operations, 750 happen on raidz1-2 and 711 happen on raidz1-4. Out of 500 pool write operations, 191 happen on raidz1-2 and 242 happen on raidz1-4.
And since raidz1-2 and raidz1-4 are your two degraded vdevs, this lines up with expectations: the only significant read or write operations going on during the moment in time you captured are happening on the degraded vdevs being resilvered.
You understood my question correctly, and it seems you are right. Thanks.
Both of my vdevs will be resilvered soon.
scan: resilver in progress since Fri Oct 18 00:06:21 2024
61.7T / 61.8T scanned, 19.9T / 21.5T issued at 434M/s
6.65T resilvered, 92.51% done, 01:04:49 to go
This is true when you are considering IOPS, but not when you are considering throughput, i.e. GB/s.
IOPS is dependent upon the number of vdevs, but throughput (and especially write throughput) is dependent upon the number of data drives, excluding redundancy.
And generally speaking, IOPS is important when you are doing small I/Os i.e. reading and writing 4KB blocks because you are doing random database accesses or using virtual disks, but for sequential files throughput is the more important metric.
Furthermore, if you are doing 4KB reads and writes, then even more importantly than IOPS you need to avoid read amplification and even more importantly write amplification (where you want to read 4KB but end up reading 48KB, or you want to write 4KB and end up reading 48KB and then writing 48KB). And this means using mirrors rather than RAIDZ.
And of course if you are doing small random 4KB writes, your use case probably also needs synchronous writes - so you either need your data on SSD or you need an SSD SLOG.
By comparison, if you are doing sequential reads and writes, your files are typically going to be big enough to span multiple 4KB x RAIDZ width, and so you can read or write much more data in a single IOP which is much more efficient use of the disk bandwidth. And of course you also get sequential pre-fetch.
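The amplification arithmetic above can be sketched out. This is a simplified model of one random sub-record write under copy-on-write (the whole record is read, modified, and rewritten); it ignores ARC cache hits, compression, and metadata traffic, so the figures are illustrative rather than measured:

```python
def rmw_traffic_kb(io_kb: float, recordsize_kb: float):
    """Approximate (read KB, write KB) for one random write on a ZFS
    dataset. Copy-on-write checksums whole records, so a write smaller
    than the record triggers a full-record read plus a full-record write."""
    if io_kb >= recordsize_kb:
        return 0.0, io_kb  # aligned full-record overwrite: no read needed
    return recordsize_kb, recordsize_kb

# A 4 KiB write into a 48 KiB record: 48 KiB read, then 48 KiB written,
# i.e. 12x read amplification and 12x write amplification
print(rmw_traffic_kb(4, 48))
# Matching recordsize to the I/O size avoids the read entirely
print(rmw_traffic_kb(4, 4))
```

This is why small-random-I/O workloads tend to want mirrors plus a small recordsize/volblocksize, while big sequential workloads can happily span wide RAIDZ stripes.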
Then you need to consider space efficiency - RAIDZ is more space efficient and thus more cost effective than mirrors, and a 6-wide RAIDZ2 is better redundancy than 2x 3-wide RAIDZ1, and 12-wide RAIDZ2 has both better redundancy and greater storage efficiency than 4x 3-wide RAIDZ1.
Thus your pool design is NOT as simple as more vDevs = better. For most common use cases, you need to start from your use case, and follow some simple rules of thumb to create your pool design.
Whilst this is a true statement, it is IMO extremely misleading. If you want double redundancy, then a 4KB file will always occupy 12KB; on a 3-way mirror, this is still true. However, let’s compare the storage actually used for a 48KB file on various pool layouts with double redundancy, each of which has the same number of data disks. I picked 12 data drives because 12 has many factors and so allows many examples:
36x disks in mirror triples - 4KB data = 12KB storage, 48KB data = 144KB
24x disks in 6x 4-wide RAIDZ2 - 4KB data = 12KB storage, 48KB data = 96KB
20x disks in 4x 5-wide RAIDZ2 - 4KB data = 12KB storage, 48KB data = 80KB
18x disks in 3x 6-wide RAIDZ2 - 4KB data = 12KB storage, 48KB data = 72KB
16x disks in 2x 8-wide RAIDZ2 - 4KB data = 12KB storage, 48KB data = 64KB
14x disks in 1x 14-wide RAIDZ2 - 4KB data = 12KB storage, 48KB data = 56KB
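The list above follows from simple stripe arithmetic, which a short sketch can reproduce. Caveat on my part: this is the idealized ratio only; real raidz also pads each allocation up to a multiple of (parity + 1) sectors, which nudges some of these layouts (e.g. the 5-wide and 8-wide Z2) a bit higher than the on-paper figure:

```python
def ideal_raidz_kb(data_kb: float, width: int, parity: int) -> float:
    """Idealized raidz storage for data_kb: every (width - parity) data
    sectors carry `parity` parity sectors. Ignores allocation padding,
    compression, and metadata."""
    return data_kb * width / (width - parity)

print(ideal_raidz_kb(48, 4, 2))   # 4-wide Z2  -> 96.0 KB
print(ideal_raidz_kb(48, 6, 2))   # 6-wide Z2  -> 72.0 KB
print(ideal_raidz_kb(48, 14, 2))  # 14-wide Z2 -> 56.0 KB
print(48 * 3)                     # 3-way mirror -> 144 KB
```

The wider the stripe, the closer the multiplier gets to 1, which is the whole storage-efficiency argument for wide RAIDZ2 in one line.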
If storage efficiency and cost are not important, then even for larger sequential files mirrors will probably give you the best throughput: 3-way mirrors get you 3x the read throughput of RAIDZ and equal write throughput, but at a very high cost per TB. This is because a read from a mirror only needs to touch one disk (unless there is a checksum failure), so reads can be spread among the disks in a mirror vdev, whereas a read on a RAIDZ vdev requires data to be read from all the data drives in the stripe.
But once you use RAIDZ, throughput is based on the number of data drives excluding redundancy, so for all the above layouts (which have the same number of data drives and the same usable space), throughput will be roughly the same.
And unless you’ve got an extremely specialized and tightly controlled workload–such as a limited process data ingest–that can be handled sequentially with large block operations only, you’re going to hit IOPS bottlenecks before throughput bottlenecks pretty much every time.
The only real exception–again, apart from very specialized workloads–is when you’ve got extremely fast drives and a much slower controller in front of them (single-lane HBA, or–worse–slow NIC).
With respect, this is wharrgarbl. You cannot casually assert an explicit performance relationship between different topologies without so much as hinting at the workload in question, or even the number or width of the vdevs involved.