Please comment / kibitz on my zpool design choices

This is my first really large physical pool (~50TB raw). I finally fulfilled a bucket-list item and bought a SAS shelf on eBay for ~$270 with shipping. (Link available if you're interested.)

zpool list

SIZE ALLOC FREE
47.3T 18.6T 28.8T

zfs list

USED AVAIL
13.2T 20.3T

15-bay 3.5-inch SAS disk shelf connected to an external-port HBA in IT mode (basic LSI 2008, IIRC), populated with 4TB SAS (plus a few SATA) drives, all used / older disks out of warranty.

I measured it with an inexpensive Kill A Watt workalike and it's only drawing ~140 watts with 14 drives spinning and one of its two power supplies connected, fed from a dedicated UPS.

Pool layout:

draid2:7d:14c:1s-0

https://openzfs.github.io/openzfs-docs/Basic%20Concepts/dRAID%20Howto.html
[[
  • data - The number of data devices per redundancy group. In general a smaller value of D will increase IOPS, improve the compression ratio, and speed up resilvering at the expense of total usable capacity. Defaults to 8, unless N-P-S is less than 8.
]]

RAIDZ2-level protection with 7 data disks per redundancy group (x2, 14 disks total) and 1 distributed (virtual) spare; no physical spare yet - I'm waiting on a disk shipment after a used SAS drive failed SMART.
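For reference, re-creating this layout from scratch would look something like the sketch below; the pool name and device paths are placeholders, not my actual ones (bash brace expansion supplies the 14 disks):

  zpool create -o ashift=12 tank \
      draid2:7d:14c:1s \
      /dev/disk/by-id/scsi-DISK{01..14}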

I had a choice between 2x 7-drive RAIDZ2 and dRAID; the non-dRAID layout would have limited my capacity a bit more (rough numbers below). I have provision for multiple spares outside the shelf via external 5-bay enclosures. I considered upgrading all drives in place at some point in the future, but discarded that as impractical; I can build a smaller array with larger drives later on and get around the same capacity (6x12TB or even 6x16TB when prices come down). This array isn't even at 60% capacity so far, with pretty much everything in the house backed up (multiple bare-metal backups of OS X / macOS, Linux, desktops, laptops, VMs, music, movies, Blu-rays, ISOs, etc.). In fact I may run czkawka and find out whether I have any duplicate files left over.
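Back-of-envelope numbers, treating each 4TB disk as ~3.64TiB and ignoring metadata/padding overhead (so only approximate):

  14 x 4TB           : ~14 x 3.64TiB = ~50.9TiB raw
  draid2:7d:14c:1s   : 50.9 - 3.64 (distributed spare) = ~47.3TiB,
                       usable ~ 47.3 x 7/9 = ~36.8TiB
  2x 7-wide RAIDZ2   : usable ~ 50.9 x 5/7 = ~36.4TiB, with no spare capacity at all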

In informal testing, this pool sustained (4) disk failures with no data loss. With (5) failures the pool became unrecoverable(!), even after a reboot and messing around with zpool import tricks.

I confess that I have an imperfect understanding of ZFS DRAID despite some fairly involved experiments in a VM when it came out. Hopefully working with this setup will help.

I already had a Samsung 860 Pro hooked up to a 4-bay (2.5-inch, in-case) enclosure on a separate SAS HBA, so I bought a Dell Enterprise 800GB SAS SSD so I could mirror the special + log devs and experiment with cache devs. (To my surprise, cache devices apparently cannot be mirrored.) I ended up adding cache devs (partitions over 2 separate disks) until the pool had ~120GB of cache across 3 devs/partitions.
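Roughly how the support vdevs were attached; the partition paths below are placeholders for my actual device names:

  zpool add tank special mirror /dev/disk/by-id/ata-SSD1-part1 /dev/disk/by-id/scsi-SSD2-part1
  zpool add tank log     mirror /dev/disk/by-id/ata-SSD1-part2 /dev/disk/by-id/scsi-SSD2-part2
  # cache (L2ARC) devices cannot be mirrored, so they are just listed individually:
  zpool add tank cache /dev/disk/by-id/ata-SSD1-part3 /dev/disk/by-id/ata-SSD1-part4 /dev/disk/by-id/scsi-SSD2-part3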

I did set 'sync=always' on a couple of datasets (VirtualBox and a Samba share), so the log / ZIL devs are getting some I/O. This greatly limits throughput on those datasets, to under ~150MB/sec sustained.
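The dataset names below are placeholders, but the setting itself is just:

  zfs set sync=always tank/virtualbox
  zfs set sync=always tank/samba-share
  zfs get sync tank/virtualbox tank/samba-share   # confirm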

Limited budget means partitions on the SSDs instead of dedicated devices for now, but I plan to get another Dell Enterprise 800GB SSD in the near future.

This is not a “production” pool, but it is the tertiary “dumping ground” backup for all of my other pools in the homelab. If it dies, it's no great loss; it can be rebuilt just by copying the data over again. This setup is basically in lieu of having a tape drive, as I had multiple 4TB drives available and bought quite a few more 4TB SAS drives on eBay. The SATA drives are supposed to be replaced with SAS over time.

The shelf will probably be powered on only 1-2 weekends a month rather than running 24/7, to try to save on the power bill.

All disks have undergone burn-in testing: a full dd zero-write pass plus a SMART long self-test.
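Per disk, the burn-in amounted to something like this (placeholder device path; the dd pass destroys everything on the disk):

  dd if=/dev/zero of=/dev/sdX bs=1M status=progress   # full zero-write pass
  smartctl -t long /dev/sdX                           # start the SMART extended self-test
  smartctl -a /dev/sdX                                # check the results once it finishes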

Throughput on scrub is ~1.4GiB/sec sustained; a scrub of the roughly half-full pool completed in almost exactly (4) hours.

Sustained throughput copying from another 6x4TB RAIDZ2 (basic, no special devs) with Midnight Commander is ~440MB/sec (observed).

7.00TiB 5:42:10 [ 357MiB/s]
real 342m12.946s
^ This is with ZFS send + a “pv” buffer; Midnight Commander usually shows higher I/O.
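The send pipeline was essentially the shape below; snapshot and dataset names are placeholders and I won't swear to the exact flags:

  zfs snapshot -r oldpool/backup@xfer
  zfs send -R oldpool/backup@xfer | pv | zfs receive -u tank/backup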

Copying from SSD is about the same; it starts off higher thanks to the cache but levels out over time.

Default compression is (IIRC) set to zstd-3, which now has early-abort but may still affect throughput slightly; this is only running on a Core i7 with 8 CPUs and 32GB RAM.
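If anyone wants to check or reproduce the compression setup, it is along these lines (pool name is a placeholder):

  zfs set compression=zstd-3 tank   # inherited by child datasets
  zfs get compressratio tank        # see what it is actually saving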

I do monitor 'zpool iostat' constantly, and some of the individual-disk I/O figures drop every interval instead of staying maxed out. I'm wondering whether upgrading my HBA would improve things - the shelf is connected with a single SFF-8088 cable. The SAS drives are typically 512-byte-sector while the SATA drives are 4K-sector; the pool is ashift=12.
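What I am watching is roughly the following (pool name is a placeholder):

  zpool iostat -v tank 5   # per-vdev / per-disk throughput, 5-second intervals
  zpool iostat -l tank 5   # adds average latency columns, useful for spotting a slow disk or a saturated link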

Comments, advice, and questions are welcome. Max I/O is not greatly important, since this is all connected over 2.5Gbit Ethernet (and mostly 1Gbit for the majority of the house), but resilver I/O could turn out to be important. I may spring for a new Toshiba N300 6TB (or larger) for the spare. I could also cannibalize other arrays for a spare, but I plan to have a few extra drives lying around if needed.
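When the replacement disk shows up, the plan is something like this (placeholder device names):

  zpool add tank spare /dev/disk/by-id/scsi-NEWDISK                                  # add as a hot spare
  zpool replace tank /dev/disk/by-id/scsi-FAILEDDISK /dev/disk/by-id/scsi-NEWDISK    # or swap out a failed member directly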

Draid is not a good idea at your scale.

[[
In informal testing, this pool sustained (4) disk failures with no data loss. With (5) failures the pool became unrecoverable(!), even after a reboot and messing around with zpool import tricks.
]]

That’s because you had very little data on the pool. Once the pool has any significant amount of data on it, you will lose the entire pool after any third disk failure (just as you would if the whole thing was a single Z2 vdev).

You give up quite a lot when you go DRAID: expandability, dynamic blocksize, and more. I wouldn’t recommend even considering it until the sixty disk level or so… And keep in mind, it was designed for 96 disk systems.

Finally, there is ABSOLUTELY no point in DRAID without spare capacity (the :0 at the end of your DRAID topo) configured. The entire point of DRAID is minimizing the vulnerability window after a disk failure by resilvering onto spare capacity. If you aren’t using spares and want high capacity, you should just be running a pair of Z2 vdevs instead, for both better performance and improved survivability.
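For comparison, the pair of Z2 vdevs I'm suggesting would look something like this (placeholder pool and device names, same 14 disks):

  zpool create -o ashift=12 tank \
      raidz2 /dev/disk/by-id/scsi-DISK{01..07} \
      raidz2 /dev/disk/by-id/scsi-DISK{08..14}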