How does RAID-Z scale with large numbers of NVMe drives?

The usual rule of thumb I’ve heard is that RAID-Z scales streaming writes at the speed of all drives, but has IOPS equal to a single drive. Is that still the case with NVMe? If I built something fairly silly, like a 12-drive RAIDZ1 vdev, would it still only have the IOPS of one drive, or does the wild speed of NVMe mean that it starts performing better than the single-drive behavior of the old rust drives?
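
For concreteness, here’s the back-of-envelope version of that rule of thumb as I understand it (a rough Python sketch; the per-drive numbers are made-up placeholders, not benchmarks):

```python
# Back-of-envelope for the rule of thumb above: RAID-Z streams at roughly
# the combined speed of its data drives, but random IOPS stay near a
# single drive's, because every record touches the whole vdev.
# The per-drive figures below are made-up placeholders, not measurements.

drives = 12                      # hypothetical 12-wide RAIDZ1
parity = 1
per_drive_stream_mbps = 3000     # placeholder: one NVMe drive, sequential MB/s
per_drive_iops = 500_000         # placeholder: one NVMe drive, 4K random IOPS

streaming = per_drive_stream_mbps * (drives - parity)   # scales with data drives
random_iops = per_drive_iops * 1                        # stays ~one drive per vdev

print(f"~{streaming} MB/s streaming, ~{random_iops} IOPS (rule of thumb)")
```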

You will be INCREDIBLY disappointed.

Word to the wise: very few real-world workloads get the kind of simple throughput scaling you’re expecting either, on any kind of underlying drives.

Thanks, Jim. That’s about what I expected. I don’t even know if they’ll get up to 12 drives, but I’ll probably focus my testing on the difference between 6x2 ZFS mirrors and xfs-on-mdadm, rather than anything RAID-Z.
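
For anyone following along, the comparison will probably look something like this rough sketch: the same fio job run against each mountpoint, with the paths and job parameters below being placeholders rather than a final test plan:

```python
# Rough sketch of the planned A/B test: run the same fio job against a
# dataset on the ZFS striped-mirror pool and against the xfs-on-mdadm
# array, then compare IOPS. Paths and job parameters are placeholders.
import json
import subprocess

TARGETS = {
    "zfs-mirrors": "/tank/fio",       # hypothetical ZFS dataset mountpoint
    "xfs-mdadm": "/mnt/md0/fio",      # hypothetical xfs-on-mdadm mountpoint
}

def run_fio(directory: str) -> float:
    """Run a parallel 4K random-read job and return aggregate read IOPS."""
    out = subprocess.run(
        [
            "fio", "--name=randread", f"--directory={directory}",
            "--rw=randread", "--bs=4k", "--iodepth=32", "--numjobs=8",
            "--direct=1", "--size=4g", "--runtime=60", "--time_based",
            "--group_reporting", "--output-format=json",
        ],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)["jobs"][0]["read"]["iops"]

for name, path in TARGETS.items():
    print(f"{name}: {run_fio(path):,.0f} read IOPS")
```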

That sounds pretty reasonable IMO. You’re going to want striped mirrors either way, if the goal is maximum performance.
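
To make “striped mirrors” concrete, here’s a minimal sketch that just assembles the zpool create arguments for pairing twelve hypothetical NVMe namespaces into six two-way mirror vdevs (pool and device names are placeholders):

```python
# Minimal sketch: pair up twelve hypothetical NVMe namespaces into six
# two-way mirror vdevs (a "6x2 striped mirror" pool). This only builds and
# prints the zpool create command; pool and device names are placeholders.
devices = [f"/dev/nvme{i}n1" for i in range(12)]

cmd = ["zpool", "create", "-o", "ashift=12", "tank"]
for a, b in zip(devices[0::2], devices[1::2]):
    cmd += ["mirror", a, b]      # each "mirror dev dev" group is one vdev

print(" ".join(cmd))
# zpool create -o ashift=12 tank mirror /dev/nvme0n1 /dev/nvme1n1 mirror ...
```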

Typically, you won’t see much throughput scaling with any topology on NVMe, because NVMe can usually saturate the CPU’s ability to handle the data even on a single drive (assuming we’re talking non-garbage NVMe, and you’ve got enough PCIe lanes to get the most out of the drive).

Where you can get scaling out of multiple NVMe drives is IOPS on very small I/O. That requires multiple vdevs and a very parallel workload, and again assumes you’re not already saturating the CPU on throughput; but if you meet those assumptions, you can get improved performance by issuing your painfully small I/O requests across multiple devices.
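
As a rough illustration of what “very parallel” means here, this sketch fires 4K random reads at a test file from one worker and then from several. It’s not a real benchmark (buffered I/O, so the page cache will flatter it; fio with direct=1 is the right tool), and the file path is a placeholder:

```python
# Illustration only: issue 4K random reads against a test file with one
# worker vs. many, to show the kind of parallelism IOPS scaling needs.
# Not a real benchmark (buffered I/O, so the page cache will flatter it);
# use fio with direct=1 for real numbers. The path is a placeholder.
import os, random, time
from concurrent.futures import ThreadPoolExecutor

PATH = "/tank/fio/testfile"      # hypothetical pre-created test file (>> 4 KiB)
BLOCK = 4096
READS_PER_WORKER = 50_000

def worker() -> None:
    fd = os.open(PATH, os.O_RDONLY)
    size = os.fstat(fd).st_size
    try:
        for _ in range(READS_PER_WORKER):
            offset = random.randrange(0, size - BLOCK) // BLOCK * BLOCK
            os.pread(fd, BLOCK, offset)
    finally:
        os.close(fd)

for workers in (1, 8):
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(worker) for _ in range(workers)]
        for f in futures:
            f.result()           # propagate any errors from the workers
    elapsed = time.monotonic() - start
    print(f"{workers} workers: {workers * READS_PER_WORKER / elapsed:,.0f} reads/s")
```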

You’re also typically going to need a massively parallel workload to get the most out of throughput, because of that CPU bottlenecking again: if all the requests happen in a single thread, you bottleneck on the CPU’s per-thread performance, whereas with a heavily parallel workload you only bottleneck on the CPU’s aggregate multi-threaded performance.
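
A crude way to see that per-thread ceiling without the drives involved at all: hash the same amount of data with one worker and then with several, as a stand-in for the checksum/compression work ZFS does per block. Sizes here are arbitrary, and this is an analogy, not a storage benchmark:

```python
# Crude analogy for the per-thread CPU ceiling: hash the same amount of
# data with 1 worker vs. several. ZFS has to checksum (and maybe compress)
# every block, so a single-threaded request stream hits this kind of wall.
# Sizes are arbitrary; this is not a storage benchmark.
import hashlib, os, time
from multiprocessing import Pool

CHUNK = os.urandom(16 * 1024 * 1024)   # 16 MiB of junk to hash repeatedly
CHUNKS_TOTAL = 64                       # ~1 GiB of hashing work overall

def hash_chunks(n: int) -> None:
    for _ in range(n):
        hashlib.sha256(CHUNK).digest()

if __name__ == "__main__":
    for workers in (1, 8):
        start = time.monotonic()
        with Pool(workers) as pool:
            pool.map(hash_chunks, [CHUNKS_TOTAL // workers] * workers)
        elapsed = time.monotonic() - start
        gib = CHUNKS_TOTAL * len(CHUNK) / 2**30
        print(f"{workers} workers: {gib / elapsed:.2f} GiB/s hashed")
```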

At these rarefied elevations, parallelization also becomes important for network performance: at 10Gbps and up, a single CPU thread can’t keep up with your NIC either, so you need multiple TCP streams in order to engage multiple CPU threads and get as close as possible to your network’s theoretical throughput cap.
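
Same idea on the wire: a hedged sketch of pushing data over several TCP connections at once instead of one, so the send-side work spreads across CPU threads. Host, port, and stream count are placeholders, and iperf3’s parallel-streams option will tell you the same story with less typing:

```python
# Sketch: blast data at a receiver over N parallel TCP connections instead
# of one, so the send-side work spreads across CPU threads. Host, port,
# and stream count are placeholders; a real receiver must be listening.
import socket, threading, time

HOST, PORT = "192.0.2.10", 5201     # placeholder receiver (TEST-NET address)
STREAMS = 8
SECONDS = 10
BUF = b"\0" * (1 << 20)             # 1 MiB send buffer

sent = [0] * STREAMS                # bytes sent per stream (one slot per thread)

def stream(idx: int) -> None:
    with socket.create_connection((HOST, PORT)) as sock:
        deadline = time.monotonic() + SECONDS
        while time.monotonic() < deadline:
            sock.sendall(BUF)
            sent[idx] += len(BUF)

threads = [threading.Thread(target=stream, args=(i,)) for i in range(STREAMS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"~{sum(sent) / 2**30 / SECONDS:.2f} GiB/s across {STREAMS} streams")
```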