How does RAID-Z scale with large numbers of NVMe drives?

The usual rule of thumb I’ve heard is that RAID-Z scales streaming writes at the speed of all drives, but has IOPS equal to a single drive. Is that still the case with NVMe? If I built something fairly silly, like a 12-drive RAIDZ1 vdev, would it still only have the IOPS of one drive, or does the wild speed of NVMe mean that it starts performing better than the single-drive behavior of the old spinning-rust drives?
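Just to spell out the rule of thumb I’m working from, here’s the back-of-the-envelope arithmetic. The per-drive numbers below are made up for illustration, not specs of any real drive:

```python
# Back-of-the-envelope sketch of the rule of thumb above; the per-drive
# numbers are hypothetical, not measurements of any particular NVMe drive.
DRIVES = 12               # 12-wide RAIDZ1 vdev
PARITY = 1                # RAIDZ1: one drive's worth of parity
PER_DRIVE_MBPS = 3000     # hypothetical sequential throughput per drive
PER_DRIVE_IOPS = 500_000  # hypothetical 4K random IOPS per drive

# Streaming writes: roughly scale with the data drives in the vdev...
streaming_mbps = (DRIVES - PARITY) * PER_DRIVE_MBPS

# ...but random IOPS: roughly a single drive's worth, per vdev.
random_iops = PER_DRIVE_IOPS

print(f"naive streaming estimate: ~{streaming_mbps} MB/s")
print(f"naive IOPS estimate:      ~{random_iops} IOPS (one drive's worth)")
```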

You will be INCREDIBLY disappointed.

Word to the wise: very few real-world workloads get the kind of simple throughput scaling you’re expecting either, on any kind of underlying drives.

Thanks, Jim. That’s about what I expected. I don’t even know if they’ll get up to 12 drives, but I’ll probably focus my testing on the difference between 6x2 ZFS mirrors and xfs-on-mdadm, rather than anything RAID-Z.


That sounds pretty reasonable IMO. You’re going to want striped mirrors either way, if the goal is maximum performance.
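For reference, the 6x2 striped-mirror layout looks something like this. It’s a rough sketch with a placeholder pool name and placeholder device paths; swap in your real hardware before running anything:

```python
# Rough sketch: build the zpool create command for a 6x2 striped-mirror layout.
# Pool name and device paths are placeholders; adjust for your actual hardware.
import subprocess

drives = [f"/dev/nvme{i}n1" for i in range(12)]   # hypothetical device paths

cmd = ["zpool", "create", "fastpool"]             # hypothetical pool name
for a, b in zip(drives[0::2], drives[1::2]):
    cmd += ["mirror", a, b]                       # each pair becomes one mirror vdev

print(" ".join(cmd))                              # review before actually creating
# subprocess.run(cmd, check=True)                 # uncomment to create the pool
```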

Typically, you won’t see much throughput scaling with any topology on NVMe, because NVMe can usually saturate the CPU’s ability to handle the data even on a single drive (assuming we’re talking non-garbage NVMe, and you’ve got enough PCIe lanes to get the most out of the drive).

Where you can get scaling out of multiple NVMe drives is IOPS on the very low end. That requires multiple vdevs and a very parallel workload, and again assumes you’re not already saturating the CPU on throughput, but if you meet those assumptions you can get improved performance out of issuing your painfully small I/O requests across multiple devices.
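If you want to measure that yourself, something along these lines is a reasonable starting point: a rough fio sketch, with placeholder paths, sizes, and job counts. Note that direct=1 may be a no-op on OpenZFS releases without Direct IO support:

```python
# Rough sketch: a parallel 4K random-read fio run against a dataset on the pool.
# Directory, size, and job counts are placeholders; tune for your own hardware.
# Note: --direct=1 may be ignored on OpenZFS versions without Direct IO support.
import subprocess

fio_cmd = [
    "fio",
    "--name=randread-parallel",
    "--directory=/testpool/fio",   # hypothetical dataset mountpoint
    "--rw=randread",
    "--bs=4k",                     # the painfully small I/O discussed above
    "--ioengine=libaio",
    "--direct=1",
    "--iodepth=32",
    "--numjobs=8",                 # parallelism is what lets multiple vdevs help
    "--size=4G",
    "--runtime=60",
    "--time_based",
    "--group_reporting",
]

subprocess.run(fio_cmd, check=True)
```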

You’re also going to need a massively parallel workload to get the most out of throughput, typically, because remember that CPU bottlenecking? If all the requests happen in a single thread, you bottleneck on per-thread CPU performance, whereas with a heavily parallel workload, you at least get to bottleneck on multi-threaded CPU performance instead.

At these rarified elevations, parallelization also becomes important for network performance: with 10Gbps and up, a single CPU thread can’t keep up with your NIC either, so you need multiple TCP streams in order to engage multiple CPU threads and get as much as possible out of your network’s theoretical throughput cap.
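A quick way to see that difference on the network side is comparing single-stream to multi-stream throughput. A rough sketch using iperf3, with a placeholder hostname (you’d need `iperf3 -s` running on the far end first):

```python
# Rough sketch: compare single-stream vs multi-stream TCP throughput with iperf3.
# The target hostname is a placeholder; start `iperf3 -s` on the server first.
import subprocess

TARGET = "nas.example.com"   # hypothetical iperf3 server

# One TCP stream: at 10Gbps+ this typically bottlenecks on a single CPU thread.
subprocess.run(["iperf3", "-c", TARGET, "-t", "30"], check=True)

# Eight parallel streams: spreads the work across multiple CPU threads.
subprocess.run(["iperf3", "-c", TARGET, "-t", "30", "-P", "8"], check=True)
```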

What is the future of ZFS in terms of NVMe performance? HDDs are clearly on their way to being a legacy technology for most application use cases.

Are there any ZFS roadmap goals for better NVMe compatibility?

My understanding is that ZFS is not great with NVMe because there are so many fancy steps between the data and the disk (checksums, etc.) that do not work well with very low latency NVMe drives. Is this fundamental, or will this change in the future?

Or will BTRFS just be the modern ZFS replacement in 10 years?

ZFS has some trouble keeping up with NVMe (on workloads which NVMe can produce its very highest throughput) because NVMe works radically differently than the drives ZFS was originally designed for. Yes, efforts are underway to address the issue.

With that said… very few real-world workloads actually produce that extremely high NVMe throughput in the first place. It shows up real nice on easy-peasy top-end benchmarks that make the big numbers consumers love to see, not so much on most of the real-world workloads those same consumers are experiencing 90+% of the time.


I am not as deep into the ZFS world as I would like to be. Is there a specific ZFS feature for NVMe on the roadmap, or is it more of a collection of little tweaks?

Can you point me to a GitHub issue or name the new NVMe feature? What is the expected timeline for ZFS to be truly NVMe ready?

Do you think it is more realistic for BTRFS to progress and become stable, or for ZFS to get great NVMe support? Which will happen first?

I can’t provide you with anything that specific. I know that Allan Jude and his company Klara have specifically been working on NVMe performance issues, and I’m pretty sure they’re not the only ones (out of the active OpenZFS dev community) doing so, but I have not looked into specifics or asked for any timelines.

I don’t think btrfs is going to get significantly more stable. Its dev community just doesn’t seem to care much about stability in multiple-drive systems. Btrfs seems to be pretty solid now as a single-drive filesystem, but once you’re looking at an array, it’s in my opinion not much different (in terms of the actual experience, not necessarily the code) than it was ten years ago.

At this point, I suspect bcachefs is a better hope for an in-kernel ZFS alternative than btrfs is. But it’s still just a “hope” with no real timeline, and bcachefs is, itself, about a decade behind btrfs in terms of chronological dev time. So… fingers crossed, and all that, but I’m not placing any bets I’d have to pay up on unless you’re offering me pretty favorable odds.


Yes, I remember him giving a talk about NVMe performance on ZFS.

Thanks for the info!
