OpenZFS 2.3.0 Release--Direct I/O Question; Also I don't understand how distros package ZFS, apparently?

See: Release zfs-2.3.0 · openzfs/zfs · GitHub

One of the features is Direct I/O, which bypasses the ARC. Apparently, this gives big performance improvements for NVMe pools? Is there a recommended article explaining exactly how this works?

  1. I think I must not understand something, as I assumed RAM was always faster than NVMe storage. Maybe that’s not always true anymore?
  2. Alternatively, does the overhead of moving data in and out of the ARC really cancel out RAM’s speed advantage that hard?
  3. Proxmox is still on ZFS 2.2.6, so I assume this feature isn’t there yet. I’m guessing Direct I/O will have good implications for using NVMe mirrors for VM stores.

Though sometimes things get backported, which leads me to …

I’m really confused about ZFS feature packaging now. As an example, TrueNAS has had RAIDZ expansion for … at least part of Electric Eel 24.10’s life cycle (sometime in September 2024, according to Lawrence Systems). Was it back-ported for TrueNAS specifically, or in OpenZFS as a beta?

iXsystems helped develop RAIDZ expansion, so they made the decision to include it early.

I’m sure they have their own internal OpenZFS base, which is ahead of the main work tree.

I can only assume that Direct I/O would mostly be beneficial for a workload that doesn’t include reading the same blocks frequently, thereby negating any benefit that the ARC would give you.

I’ll admit I don’t know a whole lot about how that feature works, but that’s what makes sense to me.

Probably not something that the average person will be turning on.

1 Like

iXsystems (TrueNAS vendor) absolutely keeps their own tree, and they pull features out of OpenZFS master and put them in prod on a regular basis; they frequently don’t even wait for new features to hit release status first. Not my favorite habit of theirs.

1 Like

Cache hits in RAM will be faster than pulling data off the metal, even on fast NVMe. The thing is, NVMe is direct I/O itself. The pipeline between the CPU and the storage stack is essentially the shortest possible straight line, by comparison to earlier storage pipelines like SATA.

I am really not an expert in this layer of things, but my understanding is that if you make your storage operations simple enough, it can pretty much be straight from the CPU to the metal. But once you start adding in complexities like caching, you may wind up needing to consult RAM quite a lot in addition to the actual storage ops, which can mean introducing a lot of additional CPU cycles. Making it worse, those extra CPU cycles are still taking place on the same CPU thread: so now you’re bottlenecking not just on CPU, but on single-threaded CPU performance.

Again, to the best of my understanding, this is mostly an issue with single-threaded, throughput-bound operations (which is, in fact, where I see OpenZFS on NVMe bog down horribly by comparison to simpler filesystems like ext4). The heavier-duty and more parallel the workload, in my experience, the smaller the delta between OpenZFS and ext4 / UFS2 / whatever on NVMe drives.

1 Like

Yeah it’s kind of the reverse of distribution security fix backports. :grin:

Thanks for the context and additional info on this. :slight_smile:

So, since it’s not all operations, it would be something you’d enable per application/service/workload?

I’d think that most of the time the end user of a complex service/system like TrueNAS or Proxmox wouldn’t really know, or have a way of knowing, whether operations are bound by single-threaded throughput, so in those cases you’d be banking on sensible defaults or something in the docs along the lines of: “tune it like this if you’re running ZFS on an NVMe pool.” So, mostly this isn’t something for end users (especially home/small office/homelab end users) to overthink. Does that sound reasonable?

(Apparently, MariaDB has a configuration file setting to enable Direct I/O, but MariaDB seems to favor as little optimization as possible in favor of the server successfully starting even if you install it on a potato with delusions of grandeur. In the SOHO sphere, database server configuration seems to be the only place where expecting the user to make those sorts of changes is the norm. :stuck_out_tongue: )
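
If it’s the setting I’m thinking of, it’s InnoDB’s flush method rather than a server-wide switch; a minimal sketch of the relevant my.cnf lines (worth double-checking against the MariaDB docs):

    [mysqld]
    # Ask InnoDB to open its data files with O_DIRECT, skipping the OS page
    # cache (InnoDB keeps its own buffer pool, so double-caching is mostly waste)
    innodb_flush_method = O_DIRECT

On ZFS, of course, that request only becomes a true direct path once the pool is on OpenZFS 2.3.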

1 Like

I’d want to benchmark my use case with direct I/O vs without.

Based on my loosely following the OpenZFS GitHub discussion… My understanding is that Direct I/O becomes more useful as NVMe device count increases. The overhead of ARC processing is significant when you have a half-dozen super-fast storage devices that can potentially be accessed in parallel. With 2-3 cheap consumer-grade devices it might not matter as much.

It takes time to ping-pong data blocks across memory locations as they work their way through the I/O stack. In the olden days of spinning rust and SATA SSDs this overhead wasn’t significant. In the modern age of PCIe 4.0/5.0 NVMe devices with pseudo-SLC write cache it starts to matter. Direct I/O reduces the number of ping-pongs.

My $.02 FWIW. I’m too broke to own and play with 4GB/sec devices so it’s all hearsay…
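
For that kind of A/B test, fio with and without O_DIRECT is probably the quickest way to see whether it matters on your hardware; a rough sketch (the file path, sizes, and job counts below are made up, adjust to your pool):

    # Buffered reads, serviced through the ARC
    fio --name=buffered --filename=/tank/fio.test --size=16G --bs=128k \
        --rw=randread --ioengine=libaio --iodepth=16 --numjobs=4 \
        --direct=0 --runtime=60 --time_based --group_reporting

    # Same job with O_DIRECT, which a 2.3 pool can service via the Direct I/O path
    fio --name=direct --filename=/tank/fio.test --size=16G --bs=128k \
        --rw=randread --ioengine=libaio --iodepth=16 --numjobs=4 \
        --direct=1 --runtime=60 --time_based --group_reporting

Keeping --bs matched to the dataset’s recordsize matters here, since unaligned requests fall back to the buffered path.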

According to the discussion here:

in highly-concurrent workloads… the overhead in managing dynamic taskqs can become rather significant

Which suggests to me that the ZIO pipeline is over-optimizing the I/O scheduling and taking too long to pull the blocks off the disk, and that this wouldn’t be limited to single-threaded workloads. But this is out of my wheelhouse, so I may be misapprehending.

Also, this only has any effect if the application is reading the file with the O_DIRECT flag, which I’ve always thought of as the “I don’t need no stinkin’ OS caching” flag, but maybe it has a different meaning now in the age of faster storage. IIUC, O_DIRECT (when implemented in the filesystem) has the kernel copy the blocks off the disk directly into a userspace buffer, avoiding the memcpys and context switches needed to copy from the disk to kernelspace to userspace (and, as a result, the kernel doesn’t add the blocks to its disk cache). So I’m only used to seeing this flag used by programs that perform their own caching (or that know something won’t be accessed again and want to signal to the kernel not to cache it). As you mentioned, databases are an excellent example of an application that can cache more intelligently in userspace than relying on the kernel disk cache.
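
For anyone who hasn’t seen it from the application side, here’s a minimal sketch of an O_DIRECT read on Linux (error handling trimmed; the 4096-byte alignment is a stand-in for the real requirement, which is block-size alignment for the buffer, offset, and length):

    #define _GNU_SOURCE           /* O_DIRECT is a Linux extension */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        /* The buffer, offset, and length all have to be suitably aligned,
           or the read fails with EINVAL. */
        void *buf;
        if (posix_memalign(&buf, 4096, 1 << 20) != 0)
            return 1;

        /* Hypothetical file; the flag asks the kernel to skip the page cache
           and move data more or less straight into our buffer. */
        int fd = open("/tank/bigfile.bin", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        ssize_t n = pread(fd, buf, 1 << 20, 0);
        printf("read %zd bytes\n", n);

        close(fd);
        free(buf);
        return 0;
    }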

Generally speaking, yes. Although in practice, I suspect you’d find the smartest use cases in dedicated server type setups instead: isolate the workload a single system manages to only the stuff that direct io enhances, turn it on for that server, and as much of the workload as would NOT be helped significantly, move to another (possibly cheaper, possibly not NVMe) server.

The workload I know of that’s REALLY benefited by this is AI dataset ingestion, on massive hardware. You can use this to eke more performance out of an NVMe drive for a specific workload on a home machine, sure, but the question is, how many of those workloads do you have, and how many of them need the theoretical maximum throughput of your NVMe drive (which it likely can’t produce for long anyway)?

This is kind of a contentious statement. You can see the issue even on a single consumer M.2 NVMe drive. It’s just not much of a REALISTIC issue, since those drives generally can’t hit anything approximating their theoretical maximum throughput on anything but the kind of artificial workload that people use to generate big numbers and try to claim bragging rights.

In actual practice, it is not AT ALL uncommon to see even fairly humble prosumer SATA SSDs absolutely mop the floor with consumer M.2 drives claiming orders of magnitude higher throughput. For one thing, as mentioned, your workload is not the same workload used to generate the big numbers in either the M.2 drive’s marketing OR the SATA drive’s marketing! And for another, M.2 is absolutely notorious for thermal throttling, as well as the usual cache exhaustion issues, which comes as a deeply unpleasant surprise the first time you want to use the drive for a heavy workload for more than 10-15 consecutive seconds.

I’m not telling y’all “never ever buy M.2, it’s universally awful” but I am absolutely cautioning people to stop thinking that M.2 drives are automatically, or even usually, a massive cut above SATA drives on demanding workloads!

I thought about this thread when I stumbled across this USENIX paper (pg. 202 in the pdf).

Has me wondering if ZFS has an unintentional knack for triggering these “die-level collisions.” Some of the performance data in this paper is shockingly bad. This study seems right up your alley if you haven’t seen it already…

Has me wondering if all these devices that don’t seem to get any faster at ashift >= 13 vs. 9/12 might actually become faster with the right (wrong) write pattern.

1 Like

Yeah, they kinda tend to. There’s no way in hell that NVMe drives actually use 512B “sectors”, but in my experience the VAST majority of them perform the best–and the most consistently–with ashift=9.

1 Like

There’s no way in hell that NVMe drives actually use 512B “sectors”, but in my experience the VAST majority of them perform the best–and the most consistently–with ashift=9.

Well, that’s disturbing.

Proxmox defaults to ashift=12, and all the ZFS literature that comes up in web searches hammers on the dangers of write amplification with ashift=9.

I’ve never tried ashift=13, in part because I tend to buy less-bleeding-edge/older NVMe.

In retrospect, this is good. I was starting to feel too confident with ZFS, anyway. :stuck_out_tongue:

Please don’t tell me I should recreate all my NVMe pools. :cry:

You can use this to eke more performance out of an NVMe drive for a specific workload on a home machine, sure, but the question is, how many of those workloads do you have, and how many of them need the theoretical maximum throughput of your NVMe drive (which it likely can’t produce for long anyway)?

Right now, my heaviest workloads are probably hosting QEMU/KVM VMs and LXCs, and hosting home-server-scale instances of MariaDB and MySQL 8, which mostly do reads. It’s sounding more and more like this is a level of optimization I don’t really need to consider unless a tutorial tells me to turn it on or something.

My drives are already speed-constrained by being in downgraded NVMe slots (I hate modern prosumer motherboards), so I’m already not capable of pushing a pair of PCIe 4.0 x4 drives to their limits. (One is running at PCIe 4.0 x4, and the other is at something like PCIe 3.0 x1.) And they’re still more than enough to saturate the 2x10 Gbps LACP bond that connects that NAS to the outside world.

I do 12 across the board. I don’t care if 9 performs a bit better or improves compression ratios. 512 byte logical sectors are going away (sooner or later) and I don’t want to be sitting here with an ashift=9 pool when UPS delivers a 4kn replacement drive.

The benchmarking I ran two years ago when I built my pool didn’t show any real difference one way or the other. I’ve tried 13 against every SSD in the house and didn’t see any improvement there either.
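
For anyone curious what their existing setup actually reports, a couple of read-only checks (pool and device names below are placeholders):

    # What the drives advertise as logical/physical sector sizes
    lsblk -o NAME,MODEL,LOG-SEC,PHY-SEC

    # The ashift baked into each vdev of an existing pool
    zdb -C tank | grep ashift

    # The pool-level ashift property (0 means auto-detect for newly added vdevs)
    zpool get ashift tank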

2 Likes

Ashift=9 makes compression worse, not better. But, more to the real point:

If you aren’t unhappy with your pool’s performance in actual daily use, there is no point in tearing it apart just to rebuild it slightly faster when you were already happy with it anyway.

1 Like

I need to print this out and stick it on the wall over my NAS, just to remind myself every once in a while when I fall into an optimization spiral. :stuck_out_tongue:

2 Likes

The announcement that TrueNAS supports RAIDZ expansion is from 29 October 2024:

From what I understand, what they did is implement the OpenZFS pull request that was already created in 2023.

This PR was rebased and approved in October 2024. Looks like TrueNAS didn’t want to wait for the next big release containing that PR; they just implemented it in their product.

Long story short: RAIDZ expansion is all the same everywhere.

I have a couple of WD SN850X NVMe SSDs. They come out of the box set to 512-byte sectors (for legacy OSes) and need to be manually switched to 4K sectors.
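
On Linux that switch is usually done with nvme-cli; a sketch, with the caveat that the format index varies by drive and reformatting wipes the namespace (device name is a placeholder):

    # List the LBA formats the drive supports and which one is in use
    nvme id-ns -H /dev/nvme0n1 | grep "LBA Format"

    # Switch to the 4K format (index 1 here is only an example; check the
    # output above first, and note this DESTROYS all data on the namespace)
    nvme format /dev/nvme0n1 --lbaf=1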

This actually came up on an episode of the TrueNAS/iX Systems podcast: https://www.youtube.com/watch?v=WxDdKrLGK5Y

I got some clarification in the comments.

Me:

How much will the user need to know about their workload to use the new Direct IO feature? For example, if I’ve got an NVMe mirror pool that Proxmox accesses to store VMs (over either iSCSI or NFS), how much configuration will I need to do on TrueNAS to get DirectIO working?

Is the implementation of DirectIO similar to the default sync setting on datasets (default: client indicates the kind of sync it wants), where the client indicates whether the DirectIO feature should be toggled on or not? In that case it would seem like the user would just need to handle configuring the client (Proxmox, *SQL, etc.) to use DirectIO.

iX:

The default behavior in OpenZFS 2.3 is “direct=standard” where it will handle recordsize-aligned IO through the Direct IO path - unaligned IO will still take the buffered path through the ARC. Tuning your recordsize or volblocksize (for ZVOLs) and client filesystem or file chunk size will probably be important in order to ensure it’s actually taking that “shortcut” - we’ll likely have more to share on this in the future.

Share away - although I (Chris) did misspeak on one part of it, anything that gets a “proper” Direct IO read won’t land in the ARC after service to client. See the PR for more details: https://github.com/openzfs/zfs/pull/10018
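
For reference, the “direct=standard” behavior Chris describes is exposed as an ordinary dataset property in 2.3, so (if I’m reading the PR right) the knob looks something like this on a new enough pool (dataset name is a placeholder):

    # See the current setting alongside the recordsize it needs to align with
    zfs get direct,recordsize tank/vmstore

    # Force the direct path even for clients that don't pass O_DIRECT,
    # or turn it off entirely for this dataset
    zfs set direct=always tank/vmstore
    zfs set direct=disabled tank/vmstore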

1 Like

But the numbers, Jim! They must go up!

:smiley: