OpenZFS 2.3.0 Release--Direct I/O Question; Also I don't understand how distros package ZFS, apparently?

See: Release zfs-2.3.0 · openzfs/zfs · GitHub

One of the features is Direct I/O, which bypasses the ARC. Apparently, this gives big performance improvements for NVMe pools? Is there a recommended article explaining exactly how this works?

  1. I think I must not understand something, as I assumed RAM was always faster than NVMe storage. Maybe that’s not always true anymore?
  2. Alternatively, does the overhead of moving data in and out of the ARC really cancel out RAM’s speed advantage that hard?
  3. Proxmox is still on ZFS 2.2.6, so I assume this feature isn’t there yet. I’m guessing Direct I/O will have good implications for using NVMe mirrors for VM stores.

Though sometimes things get backported, which leads me to …

I’m really confused about ZFS feature packaging now. As an example, TrueNAS has had RAIDZ expansion for … at least part of Electric Eel 24.10’s life cycle (sometime in September 2024, according to Lawrence Systems). Was it back-ported for TrueNAS specifically, or in OpenZFS as a beta?

iXsystems helped develop RAIDZ expansion, so they made the decision to include it early.

I’m sure they have their own internal OpenZFS base, which is ahead of the main work tree.

I can only assume that Direct I/O would mostly be beneficial for a workload that doesn’t include reading the same blocks frequently, thereby negating any benefit that the ARC would give you.

I’ll admit I don’t know a whole lot about how that feature works, but that’s what makes sense to me.

Probably not something that the average person will be turning on.

1 Like

iXsystems (TrueNAS vendor) absolutely keeps their own tree, and they pull features out of OpenZFS master and put them in prod on a regular basis; they frequently don’t even wait for new features to hit release status first. Not my favorite habit of theirs.

1 Like

Cache hits in RAM will be faster than pulling data off the metal, even on fast NVMe. The thing is, NVMe is direct I/O itself. The pipeline between the CPU and the storage stack is essentially the shortest possible straight line, by comparison to earlier storage pipelines like SATA.

I am really not an expert in this layer of things, but my understanding is that if you make your storage operations simple enough, it can pretty much be straight from the CPU to the metal. But once you start adding in complexities like caching, you may wind up needing to consult RAM quite a lot in addition to the actual storage ops, which can mean introducing a lot of additional CPU cycles. Making it worse, those extra CPU cycles are still taking place on the same CPU thread: so now you’re bottlenecking not just on CPU, but on single-threaded CPU performance.

Again, to the best of my understanding, this is mostly an issue with single-threaded, throughput-bound operations (which is, in fact, where I see OpenZFS on NVMe bog down horribly by comparison to simpler filesystems like ext4). The heavier-duty and more parallel the workload, in my experience, the smaller the delta between OpenZFS and ext4 / UFS2 / whatever on NVMe drives.

1 Like

Yeah it’s kind of the reverse of distribution security fix backports. :grin:

Thanks for the context and additional info on this. :slight_smile:

So, since it’s not all operations, it would be something you’d enable per application/service/workload?

I’d think that most of the time the end user of a complex service/system like TrueNAS or Proxmox wouldn’t really know, or have a way of knowing, whether operations are bound by single-threaded throughput, so in those cases you’d be banking on sensible defaults or something in the docs along the lines of: “tune it like this if you’re running ZFS on an NVMe pool.” So, mostly this isn’t something for end users (especially home/small office/homelab end users) to overthink. Does that sound reasonable?

(Apparently, MariaDB has a configuration file setting to enable Direct I/O, but MariaDB seems to favor as little optimization as possible in favor of the server successfully starting even if you install it on a potato with delusions of grandeur. In the SOHO sphere, database server configuration seems to be the only place where expecting the user to make those sorts of changes is the norm. :stuck_out_tongue: )
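
If it’s the setting I’m thinking of, it’s InnoDB’s flush method rather than a server-wide switch; a minimal sketch of the relevant my.cnf lines (worth double-checking against the MariaDB docs):

    [mysqld]
    # Ask InnoDB to open its data files with O_DIRECT, skipping the OS page
    # cache (InnoDB keeps its own buffer pool, so double-caching is mostly waste)
    innodb_flush_method = O_DIRECT

On ZFS, of course, that request only becomes a true direct path once the pool is on OpenZFS 2.3.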

1 Like

I’d want to benchmark my use case with direct I/O vs without.

Based on my loosely following the OpenZFS GitHub discussion… My understanding is that Direct I/O becomes more useful as NVMe device count increases. The overhead of ARC processing is significant when you have a half-dozen super-fast storage devices that can potentially be accessed in parallel. With 2-3 cheap consumer-grade devices it might not matter as much.

It takes time to ping-pong data blocks across memory locations as they work their way through the I/O stack. In the olden days of spinning rust and SATA SSDs this overhead wasn’t significant. In the modern age of PCIe 4.0/5.0 NVMe devices with pseudo-SLC write cache it starts to matter. Direct I/O reduces the number of ping-pongs.

My $.02 FWIW. I’m too broke to own and play with 4GB/sec devices so it’s all hearsay…
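
For that kind of A/B test, fio with and without O_DIRECT is probably the quickest way to see whether it matters on your hardware; a rough sketch (the file path, sizes, and job counts below are made up, adjust to your pool):

    # Buffered reads, serviced through the ARC
    fio --name=buffered --filename=/tank/fio.test --size=16G --bs=128k \
        --rw=randread --ioengine=libaio --iodepth=16 --numjobs=4 \
        --direct=0 --runtime=60 --time_based --group_reporting

    # Same job with O_DIRECT, which a 2.3 pool can service via the Direct I/O path
    fio --name=direct --filename=/tank/fio.test --size=16G --bs=128k \
        --rw=randread --ioengine=libaio --iodepth=16 --numjobs=4 \
        --direct=1 --runtime=60 --time_based --group_reporting

Keeping --bs matched to the dataset’s recordsize matters here, since unaligned requests fall back to the buffered path.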

According to the discussion here:

in highly-concurrent workloads… the overhead in managing dynamic taskqs can become rather significant

Which suggests to me that the ZIO pipeline is over-optimizing the I/O scheduling and taking too long to pull the blocks off the disk, and that this wouldn’t be limited to single-threaded workloads. But this is out of my wheelhouse, so I may be misapprehending.

Also, this only has any effect if the application is reading the file with the O_DIRECT flag, which I’ve always thought of as the “I don’t need no stinkin’ OS caching” flag, but maybe it has a different meaning now in the age of faster storage. IIUC, O_DIRECT (when implemented in the filesystem) has the kernel copy the blocks off the disk directly into a userspace buffer, avoiding the memcpys and context switches needed to copy from the disk to kernelspace to userspace (and, as a result, the kernel doesn’t add the blocks to its disk cache). So I’m only used to seeing this flag used by programs that perform their own caching (or that know something won’t be accessed again and want to signal to the kernel not to cache it). As you mentioned, databases are an excellent example of an application that can cache more intelligently in userspace than relying on the kernel disk cache.
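
For anyone who hasn’t seen it from the application side, here’s a minimal sketch of an O_DIRECT read on Linux (error handling trimmed; the 4096-byte alignment is a stand-in for the real requirement, which is block-size alignment for the buffer, offset, and length):

    #define _GNU_SOURCE           /* O_DIRECT is a Linux extension */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        /* The buffer, offset, and length all have to be suitably aligned,
           or the read fails with EINVAL. */
        void *buf;
        if (posix_memalign(&buf, 4096, 1 << 20) != 0)
            return 1;

        /* Hypothetical file; the flag asks the kernel to skip the page cache
           and move data more or less straight into our buffer. */
        int fd = open("/tank/bigfile.bin", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        ssize_t n = pread(fd, buf, 1 << 20, 0);
        printf("read %zd bytes\n", n);

        close(fd);
        free(buf);
        return 0;
    }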

Generally speaking, yes. Although in practice, I suspect you’d find the smartest use cases in dedicated server type setups instead: isolate the workload a single system manages to only the stuff that direct io enhances, turn it on for that server, and as much of the workload as would NOT be helped significantly, move to another (possibly cheaper, possibly not NVMe) server.

The workload I know of that’s REALLY benefited by this is AI dataset ingestion, on massive hardware. You can use this to eke more performance out of an NVMe drive for a specific workload on a home machine, sure, but the question is, how many of those workloads do you have, and how many of them need the theoretical maximum throughput of your NVMe drive (which it likely can’t produce for long anyway)?

This is kind of a contentious statement. You can see the issue even on a single consumer M.2 NVMe drive. It’s just not much of a REALISTIC issue, since those drives generally can’t hit anything approximating their theoretical maximum throughput on anything but the kind of artificial workload that people use to generate big numbers and try to claim bragging rights.

In actual practice, it is not AT ALL uncommon to see even fairly humble prosumer SATA SSDs absolutely mop the floor with consumer M.2 drives claiming orders of magnitude higher throughput. For one thing, as mentioned, your workload is not the same workload used to generate the big numbers in either the M.2 drive’s marketing OR the SATA drive’s marketing! And for another, M.2 is absolutely notorious for thermal throttling, as well as the usual cache exhaustion issues, which comes as a deeply unpleasant surprise the first time you want to use the drive for a heavy workload for more than 10-15 consecutive seconds.

I’m not telling y’all “never ever buy M.2, it’s universally awful” but I am absolutely cautioning people to stop thinking that M.2 drives are automatically, or even usually, a massive cut above SATA drives on demanding workloads!

I thought about this thread when I stumbled across this USENIX paper (pg. 202 in the pdf).

Has me wondering if ZFS has an unintentional knack for triggering these “die-level collisions.” Some of the performance data in this paper is shockingly bad. This study seems right up your alley if you haven’t seen it already…

Has me wondering if all these devices that don’t seem to get any faster at ashift >= 13 vs. 9/12 might actually become faster with the right (wrong) write pattern.

1 Like

Yeah, they kinda tend to. There’s no way in hell that NVMe drives actually use 512B “sectors”, but in my experience the VAST majority of them perform the best–and the most consistently–with ashift=9.

1 Like

There’s no way in hell that NVMe drives actually use 512B “sectors”, but in my experience the VAST majority of them perform the best–and the most consistently–with ashift=9.

Well, that’s disturbing.

Proxmox defaults to ashift=12, and all the ZFS literature that comes up in web searches hammers on the dangers of write amplification with ashift=9.

I’ve never tried ashift=13, in part because I tend to buy less-bleeding-edge/older NVMe.

In retrospect, this is good. I was starting to feel too confident with ZFS, anyway. :stuck_out_tongue:

Please don’t tell me I should recreate all my NVMe pools. :cry:

You can use this to eke more performance out of an NVMe drive for a specific workload on a home machine, sure, but the question is, how many of those workloads do you have, and how many of them need the theoretical maximum throughput of your NVMe drive (which it likely can’t produce for long anyway)?

Right now, my heaviest workloads are probably hosting QEMU/KVM VMs and LXCs, and hosting home-server-scale instances of MariaDB and MySQL 8, which mostly do reads. It’s sounding more and more like this is a level of optimization I don’t really need to consider unless a tutorial tells me to turn it on or something.

My drives are already speed-constrained by being in downgraded NVMe slots (I hate modern prosumer motherboards), so I’m already not capable of pushing a pair of PCIe 4.0 x4 drives to their limits. (One is running at PCIe 4.0 x4, and the other is at something like PCIe 3.0 x1.) And they’re still more than enough to saturate the 2x10 Gbps LACP bond that connects that NAS to the outside world.

I do 12 across the board. I don’t care if 9 performs a bit better or improves compression ratios. 512 byte logical sectors are going away (sooner or later) and I don’t want to be sitting here with an ashift=9 pool when UPS delivers a 4kn replacement drive.

The benchmarking I ran two years ago when I built my pool didn’t show any real difference one way or the other. I’ve tried 13 against every SSD in the house and didn’t see any improvement there either.
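
For anyone curious what their existing setup actually reports, a couple of read-only checks (pool and device names below are placeholders):

    # What the drives advertise as logical/physical sector sizes
    lsblk -o NAME,MODEL,LOG-SEC,PHY-SEC

    # The ashift baked into each vdev of an existing pool
    zdb -C tank | grep ashift

    # The pool-level ashift property (0 means auto-detect for newly added vdevs)
    zpool get ashift tank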

2 Likes

Ashift=9 makes compression worse, not better. But, more to the real point:

If you aren’t unhappy with your pool’s performance in actual daily use, there is no point in tearing it apart just to rebuild it slightly faster when you were already happy with it anyway.

1 Like

I need to print this out and stick it on the wall over my NAS, just to remind myself every once in a while when I fall into an optimization spiral. :stuck_out_tongue:

2 Likes

The announcement that TrueNAS supports RAIDZ expansion is from 29 October 2024:

From what I understand, what they did is implement the OpenZFS pull request that was already created in 2023.

This PR was rebased and approved in October 2024. Looks like TrueNAS didn’t want to wait for the next big release containing that PR; they just implemented it in their product.

Long story short: RAIDZ expansion is all the same everywhere.

I have a couple of WD SN850X NVMe SSDs. They come out of the box set to 512-byte sectors (for legacy OSes) and need to be manually switched to 4K sectors.
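
On Linux that switch is usually done with nvme-cli; a sketch, with the caveat that the format index varies by drive and reformatting wipes the namespace (device name is a placeholder):

    # List the LBA formats the drive supports and which one is in use
    nvme id-ns -H /dev/nvme0n1 | grep "LBA Format"

    # Switch to the 4K format (index 1 here is only an example; check the
    # output above first, and note this DESTROYS all data on the namespace)
    nvme format /dev/nvme0n1 --lbaf=1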

This actually came up on an episode of the TrueNAS/iX Systems podcast: https://www.youtube.com/watch?v=WxDdKrLGK5Y

I got some clarification in the comments.

Me:

How much will the user need to know about their workload to use the new Direct IO feature? For example, if I’ve got an NVMe mirror pool that Proxmox accesses to store VMs (over either iSCSI or NFS), how much configuration will I need to do on TrueNAS to get DirectIO working?

Is the implementation of DirectIO similar to the default sync setting on datasets (default: client indicates the kind of sync it wants), where the client indicates whether the DirectIO feature should be toggled on or not? In that case it would seem like the user would just need to handle configuring the client (Proxmox, *SQL, etc.) to use DirectIO.

iX:

The default behavior in OpenZFS 2.3 is “direct=standard” where it will handle recordsize-aligned IO through the Direct IO path - unaligned IO will still take the buffered path through the ARC. Tuning your recordsize or volblocksize (for ZVOLs) and client filesystem or file chunk size will probably be important in order to ensure it’s actually taking that “shortcut” - we’ll likely have more to share on this in the future.

Share away - although I (Chris) did misspeak on one part of it, anything that gets a “proper” Direct IO read won’t land in the ARC after service to client. See the PR for more details: https://github.com/openzfs/zfs/pull/10018
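
For reference, the “direct=standard” behavior Chris describes is exposed as an ordinary dataset property in 2.3, so (if I’m reading the PR right) the knob looks something like this on a new enough pool (dataset name is a placeholder):

    # See the current setting alongside the recordsize it needs to align with
    zfs get direct,recordsize tank/vmstore

    # Force the direct path even for clients that don't pass O_DIRECT,
    # or turn it off entirely for this dataset
    zfs set direct=always tank/vmstore
    zfs set direct=disabled tank/vmstore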

1 Like

But the numbers, Jim! They must go up!

:smiley: