What % of disk bandwidth is feasible for a guest filesystem?

I have a RAIDZ2 array (on Ubuntu 24.04.2 LTS with zfs-2.2.2-0ubuntu9.2) where the main host ZFS filesystem reaches ~400 MB/s, whereas a guest filesystem on this array (an image file created with truncate or fallocate, then mkfs.ext4 and mounted) reaches ~170 MB/s. This was measured with `sudo fio --name=seqwrite --rw=write --direct=1 --ioengine=libaio --bs=1M --numjobs=1 --size=10G --runtime=600 --group_reporting`; results with `--ioengine=sync` were lower but comparable.
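For anyone who wants to reproduce the comparison, the guest-filesystem setup was roughly the following (paths and sizes here are illustrative, not the exact ones I used):

```
# create a sparse image file on the ZFS dataset (path/size are examples)
sudo truncate -s 20G /tank/images/guest.img    # or: sudo fallocate -l 20G ...

# format it as ext4 (-F skips the "not a block device" prompt)
sudo mkfs.ext4 -F /tank/images/guest.img

# loop-mount it and run the same benchmark inside it
sudo mkdir -p /mnt/guest
sudo mount -o loop /tank/images/guest.img /mnt/guest
cd /mnt/guest
sudo fio --name=seqwrite --rw=write --direct=1 --ioengine=libaio \
     --bs=1M --numjobs=1 --size=10G --runtime=600 --group_reporting
```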

This got me thinking: how close can a guest filesystem get to the host in terms of disk bandwidth? What disk bandwidth have you achieved for a guest filesystem relative to the host ZFS filesystem through performance tuning? (I care more about sequential disk bandwidth than IOPS in this use case, given the data involved.)

This is more of a discussion topic to collect anecdotes/anecdata about what is achievable than an attempt to optimize this particular system.

This includes filesystems residing on zvols or qcow2 images, of course. I have seen quite a lot of material on choosing the virtual disk format (such as this great post), but less discussion of guest vs. host disk bandwidth.

Sequential write results aren’t really useful. You’ve essentially told fio to just cosplay as dd.

If you want a reasonable facsimile of “maximum throughput with only relatively easy traffic”, do a 25/75 read/write mix of 1MiB random I/O. Top speed is generally fine here; latency is very rarely a problem compared to throughput, with this type of workload.
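For example, something along these lines (the job name, target directory, and sizes are placeholders; point --directory at the filesystem you're testing):

```
fio --name=easy-1m --directory=/path/to/test/fs \
    --rw=randrw --rwmixread=25 --bs=1M \
    --direct=1 --ioengine=libaio --iodepth=8 \
    --size=10G --runtime=300 --time_based --group_reporting
```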

If you want a reasonable facsimile of “maximum throughput with nasty database type activity”, try a 25/75 read/write mix of 16K random I/O. Then try it again with throughput limited to roughly 3/4 of the maximum throughput number you just got, and now pay attention to the latency numbers on your throughput-limited run.
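Concretely, a two-pass approach like this (paths, sizes, and the rate caps are placeholders):

```
# Pass 1: unthrottled; note the read/write bandwidth fio reports
fio --name=db-16k --directory=/path/to/test/fs \
    --rw=randrw --rwmixread=25 --bs=16k \
    --direct=1 --ioengine=libaio --iodepth=16 \
    --size=10G --runtime=300 --time_based --group_reporting

# Pass 2: same job, capped at roughly 3/4 of what pass 1 achieved.
# --rate takes per-direction caps (read,write); replace these placeholder
# values with ~3/4 of the pass 1 numbers, then read the latency
# percentiles from this run rather than the bandwidth.
fio --name=db-16k-capped --directory=/path/to/test/fs \
    --rw=randrw --rwmixread=25 --bs=16k \
    --direct=1 --ioengine=libaio --iodepth=16 \
    --size=10G --runtime=300 --time_based --group_reporting \
    --rate=20m,60m
```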

If you want a reasonable facsimile of mixed I/O of the kind you’d get from heavy desktop usage, try a 25/75 read/write mix of 64K random I/O. After the first run, note the throughput you got. Now, repeat the run with the maximum throughput limited to roughly half what the full-speed run produced, and pay attention to the latency on that throughput-limited run.
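Same two-pass idea as the 16K case above, just with bs=64k and the cap set to about half of the full-speed result; the throttled pass would look something like this (caps are placeholders again):

```
fio --name=desktop-64k-capped --directory=/path/to/test/fs \
    --rw=randrw --rwmixread=25 --bs=64k \
    --direct=1 --ioengine=libaio --iodepth=8 \
    --size=10G --runtime=300 --time_based --group_reporting \
    --rate=15m,45m
```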

You can get even fancier than that, and directly mix 4K, 64K, and 1M I/O in the same run, with different mixes of reads, writes, throughput limits, and more on each stream all at the same time. But that’s getting a bit advanced, and there’s not a whole lot of point unless you REALLY understand the workload you’re trying to model, and are trying to model it as accurately as possible.
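If you do go down that road, a fio job file is the easiest way to run the streams concurrently. A rough sketch (every mix, size, and cap here is made up; tune them to the workload you're actually modeling):

```
; mixed-streams.fio -- run with: fio mixed-streams.fio
[global]
directory=/path/to/test/fs
direct=1
ioengine=libaio
time_based
runtime=300

[small-4k]
rw=randrw
rwmixread=75
bs=4k
iodepth=16
size=2G

[desktop-64k]
rw=randrw
rwmixread=25
bs=64k
iodepth=8
size=4G
; per-direction caps (read,write)
rate=10m,30m

[bulk-1m]
rw=randwrite
bs=1M
iodepth=4
size=10G
```

Without group_reporting, each stream reports its own throughput and latency, which is what you want when you're trying to see how the streams interfere with each other.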

Thank you! I appreciate your input. How would you simulate (in fio) more of a data-transfer workload (e.g., rsync of large media files on the order of 10s to 100s of GB)? Or how would you simulate a process similar to zfs send/recv?

In terms of guest filesystems, do you ever get close to the IO bandwidth of a host ZFS filesystem? Or is there usually a relatively large loss (e.g., ~50%) of performance?

1MiB random I/O. If you’re really only concerned about rsync in one direction, then 100% read or 100% write. If that’s the only major workload you expect to be happening when you run it, just run the full-speed test and check throughput, and don’t worry about latency.

If you’re worried about the impact on normal operation while one of those rsync runs is going… do the 1MiB random read or write (as appropriate), and alongside it run, for instance, the 25/75 R/W random 64K mix limited to about 25% of your normal max throughput for that 64K workload, then look at latency. Ignore the throughput on the job simulating the rsync run; it’s just there to put pressure on the system so you can see how badly it hurts the latency of your desktop-type load.
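As a sketch of that combined run (again, every number is a placeholder; set the 64K caps to roughly 25% of that workload's standalone throughput, and read latency only from the desktop-64k job):

```
; rsync-pressure.fio -- run with: fio rsync-pressure.fio
[global]
directory=/path/to/test/fs
direct=1
ioengine=libaio
time_based
runtime=300

; simulates the rsync-style bulk transfer; ignore its numbers,
; it's only there to load the pool
[rsync-writes]
rw=randwrite
bs=1M
iodepth=4
size=20G

; the "normal operation" stream whose latency you actually care about
[desktop-64k]
rw=randrw
rwmixread=25
bs=64k
iodepth=8
size=4G
; per-direction caps (read,write)
rate=8m,25m
```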

It’s probably worth noting that rsync is actually going to require reads AND writes, on the target side, since it’s got to grovel over every block of the target before deciding which blocks it needs to pull from the source. So you’ll have, essentially, large reads of the same blocks twice on the source, vs large reads first and large writes later on the target. Again, this is all assuming we’re talking about your “large media files” as basically a solo load.

There’s also some 4K random I/O from needing to stat all the files first, but you don’t really need to worry about that if what you’re rsyncing is all large media files; there won’t be enough of them for that phase to make much of an impact. If you ever need to rsync tens of thousands of small files, though… oof. Lotta 4K. Best to avoid needing to do that at all, if at all possible!

Definitely, it just needs to be properly tuned. If you’re working with large media files, you need recordsize=1M at minimum; ideally larger than that if you’re working with RAIDz (which splits blocks up into pieces distributed across the drives in the vdev). In that case, you’d like 1M of random I/O per disk, which would mean e.g. recordsize=4M if you’re rocking six-wide Z2 vdevs.
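The tuning itself is just a dataset property (the dataset name below is a placeholder); keep in mind it only affects blocks written after the change, and on older OpenZFS releases the module cap may need raising before values above 1M are accepted:

```
# per-dataset; only affects newly written blocks
zfs set recordsize=1M tank/media

# for >1M records on releases where the default cap is still 1M,
# raise zfs_max_recordsize first (as root), then set the property
echo 4194304 > /sys/module/zfs/parameters/zfs_max_recordsize
zfs set recordsize=4M tank/media
```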

On mirrors, there’s not usually much point in bumping recordsize up past 1M. You’ll see some gains, on workloads with really massive files, but not usually enough to matter.


Thank you very much for all your insights! This is very helpful! I appreciate you taking the time to share your expertise.
