Understanding ZFS fio benchmarks

Hi All,

I am struggling to understand if my fio benchmarks are showing me good or bad results.

I am currently running 8x 2.4TB SAS 10k drives. I have them in a pool of mirrors. I think that means 4 vdevs, right?

I was under the assumption that my reads and writes should be fast in this config, but the way I am interpreting my fio results, they seem slow.

FIO Command 1:

fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1

FIO Results 1:

Run status group 0 (all jobs):
  WRITE: bw=15.2MiB/s (15.9MB/s), 15.2MiB/s-15.2MiB/s (15.9MB/s-15.9MB/s), io=1016MiB (1065MB), run=66959-66959msec

15.9MB/s seems slow, does it not?

When I do a randread I get the following:

FIO Command 2:

fio --name=random-write --ioengine=posixaio --rw=randread --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1

FIO Results 2:

Run status group 0 (all jobs):
   READ: bw=96.9MiB/s (102MB/s), 96.9MiB/s-96.9MiB/s (102MB/s-102MB/s), io=5813MiB (6096MB), run=60001-60001msec

102MB/s seems great.

Is the issue that for writes I have too many drives to write to, and that is creating a slowdown?

I am wondering if this is what is causing my Windows VMs to be slow.

Thanks for reading this far

The issue is that you’re asking a bunch of rust to do single-process 4K random reads with an iodepth of 1. That means each individual block must be pulled on its lonesome, with no parallelization possible on the compute side.

As a result, you’re bound heavily by the single-sector latency of your individual drives.

Try it again with numjobs=4 or numjobs=8 and see what your results look like. Or use a higher blocksize, or at the minimum, give it some iodepth so it can request multiple blocks at once, rather than having to wait for the first block to return before it can fetch the next.
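Concretely, the parallelized version of the read test might look like this (a sketch of the same command, with only the job name, numjobs, and iodepth changed):

```shell
# Same 4K random-read test, but with 8 parallel jobs and a queue
# depth of 8, so multiple blocks are in flight at once instead of
# each request waiting on single-sector latency.
fio --name=random-read --ioengine=posixaio --rw=randread --bs=4k \
    --numjobs=8 --size=4g --iodepth=8 --runtime=60 --time_based
```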

This is not ZFS-specific, by the way. You’d be every bit as disappointed with any other type of RAID, with this specific workload.

If you want to accelerate single-process 4K random I/O, you pretty much need faster drives; there’s not much that topology can do for you. With that said, mirrors are still definitely the right choice if 4K I/O is what you’re concerned with–because in a RAIDz2 vdev, you’d be experiencing a godawful 33% storage efficiency, not the on-paper storage efficiency you expected. That’s because the Z2 would have to store each block as an undersized stripe on three disks only, with one disk being data and the other two being parity, regardless of the total width of the vdev!

Thank you for your response.

I gave it a shot with random write at iodepth=8 and numjobs=8, and things actually seem worse.


fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=8 --size=4g --iodepth=8 --runtime=60 --time_based --end_fsync=1

Here are the results:

Run status group 0 (all jobs):
  WRITE: bw=9607KiB/s (9838kB/s), 1162KiB/s-1326KiB/s (1190kB/s-1358kB/s), io=589MiB (617MB), run=60870-62750msec

I may not be entering these benchmarks correctly, because I am definitely a noob when it comes to storage optimization.

My thought process for doing 4KB block sizes is that, since I am running VMs on these drives, that is what the OS would most likely need, right?

I have my databases on a different dataset that is optimized for 64k since I need to use SQL Server.

64KiB is typically what you optimize for on general purpose VMs.

Here’s a baseline to compare to: ZFS versus RAID: Eight Ironwolf disks, two filesystems, one winner | Ars Technica

Essentially, at the three mirrors on rust level, you’re looking for around 30MiB/sec (untuned, recordsize=128K) or close to 200MiB/sec (tuned, recordsize=4K) when you’re doing 4KiB writes with numjobs=8, iodepth=8.

With that said, do NOT optimize for 4K unless you’ve genuinely got an EXTREMELY 4K biased workload… Which, realistically, you just plain do not. Stick with 64K unless you’ve got a very database heavy workload on an engine that uses smaller page sizes–16K for MySQL, potentially 8K for postgres.

But even then, you do not put the entire VM on a dataset with recordsize that small. Instead, you give the VM a root virtual disk on rs=64K, and another virtual disk for ONLY the database storage at the smaller recordsize.
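As a sketch of that layout (the pool and dataset names here are made up, and 16K assumes a MySQL/InnoDB-style page size):

```shell
# Root virtual disk dataset for general-purpose VM I/O, at 64K:
zfs create -o recordsize=64K tank/vm/winroot

# Separate dataset just for the database virtual disk, matched to
# the engine's page size (e.g. 16K for MySQL, 8K for Postgres):
zfs create -o recordsize=16K tank/vm/windb
```

The VM then gets two virtual disks, one backed by each dataset, so only the database files pay the small-record tradeoffs.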

Gotcha. Is the recommendation to use 64K record sizes based on the fact that qcow2 uses 64KB by default for its cluster size?

I have my current virtual machines on Proxmox, which is using zvols with an 8K volblocksize. I am really thinking about getting off Proxmox so that I can fine-tune my datasets and qcow2 files. Would 64K be what you would recommend for a Windows C drive, for example, to have Windows run smoothly?

I believe Windows uses 4K cluster sizes if the volume is below 16TB. Wouldn't it be good to match that with the underlying ZFS filesystem?

Sorry I am getting a bit off topic to the original question but this may help me understand the theory of it better :slightly_smiling_face:

Sorta-kinda, but not entirely: if you’re using qcow2, you DEFINITELY need to match your recordsize to the cluster_size parameter you used when you qemu-img created the qcow2 file. But even if you’re using raw, I recommend that size, for the same reason that QEMU defaults to it in the first place–it’s about the best blend between “not too much write amplification on the low end” and “not too much IOPS amplification on the high end.”
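For example, matching the two might look like this (hypothetical paths and names; the point is just that cluster_size and recordsize agree at 64K):

```shell
# Dataset that will hold the qcow2 files, recordsize set to 64K:
zfs create -o recordsize=64K tank/vm/images

# qcow2 image created with a matching 64K cluster size:
qemu-img create -f qcow2 -o cluster_size=64K \
    /tank/vm/images/win10.qcow2 100G
```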

Even with ZVOL, I would recommend trying a larger volblocksize; Proxmox’s default 8K zvols perform horribly on most applications. I’m not 100% sure 64K will be the sweet spot there–might be 32K–but I’m quite positive that 8K was a terrible idea that the Proxmox devs should feel terrible about inflicting on so many users.
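A sketch of creating a zvol with a larger volblocksize (hypothetical names; note that unlike recordsize, volblocksize is fixed at creation time and can't be changed afterward):

```shell
# 100G zvol with a 64K volblocksize instead of Proxmox's old 8K
# default; must be set at creation, not retrofitted later.
zfs create -V 100G -o volblocksize=64K tank/vm/win10-disk0
```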

It’s probably worth noting that the latest PVE now defaults to 16K volblocksize, not 8K. Which may or may not have anything to do with the amicus curiae style bug report I submitted at the behest of affected Proxmox users last year, on that very topic. :cowboy_hat_face:

This is great info, thank you!

I was under the assumption, however, that a Windows VM using 4K cluster sizes would waste space if I set my recordsize to 64K.

I think my understanding of this is wrong.

I thought that setting the recordsize to 64K means that is the minimum size of a record. If Windows is storing 4K clusters, wouldn't that mean I am losing 60K per record?

Do you have any resources a noob like me can read?

Windows can and will still issue 4KiB writes when it wants to (which isn’t frequently, for most workloads) but they’re batched up at the host level and hit the real metal in 64KiB (if rs=64K) blocks. You get hugely improved throughput for everything bigger than 4KiB writes, and the only real impact on your 4KiB I/O is a little read amplification… Which also won’t actually affect you in most workloads, since most 4KiB writes will be read back in larger batches anyway, when and if they’re read back (eg log streams) and so the extra data usually isn’t a “wasted” read anyway.

Like I said, database engines are where you need to get careful and consider multiple datasets with individually tuned recordsize. Beyond that, you’re basically just shooting for an intermediate value that strikes a balance between small I/O latency (smaller recordsize) and large I/O throughput (larger recordsize).

OpenZFS almost gets you there already with the default 128KiB, and for the same reason. We’re just going one step narrower than that, because VM storage behaves a lot like a DB engine that uses 64KiB extents.