Understanding ZFS fio benchmarks

Hi All,

I am struggling to understand if my fio benchmarks are showing me good or bad results.

I am currently running 8x 2.4TB SAS 10k drives. I have them in a pool of mirrors. I think that means 4 vdevs, right?

I was under the assumption that my reads and writes should be fast in this config, but the way I am interpreting my fio results, they seem slow.

FIO Command 1:

fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1

FIO Results 1:

Run status group 0 (all jobs):
  WRITE: bw=15.2MiB/s (15.9MB/s), 15.2MiB/s-15.2MiB/s (15.9MB/s-15.9MB/s), io=1016MiB (1065MB), run=66959-66959msec

15.9MB/s seems slow, does it not?

When I do a randread I get the following:

FIO Command 2:

fio --name=random-write --ioengine=posixaio --rw=randread --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1

FIO Results 2:

Run status group 0 (all jobs):
   READ: bw=96.9MiB/s (102MB/s), 96.9MiB/s-96.9MiB/s (102MB/s-102MB/s), io=5813MiB (6096MB), run=60001-60001msec

102MB/s seems great.

Is the issue that for writes I have too many drives that I need to write to which is creating a slowdown?

I am wondering if this is what is causing my Windows VMs to be slow.

Thanks for reading this far

2 Likes

The issue is that you’re asking a bunch of rust to do single-process 4K random reads with an iodepth of 1. That means each individual block must be pulled on its lonesome, with no parallelization possible on the compute side.

As a result, you’re bound heavily by the single-sector latency of your individual drives.

Try it again with numjobs=4 or numjobs=8 and see what your results look like. Or use a higher blocksize, or at the minimum, give it some iodepth so it can request multiple blocks at once, rather than having to wait for the first block to return before it can fetch the next.
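
For instance, a more parallel variant of your read test might look something like this (just a sketch; adjust numjobs and iodepth to taste):

# same 4K random reads, but 8 workers each keeping 8 I/Os in flight,
# so all four mirror vdevs can stay busy at once
fio --name=parallel-randread --ioengine=posixaio --rw=randread --bs=4k --numjobs=8 --iodepth=8 --size=4g --runtime=60 --time_based --group_reporting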

This is not ZFS-specific, by the way. You’d be every bit as disappointed with any other type of RAID, with this specific workload.

If you want to accelerate single-process 4K random I/O, you pretty much need faster drives; there’s not much that topology can do for you. With that said, mirrors are still definitely the right choice if 4K I/O is what you’re concerned with–because in a RAIDz2 vdev, you’d be experiencing a godawful 33% storage efficiency, not the on-paper storage efficiency you expected. That’s because the Z2 would have to store each block as an undersized stripe on three disks only, with one disk being data and the other two being parity, regardless of the total width of the vdev!
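
To put numbers on that (assuming ashift=12, i.e. 4KiB sectors):

  1 data sector + 2 parity sectors = 3 sectors written per 4KiB block
  storage efficiency = 1/3 ≈ 33%
  (a 2-way mirror writes 1 data sector + 1 copy = 2 sectors, i.e. 50%)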

Thank you for your response.

I gave it a shot with random write at iodepth=8 and numjobs=8, and things actually seem worse.

Command:

fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=8 --size=4g --iodepth=8 --runtime=60 --time_based --end_fsync=1

Here are the results:

Run status group 0 (all jobs):
  WRITE: bw=9607KiB/s (9838kB/s), 1162KiB/s-1326KiB/s (1190kB/s-1358kB/s), io=589MiB (617MB), run=60870-62750msec

I may not be entering these benchmarks correctly, because I am definitely a noob when it comes to storage optimization.

My thought process for using 4KB block sizes is that, since I am running VMs on these drives, I think that is what the OS would most likely need, right?

I have my databases on a different dataset that is optimized for 64k since I need to use SQL Server.

64KiB is typically what you optimize for on general purpose VMs.

Here’s a baseline to compare to: ZFS versus RAID: Eight Ironwolf disks, two filesystems, one winner | Ars Technica

Essentially, at the three mirrors on rust level, you’re looking for around 30MiB/sec (untuned, recordsize=128K) or close to 200MiB/sec (tuned, recordsize=4K) when you’re doing 4KiB writes with numjobs=8, iodepth=8.

With that said, do NOT optimize for 4K unless you’ve genuinely got an EXTREMELY 4K biased workload… Which, realistically, you just plain do not. Stick with 64K unless you’ve got a very database heavy workload on an engine that uses smaller page sizes–16K for MySQL, potentially 8K for postgres.

But even then, you do not put the entire VM on a dataset with recordsize that small. Instead, you give the VM a root virtual disk on rs=64K, and another virtual disk for ONLY the database storage at the smaller recordsize.
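
As a rough sketch of that layout (the pool and dataset names here are made up, and 16K is just the MySQL example from above):

zfs create -o recordsize=64K tank/vms        # root virtual disks for general-purpose VMs
zfs create -o recordsize=16K tank/vms-db     # virtual disk holding ONLY the database files
zfs get recordsize tank/vms tank/vms-db      # confirm both settings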

Gotcha. Is the recommendation to use 64K record sizes based on the fact that qcow2 uses 64KB by default for its cluster size?

I have my current virtual machines on Proxmox, which is using zvols with an 8K block size. I am really thinking about getting off Proxmox so that I can fine-tune my datasets and qcow2 files. Would 64K be what you would recommend for a Windows C: drive, for example, to have Windows run smoothly?

I believe Windows uses 4K cluster sizes if the volume is below 16TB. Wouldn't it be good to match that with the underlying ZFS file system?

Sorry, I am getting a bit off topic from the original question, but this may help me understand the theory of it better :slightly_smiling_face:

1 Like

Sorta-kinda, but not entirely: if you’re using qcow2, you DEFINITELY need to match your recordsize to the cluster_size parameter you used when you qemu-img created the qcow2 file. But even if you’re using raw, I recommend that size, for the same reason that QEMU defaults to it in the first place–it’s about the best blend between “not too much write amplification on the low end” and “not too much IOPS amplification on the high end.”
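
To make that concrete, a minimal sketch (the pool/dataset name, image path, and size are placeholders):

zfs create -o recordsize=64K tank/images       # dataset matched to the qcow2 cluster size
qemu-img create -f qcow2 -o cluster_size=64k /tank/images/winvm.qcow2 100G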

Even with ZVOL, I would recommend trying a larger volblocksize; Proxmox’s default 8K zvols perform horribly on most applications. I’m not 100% sure 64K will be the sweet spot there–might be 32K–but I’m quite positive that 8K was a terrible idea that the Proxmox devs should feel terrible about inflicting on so many users.
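
If you do stay on zvols, the general idea looks like this (names and size are placeholders; note that volblocksize is fixed at creation time, so an existing zvol has to be recreated and the data migrated onto it):

zfs create -V 100G -o volblocksize=64K tank/winvm-disk0
zfs get volblocksize tank/winvm-disk0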

It’s probably worth noting that the latest PVE now defaults to 16K volblocksize, not 8K. Which may or may not have anything to do with the amicus curiae style bug report I submitted at the behest of affected Proxmox users last year, on that very topic. :cowboy_hat_face:

1 Like

This is great info, thank you!

I was under the assumption, however, that a Windows VM using 4K cluster sizes would waste space if I set my recordsize to 64K.

I think my understanding of this is wrong.

I thought that setting the recordsize to 64K means that is the minimum size of a record. If Windows is storing 4K clusters, wouldn't that mean I am losing 60K per record?

Do you have any resources a noob like me can read?

Windows can and will still issue 4KiB writes when it wants to (which isn't frequently, for most workloads), but they're batched up at the host level and hit the real metal in 64KiB blocks (if rs=64K). You get hugely improved throughput for everything bigger than 4KiB writes, and the only real impact on your 4KiB I/O is a little read amplification… which also won't actually affect you in most workloads, since most 4KiB writes will be read back in larger batches anyway, when and if they're read back (e.g. log streams), so the extra data usually isn't a "wasted" read.

Like I said, database engines are where you need to get careful and consider multiple datasets with individually tuned recordsize. Beyond that, you’re basically just shooting for an intermediate value that strikes a balance between small I/O latency (smaller recordsize) and large I/O throughput (larger recordsize).

OpenZFS almost gets you there already with the default 128KiB, and for the same reason. We’re just going one step narrower than that, because VM storage behaves a lot like a DB engine that uses 64KiB extents.

1 Like

This has all been such great info. Thank you so much. I have some Windows VMs operating on 8k record sizes and have had poor performance.

I didn't see anyone else on YouTube or the forums point me in this direction.

It's possible that they are not using ZFS; however, wouldn't the whole record size thing also be a problem when using ext4, for example? I believe ext4 uses 4K blocks by default, so 4K ext4 partitions holding VMs might be just as bad as 4K datasets.

Yes, it’s still a problem outside the ZFS world. It’s usually more important to address it there at the RAID layer rather than the filesystem layer, though; you’d (ideally) be worrying about stride and chunk size in the conventional world.

If you want to see what that looks like, here is a random example (nb: I have not carefully evaluated the quality of the specific advice in that thread in context): Tuning of ext4 on LVM and RAID. - General - openmediavault
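
For reference, those knobs look roughly like this on a hypothetical mdraid array with a 64KiB chunk and four data disks (your own geometry will differ):

# stride = chunk size / ext4 block size = 64K / 4K = 16
# stripe-width = stride * number of data disks = 16 * 4 = 64
mkfs.ext4 -E stride=16,stripe-width=64 /dev/md0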

1 Like

Thank you. So if I were to benchmark my drives to show me VM performance, I should be doing that with something like this:

fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=64k --numjobs=8 --size=4g --iodepth=8 --runtime=60 --time_based --end_fsync=1

This assumes that my underlying dataset is 64k for optimal performance.

That’s how you should benchmark the drives at the host level to look for optimal performance in the typical tuning for a VM, yes.

With that said, if you want to tune the actual VM, the best way to do it is to run fio inside the VM, approximating the VM’s usual workload. (This may involve rebuilding the VM on different recordsizes several times. Nobody said tuning wasn’t a pain in the ass, which is why many experts–yours truly very much included–generally only tune for a general rule of thumb, and leave the fine-detailed stuff alone unless and until we actually need it!)
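
An in-guest run can mirror the host-side command, swapped to an ioengine Windows actually supports (posixaio isn't available there); run it from the drive you want to measure:

fio --name=guest-randwrite --ioengine=windowsaio --rw=randwrite --bs=64k --numjobs=8 --size=4g --iodepth=8 --runtime=60 --time_based --end_fsync=1 --group_reporting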

1 Like

That makes sense. I will likely need to rebuild some VMs.

Would you recommend setting up my Windows VMs to use 64K NTFS cluster sizes?

Windows VMs do very well under 64KiB recordsize datasets without retuning NTFS cluster sizes. I would advise avoiding that layer of tuning unless something gives you the impression that you really need it.
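
If you just want to confirm what a guest is currently using (no retuning involved), fsutil will report it from inside the VM; look at the "Bytes Per Cluster" line:

fsutil fsinfo ntfsinfo C: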

Just to follow up with my findings for you and anyone who follows this in the future:

I have two identical physical hosts set up with the exact same 8-disk, 4-vdev pool of mirrors.

Host 1 is running Proxmox and is using 8k zvols
Host 2 is running Ubuntu 24.04 and is using a 64k dataset as recommended by @mercenary_sysadmin :slightly_smiling_face:

I ran the following fio benchmark on the hypervisor to hopefully emulate a typical VM workload.

fio --name=randwrite --ioengine=posixaio --rw=randwrite --bs=64k --numjobs=8 --size=4g --iodepth=8 --runtime=60 --time_based --end_fsync=1 --group_reporting

On the Windows VMs I substituted windowsaio for posixaio as the ioengine.

I did 4 tests

  1. Directly on the pool that holds my zvols on Proxmox
  2. Directly on my dataset on the Ubuntu KVM server
  3. My super slow Windows Server VM (default 8k block size zvol) - Proxmox Host
  4. My new Windows 10 VM (64k record size on ZFS dataset) - Ubuntu KVM Host

The Windows Server 2012 VM has a separate partition for this test.
The Windows 10 VM is using the C: drive (since it is a fresh VM, I am hoping Windows won't interfere with it too much).

Results

Test 1 (Directly on pool - Proxmox)

randwrite: (groupid=0, jobs=8): err= 0: pid=1229003: Tue May 21 22:18:18 2024
  write: IOPS=4419, BW=276MiB/s (290MB/s)(19.5GiB/72282msec); 0 zone resets
Run status group 0 (all jobs):
  WRITE: bw=276MiB/s (290MB/s), 276MiB/s-276MiB/s (290MB/s-290MB/s), io=19.5GiB (20.9GB), run=72282-72282msec

Test 2 (Directly on pool - Ubuntu 24.04 KVM)

randwrite: (groupid=0, jobs=8): err= 0: pid=42883: Wed May 22 12:33:19 2024
  write: IOPS=7098, BW=444MiB/s (465MB/s)(29.8GiB/68697msec); 0 zone resets
Run status group 0 (all jobs):
  WRITE: bw=444MiB/s (465MB/s), 444MiB/s-444MiB/s (465MB/s-465MB/s), io=29.8GiB (32.0GB), run=68697-68697msec

Test 3 (On Windows server VM)

randwrite: (groupid=0, jobs=8): err= 0: pid=3152: Tue May 21 20:05:36 2024
  write: IOPS=0, BW=31.4KiB/s (32.1kB/s)(8256KiB/263124msec); 0 zone resets
Run status group 0 (all jobs):
  WRITE: bw=31.4KiB/s (32.1kB/s), 31.4KiB/s-31.4KiB/s (32.1kB/s-32.1kB/s), io=8256KiB (8454kB), run=263124-263124msec

Test 4 (On Windows 10 VM)

randwrite: (groupid=0, jobs=8): err= 0: pid=4828: Wed May 22 05:17:00 2024
  write: IOPS=4540, BW=284MiB/s (298MB/s)(17.9GiB/64485msec); 0 zone resets
Run status group 0 (all jobs):
  WRITE: bw=284MiB/s (298MB/s), 284MiB/s-284MiB/s (298MB/s-298MB/s), io=17.9GiB (19.2GB), run=64485-64485msec

My conclusion

On the baremetal hypervisor, I am seeing a performance boost, which, after talking with Jim in this post, I am not surprised about.

What shocked me was how much better my Windows 10 VM performed compared to my Server VM. Wow! That server is running my extremely slow SQL Server, and this is likely the cause. :crossed_fingers:t2:

Here are the results in a table format:

Host                 IOPS   Write Speed
Proxmox Baremetal    4419   290MB/s
KVM Baremetal        7098   465MB/s
Windows Server VM    0      32.1kB/s
Windows 10 VM        4540   298MB/s
2 Likes

IIRC, MSSQL disables write-behind caching in the same way Active Directory domain controller roles do, which would account for some of the abysmal performance there. You don’t see much disparity when testing storage performance of a brand-new Windows Server vs a brand-new Windows 10, IME.

edit: if you still have copies of the full reports from fio, would you mind including the latency bits for the two hosts? Specifically, this bit here:

  write: IOPS=881, BW=55.1MiB/s (57.8MB/s)(21.8GiB/404430msec); 0 zone resets
    slat (usec): min=21, max=383528, avg=83.33, stdev=1079.09
    clat (usec): min=189, max=1182.3k, avg=47716.32, stdev=73455.95
     lat (usec): min=216, max=1182.3k, avg=47800.03, stdev=73478.44
    clat percentiles (usec):
     |  1.00th=[    343],  5.00th=[    367], 10.00th=[    383],
     | 20.00th=[   9896], 30.00th=[  18220], 40.00th=[  26608],
     | 50.00th=[  35390], 60.00th=[  44827], 70.00th=[  56361],
     | 80.00th=[  71828], 90.00th=[  95945], 95.00th=[ 121111],
     | 99.00th=[ 191890], 99.50th=[ 792724], 99.90th=[ 960496],
     | 99.95th=[ 994051], 99.99th=[1027605]

The Windows Server I have has its database files on a pair of SSDs in a mirror. The system itself is slow, and SQL performance is bad since the SQL service is running on C:\

May I ask what your results were run on? Vdev setup, etc.

Most likely that stems from the 8k zvol. Would you agree?

I re-ran the tests to get the latency

Proxmox Server (vmhost01)

vmhost01:/zpool1# fio --name=randwrite --ioengine=posixaio --rw=randwrite --bs=64k --numjobs=8 --size=4g --iodepth=8 --runtime=60 --time_based --end_fsync=1 --group_reporting
randwrite: (g=0): rw=randwrite, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB, (T) 64.0KiB-64.0KiB, ioengine=posixaio, iodepth=8
...
fio-3.25
Starting 8 processes
randwrite: Laying out IO file (1 file / 4096MiB)
randwrite: Laying out IO file (1 file / 4096MiB)
randwrite: Laying out IO file (1 file / 4096MiB)
randwrite: Laying out IO file (1 file / 4096MiB)
randwrite: Laying out IO file (1 file / 4096MiB)
randwrite: Laying out IO file (1 file / 4096MiB)
randwrite: Laying out IO file (1 file / 4096MiB)
randwrite: Laying out IO file (1 file / 4096MiB)
Jobs: 7 (f=7): [F(5),_(1),F(2)][100.0%][eta 00m:00s]               
randwrite: (groupid=0, jobs=8): err= 0: pid=3519160: Wed May 22 11:13:15 2024
  write: IOPS=2210, BW=138MiB/s (145MB/s)(10.2GiB/75737msec); 0 zone resets
    slat (nsec): min=1113, max=803710, avg=4502.69, stdev=4041.99
    clat (usec): min=124, max=181643, avg=22942.51, stdev=10423.80
     lat (usec): min=135, max=181649, avg=22947.01, stdev=10424.03
    clat percentiles (usec):
     |  1.00th=[   693],  5.00th=[   832], 10.00th=[  1106], 20.00th=[ 19792],
     | 30.00th=[ 21365], 40.00th=[ 22938], 50.00th=[ 24511], 60.00th=[ 26346],
     | 70.00th=[ 27395], 80.00th=[ 29492], 90.00th=[ 32637], 95.00th=[ 35914],
     | 99.00th=[ 45876], 99.50th=[ 47973], 99.90th=[ 68682], 99.95th=[ 86508],
     | 99.99th=[158335]
   bw (  KiB/s): min=83328, max=1814016, per=100.00%, avg=178639.07, stdev=22875.82, samples=959
   iops        : min= 1302, max=28344, avg=2790.72, stdev=357.44, samples=959
  lat (usec)   : 250=0.01%, 500=0.05%, 750=2.27%, 1000=6.90%
  lat (msec)   : 2=2.04%, 4=0.78%, 10=1.86%, 20=6.81%, 50=78.99%
  lat (msec)   : 100=0.25%, 250=0.04%
  cpu          : usr=0.65%, sys=0.24%, ctx=167996, majf=0, minf=764
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=99.9%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,167434,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8

Run status group 0 (all jobs):
  WRITE: bw=138MiB/s (145MB/s), 138MiB/s-138MiB/s (145MB/s-145MB/s), io=10.2GiB (10.0GB), run=75737-75737msec

Ubuntu KVM Server (vmhost02)

vmhost02:/zpool1$ sudo fio --name=randwrite --ioengine=posixaio --rw=randwrite --bs=64k --numjobs=8 --size=4g --iodepth=8 --runtime=60 --time_based --end_fsync=1 --group_reporting 
randwrite: (g=0): rw=randwrite, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB, (T) 64.0KiB-64.0KiB, ioengine=posixaio, iodepth=8
...
fio-3.36
Starting 8 processes
randwrite: Laying out IO file (1 file / 4096MiB)
randwrite: Laying out IO file (1 file / 4096MiB)
randwrite: Laying out IO file (1 file / 4096MiB)
randwrite: Laying out IO file (1 file / 4096MiB)
randwrite: Laying out IO file (1 file / 4096MiB)
randwrite: Laying out IO file (1 file / 4096MiB)
randwrite: Laying out IO file (1 file / 4096MiB)
randwrite: Laying out IO file (1 file / 4096MiB)
Jobs: 8 (f=8): [F(8)][100.0%][eta 00m:00s]                         
randwrite: (groupid=0, jobs=8): err= 0: pid=51968: Wed May 22 16:15:42 2024
  write: IOPS=7525, BW=470MiB/s (493MB/s)(30.3GiB/66049msec); 0 zone resets
    slat (nsec): min=810, max=12480k, avg=3605.62, stdev=18952.40
    clat (usec): min=135, max=121471, avg=7718.77, stdev=3432.35
     lat (usec): min=169, max=121475, avg=7722.38, stdev=3432.26
    clat percentiles (usec):
     |  1.00th=[  586],  5.00th=[ 3523], 10.00th=[ 5342], 20.00th=[ 6194],
     | 30.00th=[ 6587], 40.00th=[ 7046], 50.00th=[ 7373], 60.00th=[ 7767],
     | 70.00th=[ 8225], 80.00th=[ 8848], 90.00th=[10945], 95.00th=[12911],
     | 99.00th=[15795], 99.50th=[19530], 99.90th=[44303], 99.95th=[57934],
     | 99.99th=[93848]
   bw (  KiB/s): min=252288, max=2024960, per=100.00%, avg=532165.82, stdev=22605.73, samples=952
   iops        : min= 3942, max=31640, avg=8314.99, stdev=353.22, samples=952
  lat (usec)   : 250=0.01%, 500=0.44%, 750=1.46%, 1000=0.56%
  lat (msec)   : 2=1.13%, 4=2.00%, 10=81.24%, 20=12.70%, 50=0.39%
  lat (msec)   : 100=0.07%, 250=0.01%
  cpu          : usr=1.73%, sys=0.81%, ctx=505354, majf=0, minf=1780
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=99.8%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,497075,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8

Run status group 0 (all jobs):
  WRITE: bw=470MiB/s (493MB/s), 470MiB/s-470MiB/s (493MB/s-493MB/s), io=30.3GiB (32.6GB), run=66049-66049msec
1 Like

Thanks for re-running those tests; this confirms that performance is better ALL the way around (not merely improving throughput at the expense of latency) with the 64K dataset than the 8K zvol.

The Ars article I linked earlier in the thread used a server with eight 12TB Ironwolf drives; the random fio snippet I showed you when I asked for the latency stats is of unknown provenance (I found it lying around on a server I knew still had litter from various tuning runs several years ago).

Re-running those tests is the least I can do. Thank you for all your help. I am going to be moving my databases over this week to see if my users can notice a difference. Considering I was getting 32kB/s on my current prod, I think this may be the solution I needed. Fingers crossed

1 Like

Let me know how it goes, will you?

1 Like