Understanding ZFS fio benchmarks

Instead of rebuilding the VM, can’t you use “qemu-img convert” with a new “cluster_size” to make a copy of the qcow2 in the cluster size you want?

NOTE: The following changes the value of “cluster_size” from the default of 64 KiB to 128 KiB to match the ZFS dataset recordsize. Please be warned that my command does use “preallocation=off” (Empty Sector Suppression), but you can probably omit that.

sudo bash -c 'qemu-img convert -f qcow2 /mnt/data1/guest_images/jammy_00.qcow2_-_2022-07-03_0945 -o preallocation=off,cluster_size=128K -O qcow2 /mnt/data1/guest_images/jammy_00.qcow2'
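If you want to verify the result afterward, "qemu-img info" reports the cluster size and "zfs get" the recordsize it is meant to match. The dataset name below is only an assumed example; substitute your own:

# prints "cluster_size: 131072" (128 KiB) for the converted image
qemu-img info /mnt/data1/guest_images/jammy_00.qcow2

# shows the recordsize of the dataset holding the images (dataset name assumed)
zfs get recordsize data1/guest_images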

Yep. I’d consider this one form of “rebuilding the VM,” since it must grovel over it and make a brute force copy block-for-block, though. :slight_smile:

I'm still wondering why people declare hardware RAID dead when, for example, millions upon millions of Windows users don't entrust their data to ZFS at all. Has your phone ever bit-rotted a number so that you couldn't call somebody? I like ZFS self-healing and how quickly it builds thousands of snapshots, but in real-life performance it's still behind conventional technology. Anyway, here are a few comparisons of the (unrealistic) fio workloads used in this thread against hardware RAID.
2x Intel Xeon Gold 6246R, 192 GB RAM, MegaRAID 9580-8i8e (mdadm is slow), 26x 18 TB drives in a space-efficient RAID6 (+ spare) with "tuned" XFS, run directly on the server:

fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
Run status group 0 (all jobs):
  WRITE: bw=439MiB/s (460MB/s), 439MiB/s-439MiB/s (460MB/s-460MB/s), io=26.3GiB (28.2GB), run=61255-61255msec

fio --name=random-read --ioengine=posixaio --rw=randread --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
Run status group 0 (all jobs):
   READ: bw=427MiB/s (447MB/s), 427MiB/s-427MiB/s (447MB/s-447MB/s), io=25.0GiB (26.8GB), run=60001-60001msec

fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=8 --size=4g --iodepth=8 --runtime=60 --time_based --end_fsync=1 --group_reporting
Run status group 0 (all jobs):
  WRITE: bw=2031MiB/s (2129MB/s), 2031MiB/s-2031MiB/s (2129MB/s-2129MB/s), io=132GiB (142GB), run=66600-66600msec

fio --name=randwrite --ioengine=posixaio --rw=randwrite --bs=64k --numjobs=8 --size=4g --iodepth=8 --runtime=60 --time_based --end_fsync=1 --group_reporting
randwrite: (groupid=0, jobs=8): err= 0: pid=123719: Wed Jul 17 20:34:47 2024
  write: IOPS=23.3k, BW=1456MiB/s (1527MB/s)(96.0GiB/67519msec); 0 zone resets
Run status group 0 (all jobs):
  WRITE: bw=1456MiB/s (1527MB/s), 1456MiB/s-1456MiB/s (1527MB/s-1527MB/s), io=96.0GiB (103GB), run=67519-67519msec
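If someone wants to rerun this against a ZFS dataset with a 128 KiB recordsize, a variant like the following (the target directory is a placeholder, not from my setup) at least matches the fio block size to the recordsize:

# same workload, but bs matched to a 128 KiB recordsize; /tank/fiotest is a placeholder
fio --name=randwrite-128k --directory=/tank/fiotest --ioengine=posixaio --rw=randwrite --bs=128k --numjobs=8 --size=4g --iodepth=8 --runtime=60 --time_based --end_fsync=1 --group_reporting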

Just for fun, the same run from an NFSv4 client, mounted from the server over IPoIB on 100 Gb InfiniBand:

fio --name=randwrite --ioengine=posixaio --rw=randwrite --bs=64k --numjobs=8 --size=4g --iodepth=8 --runtime=60 --time_based --end_fsync=1 --group_reporting
randwrite: (groupid=0, jobs=8): err= 0: pid=29512: Wed Jul 17 20:38:26 2024
  write: IOPS=81.8k, BW=5112MiB/s (5360MB/s)(354GiB/70984msec); 0 zone resets
Run status group 0 (all jobs):
  WRITE: bw=5112MiB/s (5360MB/s), 5112MiB/s-5112MiB/s (5360MB/s-5360MB/s), io=354GiB (380GB), run=70984-70984msec
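For context, the client-side mount was along these lines; the server name and export path here are placeholders, not the real ones:

# NFSv4 mount over IPoIB; hostname and export path are placeholders
mount -t nfs4 -o vers=4.2,proto=tcp server-ib:/export/data /mnt/data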

As the NFS numbers already suggest, our daily rsync from one server to the other over an NFS mount (often more than one rsync in parallel) is consistently faster at syncing a day's worth of changed engineering data (e.g. 500 GB/day) than zfs send/recv, and mirroring running VM images that way is even less common.
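As a sketch of the two approaches being compared (host, pool, path, and snapshot names are all placeholders):

# daily rsync of the changed data over the NFS mount; several can run in parallel
rsync -aH --delete /mnt/data/engineering/ /srv/backup/engineering/

# the zfs send/recv equivalent: incremental replication between two snapshots
zfs send -i tank/data@yesterday tank/data@today | ssh backuphost zfs recv -F tank/data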
So, zfs has it’s usage group and I think the rest without zfs with other requirements still quiet also.