Background
While setting up a new machine, I did some initial benchmarking on NVMe drives and noticed a huge degradation in 4K random reads as soon as ZFS is involved. I am interested in getting as many IOPS as possible, including on non-cached (meta)data / files.
I went all the way down, bypassing as many ZFS features as possible, just to see raw IOPS performance. No tuning helped. Finally, I ran benchmarks against a single-vdev pool sitting on a RAM disk, and this is where it gets interesting.
System Specs:
- Debian (Proxmox Kernel 6.14.11-3-pve, zfs-2.3.4-pve1)
- 128 GB ECC RAM
- AMD EPYC 7003-series CPU
- Machine basically idle, bare metal, no guests, fresh install
To Reproduce
(1) Create a RAM disk and benchmark 4K random reads
mkdir /media/ramdisk
mount -t ramfs ramfs /media/ramdisk
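Side note: ramfs has no size cap; a size-limited tmpfs should behave the same for this test (untested alternative, size chosen arbitrarily):
mount -t tmpfs -o size=12G tmpfs /media/ramdisk   # 12G is an arbitrary cap, enough for testfile + zfsfile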
fio --ioengine=libaio --name=io --rw=randread --bs=4k --iodepth=64 --ramp_time=2 --runtime=20 --filename=/media/ramdisk/testfile --size=512M --time_based=1 --sync=0 --direct=0 --numjobs=1 --buffered=1
io: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.39
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=1380MiB/s][r=353k IOPS][eta 00m:00s]
io: (groupid=0, jobs=1): err= 0: pid=13610: Fri Oct 10 11:57:48 2025
read: IOPS=352k, BW=1374MiB/s (1441MB/s)(26.8GiB/20001msec)
slat (nsec): min=1380, max=1106.4k, avg=1739.30, stdev=694.54
clat (usec): min=2, max=1520, avg=179.96, stdev= 7.76
lat (usec): min=4, max=1525, avg=181.70, stdev= 7.81
clat percentiles (usec):
| 1.00th=[ 174], 5.00th=[ 176], 10.00th=[ 176], 20.00th=[ 178],
| 30.00th=[ 178], 40.00th=[ 180], 50.00th=[ 180], 60.00th=[ 182],
| 70.00th=[ 182], 80.00th=[ 182], 90.00th=[ 184], 95.00th=[ 186],
| 99.00th=[ 190], 99.50th=[ 202], 99.90th=[ 281], 99.95th=[ 285],
| 99.99th=[ 338]
bw ( MiB/s): min= 1346, max= 1386, per=100.00%, avg=1374.46, stdev= 8.20, samples=40
iops : min=344794, max=355034, avg=351861.38, stdev=2099.21, samples=40
lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
lat (usec) : 250=99.86%, 500=0.14%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%
cpu : usr=35.40%, sys=64.56%, ctx=181, majf=0, minf=37
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=7035380,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: bw=1374MiB/s (1441MB/s), 1374MiB/s-1374MiB/s (1441MB/s-1441MB/s), io=26.8GiB (28.8GB), run=20001-20001msec
RESULT: fio maxes out one CPU core, delivering ~353k IOPS.
(2) Create a ZFS pool on the RAM disk
truncate -s 10G /media/ramdisk/zfsfile
zpool create -o ashift=12 -f rampool /media/ramdisk/zfsfile
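Optional sanity check to confirm the single-vdev layout and that the ashift was applied:
zpool status rampool
zpool get ashift rampool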
(3) Create a dataset and set some testing-only (!) properties
dataset=testdataset
zfs create rampool/$dataset
zfs set sync=disabled rampool/$dataset
zfs set atime=off rampool/$dataset
zfs set compression=off rampool/$dataset
zfs set recordsize=4k rampool/$dataset
zfs set relatime=off rampool/$dataset
zfs set primarycache=none rampool/$dataset
zfs set secondarycache=none rampool/$dataset
zfs set prefetch=none rampool/$dataset
zfs set direct=always rampool/$dataset
zfs set xattr=off rampool/$dataset
umount /rampool/$dataset
zfs mount rampool/$dataset
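Optional sanity check to verify that all properties are in effect after the remount:
zfs get sync,atime,relatime,compression,recordsize,primarycache,secondarycache,prefetch,direct,xattr rampool/$dataset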
(4) Benchmark ZFS
fio --ioengine=libaio --name=io --rw=randread --bs=4k --iodepth=64 --ramp_time=2 --runtime=20 --filename=/rampool/testdataset/testfile --size=512M --time_based=1 --sync=0 --direct=1 --numjobs=1 --buffered=0
io: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.39
Starting 1 process
io: Laying out IO file (1 file / 512MiB)
Jobs: 1 (f=1): [r(1)][100.0%][r=32.6MiB/s][r=8354 IOPS][eta 00m:00s]
io: (groupid=0, jobs=1): err= 0: pid=16610: Fri Oct 10 12:05:45 2025
read: IOPS=7967, BW=31.1MiB/s (32.6MB/s)(623MiB/20001msec)
slat (usec): min=4, max=35962, avg=123.87, stdev=321.42
clat (usec): min=2, max=50697, avg=7906.08, stdev=3495.07
lat (usec): min=125, max=50891, avg=8029.93, stdev=3529.03
clat percentiles (usec):
| 1.00th=[ 4817], 5.00th=[ 5080], 10.00th=[ 5342], 20.00th=[ 7046],
| 30.00th=[ 7439], 40.00th=[ 7504], 50.00th=[ 7570], 60.00th=[ 7832],
| 70.00th=[ 8455], 80.00th=[ 8586], 90.00th=[ 8586], 95.00th=[ 8979],
| 99.00th=[31327], 99.50th=[37487], 99.90th=[45351], 99.95th=[46924],
| 99.99th=[49021]
bw ( KiB/s): min=25704, max=48072, per=99.97%, avg=31874.43, stdev=6049.68, samples=40
iops : min= 6426, max=12018, avg=7968.58, stdev=1512.39, samples=40
lat (usec) : 4=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=95.93%, 20=2.87%, 50=1.21%
lat (msec) : 100=0.01%
cpu : usr=1.50%, sys=29.70%, ctx=442497, majf=0, minf=37
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=159354,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: bw=31.1MiB/s (32.6MB/s), 31.1MiB/s-31.1MiB/s (32.6MB/s-32.6MB/s), io=623MiB (653MB), run=20001-20001msec
RESULT: ~8k IOPS, and no process comes anywhere close to maxing out a CPU core. That is roughly 40 times less!
Observations while the ZFS benchmark was running:
zpool iostat rampool -qvy 1 30
capacity operations bandwidth syncq_read syncq_write asyncq_read asyncq_write scrubq_read trimq_write rebuildq_write
pool alloc free read write read write pend activ pend activ pend activ pend activ pend activ pend activ pend activ
------------------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
rampool 2.01G 7.49G 23.6K 0 629M 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
/media/ramdisk/zfsfile 2.01G 7.49G 23.6K 0 629M 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
------------------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
zpool iostat rampool -qvl 3 30
capacity operations bandwidth total_wait disk_wait syncq_wait asyncq_wait scrub trim rebuild syncq_read syncq_write asyncq_read asyncq_write scrubq_read trimq_write rebuildq_write
pool alloc free read write read write read write read write read write read write wait wait wait pend activ pend activ pend activ pend activ pend activ pend activ pend activ
------------------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
rampool 2.01G 7.49G 19.5K 0 521M 0 19us - 19us - 642ns - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0
/media/ramdisk/zfsfile 2.01G 7.49G 19.5K 0 521M 0 19us - 19us - 642ns - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0
------------------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
zpool iostat rampool -ry 3 30
rampool sync_read sync_write async_read async_write scrub trim rebuild
req_size ind agg ind agg ind agg ind agg ind agg ind agg ind agg
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
512 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1K 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2K 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4K 6.85K 0 0 0 0 0 0 0 0 0 0 0 0 0
8K 0 0 0 0 0 0 0 0 0 0 0 0 0 0
16K 6.85K 0 0 0 0 0 0 0 0 0 0 0 0 0
32K 6.85K 0 0 0 0 0 0 0 0 0 0 0 0 0
64K 0 0 0 0 0 0 0 0 0 0 0 0 0 0
128K 0 0 0 0 0 0 0 0 0 0 0 0 0 0
256K 0 0 0 0 0 0 0 0 0 0 0 0 0 0
512K 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1M 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2M 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4M 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8M 0 0 0 0 0 0 0 0 0 0 0 0 0 0
16M 0 0 0 0 0 0 0 0 0 0 0 0 0 0
iotop
Total DISK READ: 472.23 M/s | Total DISK WRITE: 56.22 K/s
Current DISK READ: 0.00 B/s | Current DISK WRITE: 3.27 M/s
TID PRIO USER DISK READ> DISK WRITE COMMAND
26468 be/4 root 472.23 M/s 0.00 B/s fio --ioengine=libaio --name=io --rw=randread --bs=4k --iodepth=1 --ramp_time=2 --runtime=20 --filename=/rampool/testdataset/testfile --size=4k --time_based=1 --sync=0 --direct=0 --numjobs=1 --buffered=1
1 be/4 root 0.00 B/s 0.00 B/s init
2 be/4 root 0.00 B/s 0.00 B/s [kthreadd]
3 be/4 root 0.00 B/s 0.00 B/s [pool_workqueue_release]
4 be/0 root 0.00 B/s 0.00 B/s [kworker/R-rcu_gp]
5 be/0 root 0.00 B/s 0.00 B/s [kworker/R-sync_wq]
6 be/0 root 0.00 B/s 0.00 B/s [kworker/R-kvfree_rcu_reclaim]
7 be/0 root 0.00 B/s 0.00 B/s [kworker/R-slub_flushwq]
8 be/0 root 0.00 B/s 0.00 B/s [kworker/R-netns]
...
Some additional notes:
- The results translate when going to NVMe instead of a RAM disk: ~200k IOPS raw/native vs. ~3k on ZFS. Even worse: the same on striped mirrors – no way to get past 3k. (A sketch of that pool layout follows these notes.)
- As seen in both zpool iostat and iotop, the actual read bandwidth is roughly an order of magnitude higher (~500 MB/s) than the IOPS / bandwidth fio reports (~31 MB/s). What is going on here?
- Please note the dataset properties set for testing – especially the cache and recordsize settings.
- Setting primarycache=metadata delivers ~30k IOPS, but that is outside the scope of this test; it is listed just for comparison.
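For reference, "striped mirrors" above means a pool layout roughly along these lines (device names are placeholders, not the exact ones used):
zpool create -o ashift=12 nvmepool mirror /dev/nvme0n1 /dev/nvme1n1 mirror /dev/nvme2n1 /dev/nvme3n1   # placeholder device names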
Question:
Where could that HUGE overhead come from? Is fio / my benchmark command wrong? That seems unlikely, since it delivers the expected results on the raw RAM disk without ZFS involved.
Is there anything that can be done, and how can I diagnose this further?
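I am happy to collect more data if it helps – for example a system-wide CPU profile while the ZFS fio job is running, roughly along these lines (assuming perf is available on the box):
perf record -F 99 -a -g -- sleep 10   # sample all CPUs with call graphs for 10s during the benchmark
perf report --stdio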