Synthetic Single Thread Random Read (4K) Problem

Background

While setting up a new machine, I did some initial benchmarking on NVMe drives and noticed a huge degradation in 4K random reads as soon as ZFS is involved. I am interested in getting as many IOPS as possible, including on non-cached (meta)data and files.

I went all the way down, bypassing as many ZFS goodies as possible, just to see raw IOPS performance. No tuning helped. Finally, I ran benchmarks against a single-vdev pool sitting on a RAM disk. This is where it gets interesting.

System Specs:

  • Debian (Proxmox Kernel 6.14.11-3-pve, zfs-2.3.4-pve1)
  • 128G RAM
  • Epyc 7003 processor
  • ECC
  • Machine basically idle, bare metal, no guests, fresh install

To Reproduce

(1) Create a RAM disk and benchmark 4K random reads

mkdir /media/ramdisk

mount -t ramfs ramfs /media/ramdisk

fio --ioengine=libaio --name=io --rw=randread --bs=4k --iodepth=64 --ramp_time=2 --runtime=20 --filename=/media/ramdisk/testfile --size=512M  --time_based=1 --sync=0 --direct=0 --numjobs=1 --buffered=1

io: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.39
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=1380MiB/s][r=353k IOPS][eta 00m:00s]
io: (groupid=0, jobs=1): err= 0: pid=13610: Fri Oct 10 11:57:48 2025
  read: IOPS=352k, BW=1374MiB/s (1441MB/s)(26.8GiB/20001msec)
    slat (nsec): min=1380, max=1106.4k, avg=1739.30, stdev=694.54
    clat (usec): min=2, max=1520, avg=179.96, stdev= 7.76
     lat (usec): min=4, max=1525, avg=181.70, stdev= 7.81
    clat percentiles (usec):
     |  1.00th=[  174],  5.00th=[  176], 10.00th=[  176], 20.00th=[  178],
     | 30.00th=[  178], 40.00th=[  180], 50.00th=[  180], 60.00th=[  182],
     | 70.00th=[  182], 80.00th=[  182], 90.00th=[  184], 95.00th=[  186],
     | 99.00th=[  190], 99.50th=[  202], 99.90th=[  281], 99.95th=[  285],
     | 99.99th=[  338]
   bw (  MiB/s): min= 1346, max= 1386, per=100.00%, avg=1374.46, stdev= 8.20, samples=40
   iops        : min=344794, max=355034, avg=351861.38, stdev=2099.21, samples=40
  lat (usec)   : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
  lat (usec)   : 250=99.86%, 500=0.14%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%
  cpu          : usr=35.40%, sys=64.56%, ctx=181, majf=0, minf=37
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=7035380,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=1374MiB/s (1441MB/s), 1374MiB/s-1374MiB/s (1441MB/s-1441MB/s), io=26.8GiB (28.8GB), run=20001-20001msec

RESULT: fio maxes out one CPU core, delivering 353k IOPS.
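As a sanity check on these numbers, Little's law (average latency ≈ queue depth / IOPS) lines up almost exactly with the completion latency fio reports:

```python
# Little's law sanity check, using the numbers from the fio output above.
iodepth = 64          # --iodepth=64
iops = 352_000        # reported read IOPS on the raw ramfs file
avg_lat_us = iodepth / iops * 1e6
print(round(avg_lat_us, 1))  # 181.8 us -- fio reports avg lat of ~181.7 us
```

So at this queue depth the pipeline is fully saturated and the reported latency is consistent with the IOPS figure.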

(2) Create a ZFS pool on the RAM disk

truncate -s 10G /media/ramdisk/zfsfile

zpool create -o ashift=12 -f rampool /media/ramdisk/zfsfile

(3) Create a dataset and set some testing-only (!) properties

dataset=testdataset

zfs create rampool/$dataset

zfs set sync=disabled rampool/$dataset
zfs set atime=off rampool/$dataset
zfs set compression=off rampool/$dataset
zfs set recordsize=4k rampool/$dataset
zfs set relatime=off rampool/$dataset
zfs set primarycache=none rampool/$dataset
zfs set secondarycache=none rampool/$dataset
zfs set prefetch=none rampool/$dataset
zfs set direct=always rampool/$dataset
zfs set xattr=off rampool/$dataset

umount /rampool/$dataset
zfs mount rampool/$dataset

(4) Benchmark ZFS

fio --ioengine=libaio --name=io --rw=randread --bs=4k --iodepth=64 --ramp_time=2 --runtime=20 --filename=/rampool/testdataset/testfile --size=512M  --time_based=1 --sync=0 --direct=1 --numjobs=1 --buffered=0

io: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.39
Starting 1 process
io: Laying out IO file (1 file / 512MiB)
Jobs: 1 (f=1): [r(1)][100.0%][r=32.6MiB/s][r=8354 IOPS][eta 00m:00s]
io: (groupid=0, jobs=1): err= 0: pid=16610: Fri Oct 10 12:05:45 2025
  read: IOPS=7967, BW=31.1MiB/s (32.6MB/s)(623MiB/20001msec)
    slat (usec): min=4, max=35962, avg=123.87, stdev=321.42
    clat (usec): min=2, max=50697, avg=7906.08, stdev=3495.07
     lat (usec): min=125, max=50891, avg=8029.93, stdev=3529.03
    clat percentiles (usec):
     |  1.00th=[ 4817],  5.00th=[ 5080], 10.00th=[ 5342], 20.00th=[ 7046],
     | 30.00th=[ 7439], 40.00th=[ 7504], 50.00th=[ 7570], 60.00th=[ 7832],
     | 70.00th=[ 8455], 80.00th=[ 8586], 90.00th=[ 8586], 95.00th=[ 8979],
     | 99.00th=[31327], 99.50th=[37487], 99.90th=[45351], 99.95th=[46924],
     | 99.99th=[49021]
   bw (  KiB/s): min=25704, max=48072, per=99.97%, avg=31874.43, stdev=6049.68, samples=40
   iops        : min= 6426, max=12018, avg=7968.58, stdev=1512.39, samples=40
  lat (usec)   : 4=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=95.93%, 20=2.87%, 50=1.21%
  lat (msec)   : 100=0.01%
  cpu          : usr=1.50%, sys=29.70%, ctx=442497, majf=0, minf=37
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=159354,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=31.1MiB/s (32.6MB/s), 31.1MiB/s-31.1MiB/s (32.6MB/s-32.6MB/s), io=623MiB (653MB), run=20001-20001msec

RESULT: 8k IOPS, and no process comes close to maxing out a CPU core. That’s roughly 40 times fewer!
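For the record, the exact slowdown factor between the two runs:

```python
# Slowdown factor between the raw ramfs run and the ZFS-on-ramfs run,
# using the IOPS figures reported by fio above.
ramfs_iops = 352_000
zfs_iops = 7_967
print(round(ramfs_iops / zfs_iops, 1))  # 44.2 -> ~44x fewer IOPS through ZFS
```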

Observations while the ZFS benchmark was running:

zpool iostat rampool -qvy 1 30

                            capacity     operations     bandwidth    syncq_read    syncq_write   asyncq_read  asyncq_write   scrubq_read   trimq_write  rebuildq_write
pool                      alloc   free   read  write   read  write   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ
------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
rampool                   2.01G  7.49G  23.6K      0   629M      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
  /media/ramdisk/zfsfile  2.01G  7.49G  23.6K      0   629M      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----


zpool iostat rampool -qvl 3 30

                            capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim  rebuild   syncq_read    syncq_write   asyncq_read  asyncq_write   scrubq_read   trimq_write  rebuildq_write
pool                      alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait   wait   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ
------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
rampool                   2.01G  7.49G  19.5K      0   521M      0   19us      -   19us      -  642ns      -      -      -      -      -      -      0      0      0      0      0      0      0      0      0      0      0      0      0      0
  /media/ramdisk/zfsfile  2.01G  7.49G  19.5K      0   521M      0   19us      -   19us      -  642ns      -      -      -      -      -      -      0      0      0      0      0      0      0      0      0      0      0      0      0      0
------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----


zpool iostat rampool -ry 3 30

rampool       sync_read    sync_write    async_read    async_write      scrub         trim         rebuild
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0      0      0      0      0      0      0      0      0
1K              0      0      0      0      0      0      0      0      0      0      0      0      0      0
2K              0      0      0      0      0      0      0      0      0      0      0      0      0      0
4K          6.85K      0      0      0      0      0      0      0      0      0      0      0      0      0
8K              0      0      0      0      0      0      0      0      0      0      0      0      0      0
16K         6.85K      0      0      0      0      0      0      0      0      0      0      0      0      0
32K         6.85K      0      0      0      0      0      0      0      0      0      0      0      0      0
64K             0      0      0      0      0      0      0      0      0      0      0      0      0      0
128K            0      0      0      0      0      0      0      0      0      0      0      0      0      0
256K            0      0      0      0      0      0      0      0      0      0      0      0      0      0
512K            0      0      0      0      0      0      0      0      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0      0      0      0      0


IOTOP

Total DISK READ:       472.23 M/s | Total DISK WRITE:        56.22 K/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:       3.27 M/s
    TID  PRIO  USER    DISK READ>  DISK WRITE    COMMAND
  26468 be/4 root      472.23 M/s    0.00 B/s fio --ioengine=libaio --name=io --rw=randread --bs=4k --iodepth=1 --ramp_time=2 --runtime=20 --filename=/rampool/testdataset/testfile --size=4k --time_based=1 --sync=0 --direct=0 --numjobs=1 --buffered=1
      1 be/4 root        0.00 B/s    0.00 B/s init
      2 be/4 root        0.00 B/s    0.00 B/s [kthreadd]
      3 be/4 root        0.00 B/s    0.00 B/s [pool_workqueue_release]
      4 be/0 root        0.00 B/s    0.00 B/s [kworker/R-rcu_gp]
      5 be/0 root        0.00 B/s    0.00 B/s [kworker/R-sync_wq]
      6 be/0 root        0.00 B/s    0.00 B/s [kworker/R-kvfree_rcu_reclaim]
      7 be/0 root        0.00 B/s    0.00 B/s [kworker/R-slub_flushwq]
      8 be/0 root        0.00 B/s    0.00 B/s [kworker/R-netns]
      ...

Some additional notes:

  • The results carry over when going to NVMe instead of a RAM disk: 200k IOPS raw/native, 3k on ZFS. Even worse: the same on striped mirrors – no way to get past 3k.

  • As seen in both zpool iostat and iotop, the actual read bandwidth (~500 M/s) is an order of magnitude higher than the IOPS / bandwidth fio reports. What’s going on here?

  • Please note the dataset properties set for testing – especially the cache settings and recordsize.

  • Setting primarycache=metadata delivers ~30k IOPS, but that is outside the scope of this test – included just for comparison.

Question:

Where could that HUGE overhead come from? Is fio / the query wrong? That seems unlikely, since it delivers the expected results on the raw (RAM) volume without ZFS involved.

Is there anything that can be done, or any way to diagnose this further?

Admittedly, I am out of my element with this, but I will take a shot.

Comparing your fio runs, the CPU’s “ctx” values are very different. I think that is context switching.
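Back-of-envelope, using the ctx and total issued-read counts copied from the two fio outputs above:

```python
# Context switches per completed read IO, from the two fio runs above.
ramfs_ctx, ramfs_ios = 181, 7_035_380        # ctx=181, issued rwts total=7035380
zfs_ctx, zfs_ios = 442_497, 159_354          # ctx=442497, issued rwts total=159354
print(round(zfs_ctx / zfs_ios, 2))           # 2.78 context switches per IO on ZFS
print(ramfs_ctx / ramfs_ios < 0.001)         # True: effectively zero on raw ramfs
```

Nearly three context switches per 4K read versus essentially none would be consistent with the ZFS run being dominated by scheduling rather than data movement, if I am reading the numbers right.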

Reading the following issue, it sounds like “primarycache=metadata” is effectively broken, and that ZFS requires “primarycache=all” for high performance:

Again, if I am on the wrong track, I apologize in advance.

I would push back on this a bit. primarycache=metadata is not a good idea at this particular scale because it necessarily introduces additional compute latency while ZFS figures out whether something is data or metadata, and therefore whether it should or shouldn’t be cached.

However, I would say that the vast majority of real-world ZFS use happens far from the kind of compute and bus saturation that leaves no room for additional logic.
