So I'm still having a go at this.
This is the fio output for one job, single disk with ext4:
# fio --name=randwrite4k --rw=randwrite --end_fsync=1 --bs=4k --ioengine=libaio --size=100G --numjobs=4 --iodepth=64
randwrite4k: (groupid=0, jobs=1): err= 0: pid=3873029: Wed Jul 12 18:39:47 2023
write: IOPS=97.5k, BW=381MiB/s (400MB/s)(100GiB/268734msec); 0 zone resets
slat (nsec): min=1791, max=51693k, avg=8753.03, stdev=309844.75
clat (nsec): min=1624, max=52071k, avg=627684.27, stdev=2456463.64
lat (usec): min=7, max=52074, avg=636.44, stdev=2475.79
clat percentiles (usec):
| 1.00th=[ 212], 5.00th=[ 223], 10.00th=[ 237], 20.00th=[ 260],
| 30.00th=[ 285], 40.00th=[ 314], 50.00th=[ 338], 60.00th=[ 359],
| 70.00th=[ 383], 80.00th=[ 420], 90.00th=[ 502], 95.00th=[ 570],
| 99.00th=[18744], 99.50th=[23200], 99.90th=[27395], 99.95th=[28443],
| 99.99th=[30802]
bw ( KiB/s): min=40144, max=1055376, per=26.20%, avg=402095.65, stdev=114228.86, samples=521
iops : min=10036, max=263844, avg=100523.89, stdev=28557.23, samples=521
lat (usec) : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
lat (usec) : 250=15.89%, 500=73.94%, 750=7.78%, 1000=0.54%
lat (msec) : 2=0.34%, 4=0.22%, 10=0.09%, 20=0.32%, 50=0.88%
lat (msec) : 100=0.01%
cpu : usr=11.28%, sys=42.93%, ctx=131628, majf=0, minf=11
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=0,26214400,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
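One thing to note about this first run: there's no --direct=1, so those 97.5k IOPS are going through the page cache (end_fsync=1 only forces a flush at the very end), while the runs below all use direct I/O. To see what the disk itself sustains with the same job shape, the direct variant would just be the same command plus the flag, something like this (job name changed only to keep the results apart):

# fio --name=randwrite4k_direct --rw=randwrite --end_fsync=1 --direct=1 --bs=4k --ioengine=libaio --size=100G --numjobs=4 --iodepth=64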
Compare that with this one:
root@paradise ~/test # fio --name=randwrite_t64qd256 --rw=randwrite --end_fsync=1 --direct=1 --bs=4k --ioengine=libaio --size=10G --numjobs=64 --iodepth=256 --time_based=1 --runtime=300
randwrite_t64qd256: (groupid=0, jobs=1): err= 0: pid=4048277: Fri Jul 14 11:02:18 2023
write: IOPS=4015, BW=15.7MiB/s (16.4MB/s)(4709MiB/300228msec); 0 zone resets
slat (usec): min=5, max=507423, avg=84.55, stdev=3296.02
clat (nsec): min=960, max=11622M, avg=63662684.84, stdev=181892728.83
lat (usec): min=26, max=11622k, avg=63747.24, stdev=181929.43
clat percentiles (usec):
| 1.00th=[ 60], 5.00th=[ 322], 10.00th=[ 750],
| 20.00th=[ 1401], 30.00th=[ 2114], 40.00th=[ 2769],
| 50.00th=[ 3326], 60.00th=[ 3785], 70.00th=[ 4621],
| 80.00th=[ 10421], 90.00th=[ 229639], 95.00th=[ 574620],
| 99.00th=[ 775947], 99.50th=[ 884999], 99.90th=[1300235],
| 99.95th=[1602225], 99.99th=[2432697]
bw ( KiB/s): min= 8, max=302632, per=1.63%, avg=16261.89, stdev=26034.18, samples=593
iops : min= 2, max=75658, avg=4065.41, stdev=6508.53, samples=593
lat (nsec) : 1000=0.01%
lat (usec) : 2=0.01%, 4=0.01%, 20=0.01%, 50=0.83%, 100=0.91%
lat (usec) : 250=2.28%, 500=3.13%, 750=2.85%, 1000=2.81%
lat (msec) : 2=15.64%, 4=35.64%, 10=15.57%, 20=4.70%, 50=2.95%
lat (msec) : 100=0.49%, 250=2.61%, 500=3.51%, 750=4.92%, 1000=0.91%
lat (msec) : 2000=0.22%, >=2000=0.02%
cpu : usr=0.72%, sys=6.46%, ctx=236633, majf=0, minf=12
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=0,1205585,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
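A note on reading these: with numjobs=64, fio prints one block per job, and what's pasted here is just one of them; that's why it shows per=1.63% of the group bandwidth. Adding --group_reporting to the same command collapses all 64 jobs into a single aggregate block, which makes runs easier to compare:

root@paradise ~/test # fio --name=randwrite_t64qd256 --rw=randwrite --end_fsync=1 --direct=1 --bs=4k --ioengine=libaio --size=10G --numjobs=64 --iodepth=256 --time_based=1 --runtime=300 --group_reporting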
Same command on mdadm raid10 ext4:
randwrite_t64qd256: (groupid=0, jobs=1): err= 0: pid=278610: Thu Jul 13 16:12:21 2023
write: IOPS=2679, BW=10.5MiB/s (11.0MB/s)(3142MiB/300229msec); 0 zone resets
slat (usec): min=8, max=1060.0k, avg=363.38, stdev=9893.57
clat (nsec): min=985, max=1147.2M, avg=95140049.19, stdev=153367241.85
lat (usec): min=29, max=1147.3k, avg=95503.43, stdev=153636.37
clat percentiles (msec):
| 1.00th=[ 8], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 13],
| 30.00th=[ 17], 40.00th=[ 21], 50.00th=[ 28], 60.00th=[ 37],
| 70.00th=[ 50], 80.00th=[ 77], 90.00th=[ 388], 95.00th=[ 435],
| 99.00th=[ 518], 99.50th=[ 625], 99.90th=[ 1003], 99.95th=[ 1083],
| 99.99th=[ 1133]
bw ( KiB/s): min= 8, max=78744, per=1.58%, avg=10889.99, stdev=7895.15, samples=590
iops : min= 2, max=19686, avg=2722.39, stdev=1973.80, samples=590
lat (nsec) : 1000=0.01%
lat (usec) : 2=0.01%, 20=0.01%, 50=0.01%, 100=0.01%, 250=0.02%
lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.02%, 4=0.05%, 10=11.08%, 20=27.50%, 50=31.55%
lat (msec) : 100=11.84%, 250=0.74%, 500=15.74%, 750=0.95%, 1000=0.36%
lat (msec) : 2000=0.10%
cpu : usr=0.56%, sys=7.71%, ctx=562759, majf=0, minf=13
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=0,804346,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
And finally, ZFS:
randwrite_t64qd256: (groupid=0, jobs=1): err= 0: pid=11728: Thu Jul 13 17:16:01 2023
write: IOPS=846, BW=3387KiB/s (3468kB/s)(994MiB/300380msec); 0 zone resets
slat (usec): min=9, max=250932, avg=1163.39, stdev=6194.68
clat (usec): min=5, max=770787, avg=300621.30, stdev=107685.65
lat (usec): min=790, max=771327, avg=301784.69, stdev=107629.21
clat percentiles (msec):
| 1.00th=[ 65], 5.00th=[ 121], 10.00th=[ 153], 20.00th=[ 203],
| 30.00th=[ 243], 40.00th=[ 279], 50.00th=[ 305], 60.00th=[ 334],
| 70.00th=[ 363], 80.00th=[ 397], 90.00th=[ 439], 95.00th=[ 464],
| 99.00th=[ 527], 99.50th=[ 575], 99.90th=[ 684], 99.95th=[ 718],
| 99.99th=[ 768]
bw ( KiB/s): min= 1232, max= 7104, per=1.56%, avg=3393.92, stdev=926.82, samples=586
iops : min= 308, max= 1776, avg=848.23, stdev=231.72, samples=586
lat (usec) : 10=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 10=0.03%, 20=0.02%, 50=0.45%, 100=2.48%
lat (msec) : 250=28.92%, 500=66.52%, 750=1.54%, 1000=0.03%
cpu : usr=0.31%, sys=3.75%, ctx=294368, majf=0, minf=9
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=0,254340,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
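One ZFS caveat worth checking (an assumption on my part, since the dataset settings aren't shown here): with the default recordsize=128k, every 4k random overwrite turns into a read-modify-write of a whole record, which by itself can account for a lot of this gap. Matching the recordsize to the workload is the usual first tuning step before re-running:

# dataset name is a placeholder
zfs get recordsize tank/test
zfs set recordsize=4k tank/test
# recordsize only applies to newly written blocks, so recreate the fio test file afterwards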
I hadn’t noticed it before, but the jump in latencies from microseconds to milliseconds really stands out.
This made me think about what could be causing it, and maybe I’m about to talk out of my ass, but perhaps this is a hardware limit I’m hitting. Could it be the PCI bridge/controller (CPU?) being unable to keep up with the random I/O, having to constantly switch between PCIe lanes? With a single disk it just has to open 4 lanes while fio does its randomness, but on a stripe set it receives a request, sends it to the correct disk, and then has to wait for fio to hand it the next one, which might belong on another 4 lanes (i.e., the other disk). It doesn’t know where to go until it gets the request, so it’s constantly switching PCIe lanes. Unlike with sequential data, there’s no predicting or buffering the I/O; with sequential, even when striped, it might be able to keep 2x 4 lanes open, but with random it starts switching between the two sets of 4 lanes and gets choked up? Does that make sense, or … ?
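For scale, here's a quick sanity check on how much data is actually crossing the bus during the raid10 run, against what a single x4 link can move (assuming PCIe 3.0 drives, which is my assumption; adjust for other generations):

# payload bandwidth of the raid10 run: 2679 IOPS x 4 KiB
echo "2679 * 4 / 1024" | bc -l   # ~10.5 MiB/s, matching the reported BW
# a PCIe 3.0 x4 link is good for roughly 3.9 GB/s in each direction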
I will test with just a mirror, no striping: RAID1 instead of RAID10.
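Something like this, with placeholder device names, reusing the exact fio command from above:

# placeholder devices; mirror only, no stripe
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
mkfs.ext4 /dev/md1
mount /dev/md1 /mnt/raid1test
cd /mnt/raid1test
fio --name=randwrite_t64qd256 --rw=randwrite --end_fsync=1 --direct=1 --bs=4k --ioengine=libaio --size=10G --numjobs=64 --iodepth=256 --time_based=1 --runtime=300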