Performance on Intel DC P4510

Cupbearer · July 13, 2023, 5:49pm

I spent almost 2000 euros on 4 enterprise ssd’s, Intel DC P4510, them some on 5 consumer SATA ssd’s to replace a pool of 8x 8TB spinning rust. The reasoning is that I wanted to split the single large pool into a “backup/long storage/archive” pool and “work” pool. I always found some aspects of performance with zfs lackluster; mainly small block random reads and writes. But the spinners werent fast to begin with and all things considered, benefits of zfs outweighed the small performance hit.

Except, the performance hit on these ssd’s is … rediculous? I must be doing something wrong so please help me out!

I tested each Intel SSD as a single disk, each drive performance is on par with its specs. For example:
fio --name=randwrite_t64qd256 --rw=randwrite --end_fsync=1 --direct=1 --bs=4k --ioengine=libaio --size=10G --numjobs=64 --iodepth=256

Note: I am aware that that end_fsync is probably useless with direct, but it also doesn’t hurt?
Note2: I also tested with posixaio, performance is roughly half of libaio
Note3: I tested with t1qd1 and moving up to 64/256. Drive didn’t “choke”. As can be read on other sites as well, these drives won’t saturate until higher numbers are used (makes sense with bs=4k). t8qd16 and t16qd64 gave similar results.
Note4: I tested with fio direct to device as well (/dev/nvmeXn1). That hits the theoretical limits of the drive.

All results below are from the above fio command. I’ve been at this the better part of the day with different fio commands, all results are very similar to these.

Single disk, direct to device:

IOPS	    MB/s
524000      2052

Single disk, EXT4:

IOPS	    MB/s
252267	    985,6

ZFS 2x mirror, rs=64k:

IOPS	    MB/s
17154	    67

mdadm raid10, different configs, ext4, all gave similar results:

IOPS	    MB/s
172776	    674,86

btrfs raid10:

IOPS	    MB/s
117073	    457,44

What am I missing here? I must be doing something wrong, right?

Why is every test I do slower than single disk performance?
And why is ZFS so much worse than the rest? Its rediculous …

When I started searching for this it turned out I am not alone. Plenty of forums (r/zfs ), zfs github, sth, etc. As I found something on the zfs mailinglist and github issues, I tried out some things.

cat /sys/kernel/mm/ksm/pages_shared
cat /sys/kernel/mm/ksm/pages_sharing
cat /sys/kernel/mm/ksm/full_scans
cat /sys/kernel/mm/ksm/run

Should all be 0, and they are.

cat /etc/modprobe.d/zfs-options.conf

options zfs zfs_arc_min=4294967296        # min 4GB
options zfs zfs_arc_max=68719476736       # max 64GB
options zfs zfs_compressed_arc_enabled=0  # Disable compressed ARC
options zfs zfs_abd_scatter_enabled=0     # Disable ADB Scatter
options zfs zvol_prefetch_bytes=1048576   # 1MB zvol prefetch

I also tried adding these to the above:

# EXPERIMENTAL
options zfs zfs_dirty_data_max_percent=30
options zfs zfs_txg_timeout=100
options zfs zfs_vdev_async_read_max_active=2048
options zfs zfs_vdev_async_read_min_active=1024
options zfs zfs_vdev_async_write_max_active=2048
options zfs zfs_vdev_async_write_min_active=1024
options zfs zfs_vdev_queue_depth_pct=100
options zfs zfs_vdev_sync_read_max_active=2048
options zfs zfs_vdev_sync_read_min_active=1024
options zfs zfs_vdev_sync_write_max_active=2048
options zfs zfs_vdev_sync_write_min_active=1024

Nothing mattered (much).

Cupbearer · July 14, 2023, 9:23am

So Im still having a go on this.

This is the fio output of 1 job, single disk ext4:
# fio --name=randwrite4k --rw=randwrite --end_fsync=1 --bs=4k --ioengine=libaio --size=100G --numjobs=4 --iodepth=64

randwrite4k: (groupid=0, jobs=1): err= 0: pid=3873029: Wed Jul 12 18:39:47 2023
  write: IOPS=97.5k, BW=381MiB/s (400MB/s)(100GiB/268734msec); 0 zone resets
    slat (nsec): min=1791, max=51693k, avg=8753.03, stdev=309844.75
    clat (nsec): min=1624, max=52071k, avg=627684.27, stdev=2456463.64
     lat (usec): min=7, max=52074, avg=636.44, stdev=2475.79
    clat percentiles (usec):
     |  1.00th=[  212],  5.00th=[  223], 10.00th=[  237], 20.00th=[  260],
     | 30.00th=[  285], 40.00th=[  314], 50.00th=[  338], 60.00th=[  359],
     | 70.00th=[  383], 80.00th=[  420], 90.00th=[  502], 95.00th=[  570],
     | 99.00th=[18744], 99.50th=[23200], 99.90th=[27395], 99.95th=[28443],
     | 99.99th=[30802]
   bw (  KiB/s): min=40144, max=1055376, per=26.20%, avg=402095.65, stdev=114228.86, samples=521
   iops        : min=10036, max=263844, avg=100523.89, stdev=28557.23, samples=521
  lat (usec)   : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
  lat (usec)   : 250=15.89%, 500=73.94%, 750=7.78%, 1000=0.54%
  lat (msec)   : 2=0.34%, 4=0.22%, 10=0.09%, 20=0.32%, 50=0.88%
  lat (msec)   : 100=0.01%
  cpu          : usr=11.28%, sys=42.93%, ctx=131628, majf=0, minf=11
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,26214400,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Compare that with this one:
root@paradise ~/test # fio --name=randwrite_t64qd256 --rw=randwrite --end_fsync=1 --direct=1 --bs=4k --ioengine=libaio --size=10G --numjobs=64 --iodepth=256 --time_based=1 --runtime=300

randwrite_t64qd256: (groupid=0, jobs=1): err= 0: pid=4048277: Fri Jul 14 11:02:18 2023
  write: IOPS=4015, BW=15.7MiB/s (16.4MB/s)(4709MiB/300228msec); 0 zone resets
    slat (usec): min=5, max=507423, avg=84.55, stdev=3296.02
    clat (nsec): min=960, max=11622M, avg=63662684.84, stdev=181892728.83
     lat (usec): min=26, max=11622k, avg=63747.24, stdev=181929.43
    clat percentiles (usec):
     |  1.00th=[     60],  5.00th=[    322], 10.00th=[    750],
     | 20.00th=[   1401], 30.00th=[   2114], 40.00th=[   2769],
     | 50.00th=[   3326], 60.00th=[   3785], 70.00th=[   4621],
     | 80.00th=[  10421], 90.00th=[ 229639], 95.00th=[ 574620],
     | 99.00th=[ 775947], 99.50th=[ 884999], 99.90th=[1300235],
     | 99.95th=[1602225], 99.99th=[2432697]
   bw (  KiB/s): min=    8, max=302632, per=1.63%, avg=16261.89, stdev=26034.18, samples=593
   iops        : min=    2, max=75658, avg=4065.41, stdev=6508.53, samples=593
  lat (nsec)   : 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 20=0.01%, 50=0.83%, 100=0.91%
  lat (usec)   : 250=2.28%, 500=3.13%, 750=2.85%, 1000=2.81%
  lat (msec)   : 2=15.64%, 4=35.64%, 10=15.57%, 20=4.70%, 50=2.95%
  lat (msec)   : 100=0.49%, 250=2.61%, 500=3.51%, 750=4.92%, 1000=0.91%
  lat (msec)   : 2000=0.22%, >=2000=0.02%
  cpu          : usr=0.72%, sys=6.46%, ctx=236633, majf=0, minf=12
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=0,1205585,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Same command on mdadm raid10 ext4:

randwrite_t64qd256: (groupid=0, jobs=1): err= 0: pid=278610: Thu Jul 13 16:12:21 2023
  write: IOPS=2679, BW=10.5MiB/s (11.0MB/s)(3142MiB/300229msec); 0 zone resets
    slat (usec): min=8, max=1060.0k, avg=363.38, stdev=9893.57
    clat (nsec): min=985, max=1147.2M, avg=95140049.19, stdev=153367241.85
     lat (usec): min=29, max=1147.3k, avg=95503.43, stdev=153636.37
    clat percentiles (msec):
     |  1.00th=[    8],  5.00th=[    9], 10.00th=[   10], 20.00th=[   13],
     | 30.00th=[   17], 40.00th=[   21], 50.00th=[   28], 60.00th=[   37],
     | 70.00th=[   50], 80.00th=[   77], 90.00th=[  388], 95.00th=[  435],
     | 99.00th=[  518], 99.50th=[  625], 99.90th=[ 1003], 99.95th=[ 1083],
     | 99.99th=[ 1133]
   bw (  KiB/s): min=    8, max=78744, per=1.58%, avg=10889.99, stdev=7895.15, samples=590
   iops        : min=    2, max=19686, avg=2722.39, stdev=1973.80, samples=590
  lat (nsec)   : 1000=0.01%
  lat (usec)   : 2=0.01%, 20=0.01%, 50=0.01%, 100=0.01%, 250=0.02%
  lat (usec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.02%, 4=0.05%, 10=11.08%, 20=27.50%, 50=31.55%
  lat (msec)   : 100=11.84%, 250=0.74%, 500=15.74%, 750=0.95%, 1000=0.36%
  lat (msec)   : 2000=0.10%
  cpu          : usr=0.56%, sys=7.71%, ctx=562759, majf=0, minf=13
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=0,804346,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

and finally, zfs:

randwrite_t64qd256: (groupid=0, jobs=1): err= 0: pid=11728: Thu Jul 13 17:16:01 2023
  write: IOPS=846, BW=3387KiB/s (3468kB/s)(994MiB/300380msec); 0 zone resets
    slat (usec): min=9, max=250932, avg=1163.39, stdev=6194.68
    clat (usec): min=5, max=770787, avg=300621.30, stdev=107685.65
     lat (usec): min=790, max=771327, avg=301784.69, stdev=107629.21
    clat percentiles (msec):
     |  1.00th=[   65],  5.00th=[  121], 10.00th=[  153], 20.00th=[  203],
     | 30.00th=[  243], 40.00th=[  279], 50.00th=[  305], 60.00th=[  334],
     | 70.00th=[  363], 80.00th=[  397], 90.00th=[  439], 95.00th=[  464],
     | 99.00th=[  527], 99.50th=[  575], 99.90th=[  684], 99.95th=[  718],
     | 99.99th=[  768]
   bw (  KiB/s): min= 1232, max= 7104, per=1.56%, avg=3393.92, stdev=926.82, samples=586
   iops        : min=  308, max= 1776, avg=848.23, stdev=231.72, samples=586
  lat (usec)   : 10=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 10=0.03%, 20=0.02%, 50=0.45%, 100=2.48%
  lat (msec)   : 250=28.92%, 500=66.52%, 750=1.54%, 1000=0.03%
  cpu          : usr=0.31%, sys=3.75%, ctx=294368, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=0,254340,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

I hadn’t noticed it before, but the jump in latencies from micro to milliseconds …

This made me think about what could be causing this and maybe I’m about to talk out of my ass but perhaps this is a hardware limit Im hitting… could it be caused by pci bridge/controller (cpu?) being unable to keep up with the random I/O, having to constantly switch between PCI lanes? With a single disk it just has to open 4 lanes while fio does its randomness, but on a stripe set, it receives a request, sends it to the the correct disk and has to wait for fio to hand it the next request, which might be on another 4 lanes (ie, other disk) it doesn’t know where to go until it gets the request so its constantly swithing pci lanes. Unlike with sequential data, there’s no predicting or buffering the I/O and in the case of sequential, even when striped, it might be able to keep 2x 4 lanes open but with random, it starts switching between the 2 sets of 4 lanes and it gets choked up? Does that make sense or … ?

I will test with just a mirror, no striping, raid1 instead of raid10.

Cupbearer · July 15, 2023, 2:58pm

Mirrors perform just the same, mdadm or zfs its the same as with raid10.

I left the mdadm raid10 overnight. This morning I had to reboot, clean shutdown and mdadm came back missing 1 disk. I couldn’t figure out why, it just said 1 disk was removed. I added it back and it started resyncing but I didn’t wait for that. I ditched mdadm and its back to zfs, even if its slow under some circumstances.

This is the last bit money I will spend on this, I got this card to hook up the 4 U.2 SSD’s.

This and similar cards have been discussed over at STH quite extensively, and have been tested really well.

If I am right about the lane switching this card might just solve that issue since it doesn’t rely on bifurcation. Each port has the full 16 lanes available.