ZFS - All-NVMe optimization

I’m setting up a new server with all-NVMe storage and wanted to get some baseline numbers before putting it into production.

Setup

  • Drives: Samsung PM9A3 (3.8TB)
  • Pool: 6 drives in ZFS RAID10 (mirror vdevs)
  • Ashift: 12
  • OS: Proxmox VE 9
  • ZFS settings: everything left at defaults unless noted
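
For reference, the pool layout corresponds to something like this (pool and device names are illustrative):

zpool create -o ashift=12 tank \
    mirror /dev/nvme0n1 /dev/nvme1n1 \
    mirror /dev/nvme2n1 /dev/nvme3n1 \
    mirror /dev/nvme4n1 /dev/nvme5n1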

What I’m Seeing

  • Single drive with XFS → Performance looks great, right in line with the rated specs.
  • Same drives in ZFS RAID10 → Performance is really bad. Not just below expectations, but nowhere close to where it should be.

Benchmarking Method

I ran all tests with FIO using this base config:

size=1G
ioengine=io_uring
iodepth=32
numjobs=1
direct=1
runtime=60
time_based
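
As a concrete example, the 4k random-write run would be invoked with the base config plus per-test options along these lines (job name and path are illustrative):

fio --name=rand_write --directory=/tank/bench --rw=randwrite --bs=4k \
    --size=1G --ioengine=io_uring --iodepth=32 --numjobs=1 \
    --direct=1 --runtime=60 --time_based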

I varied the bs depending on the test. On the ZFS side I also played with the following (example commands after the list):

  • primarycache
  • secondarycache
  • recordsize
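
For example, the "NO Cache" 16k runs map to settings along these lines (dataset name is illustrative, and I'm treating "NO Cache" as primarycache=none):

zfs set primarycache=none tank/bench
zfs set secondarycache=none tank/bench
zfs set recordsize=16k tank/bench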

Test charts are labeled as (FIO block size / ZFS recordsize).

Example: (4k/128k) = 4k read/write with a 128k ZFS record size.


Question for the group:

Why would performance tank so badly when moving from single-drive XFS to a 6-disk ZFS RAID10 setup? Am I missing something obvious in ZFS tuning for NVMe, or could this be related to the way Proxmox and ZFS interact?

Any tips on what to check next (recordsize, caching, queue depths, etc.) would be much appreciated.
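
For anyone wanting to sanity-check or reproduce this, the relevant settings can be confirmed with (pool/dataset names illustrative):

zpool get ashift tank
zfs get recordsize,primarycache,secondarycache,compression,sync,atime tank/bench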


4k/128k record-size (NO Cache)
job         read_iops  read_MB/s  read_avg_ms  p99_read_ms  write_iops  write_MB/s  write_avg_ms  p99_write_ms
----------  ---------  ---------  -----------  -----------  ----------  ----------  ------------  ------------
seq_read        55607      227.8         0.57         1.27           0         0.0           0.0           0.0
seq_write           0        0.0          0.0          0.0      157557       645.4           0.2          0.54
rand_read       92540      379.0         0.34         0.63           0         0.0           0.0           0.0
rand_write          0        0.0          0.0          0.0        9265        38.0          3.45          11.6
rand_rw         25943      106.3         0.09          0.5       11125        45.6          2.66           9.9



4k/4k record-size (NO Cache)
job         read_iops  read_MB/s  read_avg_ms  p99_read_ms  write_iops  write_MB/s  write_avg_ms  p99_write_ms
----------  ---------  ---------  -----------  -----------  ----------  ----------  ------------  ------------
seq_read       115068      471.3         0.28         0.56           0         0.0           0.0           0.0
seq_write           0        0.0          0.0          0.0       28073       115.0          1.14          1.27
rand_read       98233      402.4         0.32         0.48           0         0.0           0.0           0.0
rand_write          0        0.0          0.0          0.0       23679        97.0          1.35          1.63
rand_rw         55780      228.5         0.09         0.14       23891        97.9          1.12           1.4


4k/4k record-size (w/Cache)
job         read_iops  read_MB/s  read_avg_ms  p99_read_ms  write_iops  write_MB/s  write_avg_ms  p99_write_ms
----------  ---------  ---------  -----------  -----------  ----------  ----------  ------------  ------------
seq_read       212126      868.9         0.15         0.32           0         0.0           0.0           0.0
seq_write           0        0.0          0.0          0.0       28480       116.7          1.12          1.29
rand_read      204612      838.1         0.15         0.24           0         0.0           0.0           0.0
rand_write          0        0.0          0.0          0.0       23328        95.6          1.37          1.61
rand_rw         55156      225.9         0.09         0.14       23621        96.8          1.14          1.38

4k/16k record-size (w/Cache)
job         read_iops  read_MB/s  read_avg_ms  p99_read_ms  write_iops  write_MB/s  write_avg_ms  p99_write_ms
----------  ---------  ---------  -----------  -----------  ----------  ----------  ------------  ------------
seq_read       159751      654.3          0.2         0.45           0         0.0           0.0           0.0
seq_write           0        0.0          0.0          0.0      274093      1122.7          0.12          0.33
rand_read      180406      738.9         0.18         0.29           0         0.0           0.0           0.0
rand_write          0        0.0          0.0          0.0       91839       376.2          0.35          4.55
rand_rw        115432      472.8         0.17         0.66       49509       202.8          0.24          5.41

4k/16k record-size (NO Cache)
job         read_iops  read_MB/s  read_avg_ms  p99_read_ms  write_iops  write_MB/s  write_avg_ms  p99_write_ms
----------  ---------  ---------  -----------  -----------  ----------  ----------  ------------  ------------
seq_read       109383      448.0         0.29         0.64           0         0.0           0.0           0.0
seq_write           0        0.0          0.0          0.0       77803       318.7          0.41          0.87
rand_read       97661      400.0         0.33          0.5           0         0.0           0.0           0.0
rand_write          0        0.0          0.0          0.0       15276        62.6          2.09          6.91
rand_rw         33002      135.2         0.07         0.18       14147        57.9          2.08          6.85

XFS NVMe drive, 4k block size
job         read_iops  read_MB/s  read_avg_ms  p99_read_ms  write_iops  write_MB/s  write_avg_ms  p99_write_ms
----------  ---------  ---------  -----------  -----------  ----------  ----------  ------------  ------------
seq_read       573044     2347.2         0.05         0.06           0         0.0           0.0           0.0
seq_write           0        0.0          0.0          0.0      391306      1602.8          0.08           0.1
rand_read      433023     1773.7         0.07         0.13           0         0.0           0.0           0.0
rand_write          0        0.0          0.0          0.0      354747      1453.0          0.09          0.11
rand_rw        197536      809.1         0.15         0.52       84672       346.8          0.02          0.04

XFS NVMe drive, 128k block size
job         read_iops  read_MB/s  read_avg_ms  p99_read_ms  write_iops  write_MB/s  write_avg_ms  p99_write_ms
----------  ---------  ---------  -----------  -----------  ----------  ----------  ------------  ------------
seq_read        50125     6570.1         0.63         0.88           0         0.0           0.0           0.0
seq_write           0        0.0          0.0          0.0       31400      4115.8          1.02          1.09
rand_read       43914     5756.1         0.72         1.97           0         0.0           0.0           0.0
rand_write          0        0.0          0.0          0.0       12220      1601.9          2.61          3.59
rand_rw         15551     2038.4          0.9         1.88        6667       874.0          2.69          3.62

Just today I tested 8x 3.6TB in mdadm RAID 10 + XFS: fio seq_read >34 GB/s and seq_write >14.5 GB/s (= write >3.6 GB/s per NVMe), and I've seen these throughput numbers hold steady in parallel with iostat -xm 1.

For comparison and troubleshooting, you may also want to benchmark each of the following (setup sketches after the list):

  1. Single drive with ZFS
  2. Same drives in mdraid RAID 10 and XFS
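
A minimal sketch of those two setups, assuming the same drives (device names are placeholders):

# 1. Single drive with ZFS
zpool create -o ashift=12 single /dev/nvme0n1

# 2. mdadm RAID 10 + XFS
mdadm --create /dev/md0 --level=10 --raid-devices=6 /dev/nvme[0-5]n1
mkfs.xfs /dev/md0
mount /dev/md0 /mnt/bench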