Troubleshooting poor zvol performance over iSCSI

I’m having a weird problem with pool performance. I added another disk to my existing 2-way mirror, expecting zvol read performance to improve, and I also removed the L2ARC and SLOG (both M.2 NVMe) for testing purposes, but none of these changes made any difference to zvol reads or writes. I use this zvol over iSCSI and manage the iSCSI side with targetcli. Setting sync=disabled versus sync=always on the zvol under test made absolutely no difference to write performance over iSCSI either. When I create a ramdisk backstore through targetcli instead, I get the full 10G network bandwidth on both reads and writes, so the problem is not on the network side. Any ideas where to look next?
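For reference, the ramdisk test was roughly the following (the IQN and size here are placeholders, not my exact configuration):

targetcli /backstores/ramdisk create name=ramtest size=8G
targetcli /iscsi/iqn.2003-01.org.linux-iscsi.host:target1/tpg1/luns create /backstores/ramdisk/ramtest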

What’s the volblocksize on that zvol, and what workload are you testing it with?

If you’ve got a small volblocksize (which most do), you’re probably hitting IOPS limits. And if you’re testing with a single-process read, prefetch can’t do much to take advantage of the additional vdev; you’d see the performance increase with higher parallelism in the storage workload.
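If you want to see that effect directly, something like this with fio (the zvol path is a placeholder) compares one sequential reader against eight running in parallel; watch the aggregate throughput:

# one sequential reader
fio --name=seq1 --filename=/dev/zvol/tank/testvol --rw=read --bs=1M \
    --direct=1 --numjobs=1 --time_based --runtime=30 --group_reporting
# eight parallel sequential readers, each starting at a different offset
fio --name=seq8 --filename=/dev/zvol/tank/testvol --rw=read --bs=1M \
    --direct=1 --numjobs=8 --offset_increment=10% --time_based --runtime=30 --group_reporting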

Yes, I was using a single zvol over iSCSI for these tests. I did notice that most of the problem was related to the volblocksize of the zvol, which I subsequently increased to 32K, and I formatted the zvol with NTFS using a matching 32K allocation unit size. After these changes performance almost doubled and I started to see a benefit from the SLOG.
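Since volblocksize is fixed when a zvol is created, that meant recreating the zvol. Roughly (the pool/zvol names, sizes, and drive letter here are examples, not my exact commands):

# recreate the zvol with a 32K volblocksize
zfs create -o volblocksize=32K -V 1T tank/iscsi-vol
# then, from the Windows initiator, format with a matching 32K NTFS allocation unit
format E: /FS:NTFS /A:32K /Q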

P.S. Do you know whether it is possible to disable the LIO iSCSI target’s write-through caching on the zvol backstore, so that the system relies on the ZFS ARC alone? I’m using targetcli for the configuration (as probably most do).

Sorry; I don’t have extensive iSCSI experience, so I can’t help way down in the weeds on that side of things.

Increasing volblocksize to 32K was a good move, and would have been even without the NTFS tuning. How are you testing performance after your changes?

I’m running read and write tests with CrystalDiskMark on a Windows box (a physical box with a 10G NIC). The current setup is meant to serve as a generic disk server while also running light virtualization workloads, including a Ubiquiti device controller in a container, a pfSense firewall, etc. The virtualization workloads run from a separate 2 TB 2-way NVMe mirror pool. The CPU is a 32-core EPYC 7551P @ 2 GHz, and the box has 128 GB of RAM, of which I have allocated 110 GB to the ZFS ARC. ARC hit rates are consistently in the high 80% range, and I’m now relatively happy with the performance (at least for a single-host scenario).

I’m possibly moving the iSCSI storage workload to a dedicated storage server soon(ish), with a CPU with better single-core performance, possibly 256 GB of RAM, and an Intel P3700 SLOG. For the storage itself I’m planning on running possibly two 3-way mirrors of 8 TB SAS disks. I’m also using a Mellanox 40G NIC on the server side, currently and in the future, as it shows excellent performance and compatibility. The actual use for this storage includes running bigger VM disks and some virtual disks for physical Windows hosts, for mixed use.
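For what it’s worth, the ARC allocation is just the zfs_arc_max module parameter; roughly 110 GiB comes out to 118111600640 bytes (110 x 1024^3). The runtime tunable is shown below; the persistent equivalent is an options zfs line in /etc/modprobe.d/zfs.conf:

# as root: cap the ARC at ~110 GiB
echo 118111600640 > /sys/module/zfs/parameters/zfs_arc_max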

So, IIRC CrystalDiskMark does a lot of single-process tests. You’re not going to see those improve much, if any, from adding or expanding vdevs. You see big improvements from topology changes with multi-process workloads, for the most part.

If you pump iodepth up high enough, you’ll start to see improvement on single-process workloads as well, but it comes at a cost: high iodepth increases latency as well as throughput.

You don’t generally get to control the iodepth of your real workloads, mind you; that’s generally something set by the developers of the software you’re using and not often something you can monkey with easily yourself. It’s easy enough to manipulate for your test workloads if you’re using a proper storage benchmark like fio, though, in which case you’d generally want to set iodepth to a figure that roughly matches the way your normal workload behaves.
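For example, comparing queue depths with fio might look like this (the zvol path is a placeholder); watch both IOPS and completion latency:

# queue depth 1 vs. 16 for 64K random reads
fio --name=qd1 --filename=/dev/zvol/tank/testvol --rw=randread --bs=64k \
    --ioengine=libaio --direct=1 --iodepth=1 --time_based --runtime=30 --group_reporting
fio --name=qd16 --filename=/dev/zvol/tank/testvol --rw=randread --bs=64k \
    --ioengine=libaio --direct=1 --iodepth=16 --time_based --runtime=30 --group_reporting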

Recently I’ve been testing zvol performance on a zfs SSD raid0 stripe.

  • I compared performance of qcow2 / zvol & lvm + the various volblocksize options on the zvols:
    • 16K / 32K / 64K / 128K
  • I also tested virtio-blk versus virtio-scsi
  • The zvol performance is so good I’m going to stop passing through my 2nd nvme to Windows & instead put the os on a zvol on the nvme
  • the configuration below showed basically bare-metal read performance in the built-in winsat tool:
> Disk  Random 16.0 Read                       1155.78 MB/s          8.9
> Disk  Sequential 64.0 Read                   8193.54 MB/s          9.9
> Disk  Sequential 64.0 Write                  5185.20 MB/s          9.7

I am more interested in speed than redundancy. The 2 x SSDs form a striped pool for an SSD Steam library, but they are also partitioned to provide special devices for my main mirrored pool on spinning SATA + the new SATA stripe for Windows / Steam.

  • I created the striped (raid0) SSD pool with:
zpool create -f -o ashift=12 -m /mnt/ssd1 ssd1 \
    ata-INTEL_SSDSC2KG019T8_PHYG9170009C1P9DGN-part4 \
    ata-SAMSUNG_MZ7KH1T9HAJR-00005_S47PNE0M508088-part4
  • I created a -s (sparse) zvol with a 64K volblocksize:
zfs create -o volblocksize=64K -s -V 3.16TB ssd1/windows

On the zvol I created:

  • 16MB partition type Microsoft reserved (type 10 in fdisk)
  • balance of the space as partition type Microsoft basic data (type 11 in fdisk)
  • inside my Windows vm in Disk Management I formatted the zvol (which I think uses a 64K cluster size by default)
  • tested performance with the built in Windows tool winsat

First round of testing:

  • SSD => qcow2 / zvol / lvm
  • Existing SCSI whole-device passthrough of a Toshiba Enterprise SATA as a comparison (which is in the process of being migrated to a zvol on a SATA stripe with a striped special device on the SSDs; I have approx 9.9 petabytes of writes left on the SSDs)
2 x 1.92TB Enterprise SSD tests
===============================

QCOW2:
========================
Z:\>winsat disk -drive Q

Windows System Assessment Tool
> Running: Feature Enumeration ''
> Run Time 00:00:00.00
> Running: Storage Assessment '-drive Q -ran -read'
> Run Time 00:00:00.41
> Running: Storage Assessment '-drive Q -seq -read'
> Run Time 00:00:01.16
> Running: Storage Assessment '-drive Q -seq -write'
> Run Time 00:00:00.73
> Running: Storage Assessment '-drive Q -flush -seq'
> Run Time 00:00:00.45
> Running: Storage Assessment '-drive Q -flush -ran'
> Run Time 00:00:00.42
> Dshow Video Encode Time                      0.00000 s
> Dshow Video Decode Time                      0.00000 s
> Media Foundation Decode Time                 0.00000 s
> Disk  Random 16.0 Read                       59.18 MB/s          6.7
> Disk  Sequential 64.0 Read                   3156.85 MB/s          9.3
> Disk  Sequential 64.0 Write                  2859.21 MB/s          9.2
> Average Read Time with Sequential Writes     0.097 ms          8.8
> Latency: 95th Percentile                     0.172 ms          8.9
> Latency: Maximum                             2.063 ms          8.8
> Average Read Time with Random Writes         0.099 ms          8.9
> Total Run Time 00:00:03.33

-------------------------------------------------------------------------------------------

LVM STRIPE
========================

Z:\>winsat disk -drive R
Windows System Assessment Tool
> Running: Feature Enumeration ''
> Run Time 00:00:00.00
> Running: Storage Assessment '-drive R -ran -read'
> Run Time 00:00:00.13
> Running: Storage Assessment '-drive R -seq -read'
> Run Time 00:00:01.47
> Running: Storage Assessment '-drive R -seq -write'
> Run Time 00:00:01.33
> Running: Storage Assessment '-drive R -flush -seq'
> Run Time 00:00:00.66
> Running: Storage Assessment '-drive R -flush -ran'
> Run Time 00:00:00.66
> Dshow Video Encode Time                      0.00000 s
> Dshow Video Decode Time                      0.00000 s
> Media Foundation Decode Time                 0.00000 s
> Disk  Random 16.0 Read                       433.61 MB/s          8.2
> Disk  Sequential 64.0 Read                   757.50 MB/s          8.3
> Disk  Sequential 64.0 Write                  672.60 MB/s          8.2
> Average Read Time with Sequential Writes     0.222 ms          8.6
> Latency: 95th Percentile                     0.416 ms          8.7
> Latency: Maximum                             3.967 ms          8.6
> Average Read Time with Random Writes         0.242 ms          8.8
> Total Run Time 00:00:04.34


ZVOL:
=========================

Z:\>winsat disk -drive V
Windows System Assessment Tool
> Running: Feature Enumeration ''
> Run Time 00:00:00.00
> Running: Storage Assessment '-drive V -ran -read'
> Run Time 00:00:00.11
> Running: Storage Assessment '-drive V -seq -read'
> Run Time 00:00:01.19
> Running: Storage Assessment '-drive V -seq -write'
> Run Time 00:00:00.75
> Running: Storage Assessment '-drive V -flush -seq'
> Run Time 00:00:00.70
> Running: Storage Assessment '-drive V -flush -ran'
> Run Time 00:00:00.47
> Dshow Video Encode Time                      0.00000 s
> Dshow Video Decode Time                      0.00000 s
> Media Foundation Decode Time                 0.00000 s
> Disk  Random 16.0 Read                       844.00 MB/s          8.6
> Disk  Sequential 64.0 Read                   2533.42 MB/s          9.1
> Disk  Sequential 64.0 Write                  2757.42 MB/s          9.2
> Average Read Time with Sequential Writes     0.097 ms          8.8
> Latency: 95th Percentile                     0.133 ms          8.9
> Latency: Maximum                             0.421 ms          8.9
> Average Read Time with Random Writes         0.097 ms          8.9
> Total Run Time 00:00:03.31


SATA HARD DRIVE PASSTHROUGH
===========================

Z:\>winsat disk -drive E
Windows System Assessment Tool
> Running: Feature Enumeration ''
> Run Time 00:00:00.00
> Running: Storage Assessment '-drive E -ran -read'
> Run Time 00:00:07.44
> Running: Storage Assessment '-drive E -seq -read'
> Run Time 00:00:03.25
> Running: Storage Assessment '-drive E -seq -write'
> Run Time 00:00:03.92
> Running: Storage Assessment '-drive E -flush -seq'
> Run Time 00:00:06.39
> Running: Storage Assessment '-drive E -flush -ran'
> Run Time 00:00:07.41
> Dshow Video Encode Time                      0.00000 s
> Dshow Video Decode Time                      0.00000 s
> Media Foundation Decode Time                 0.00000 s
> Disk  Random 16.0 Read                       2.13 MB/s          4.3
> Disk  Sequential 64.0 Read                   136.18 MB/s          7.0
> Disk  Sequential 64.0 Write                  155.47 MB/s          7.1
> Average Read Time with Sequential Writes     6.563 ms          5.5
> Latency: 95th Percentile                     19.058 ms          4.7
> Latency: Maximum                             90.300 ms          7.7
> Average Read Time with Random Writes         7.283 ms          5.2
> Total Run Time 00:00:28.52

With the zvol the clear performance winner, I then experimented with various volblocksizes (16K / 32K / 64K / 128K):

  • TLDR: 64K block sizes won as I let Windows format the drive with the default NTFS cluster size of 64K
16K BLOCKSIZE
=============

PS C:\WINDOWS\system32> winsat disk -drive G
Windows System Assessment Tool
> Running: Feature Enumeration ''
> Run Time 00:00:00.00
> Running: Storage Assessment '-drive G -ran -read'
> Run Time 00:00:00.30
> Running: Storage Assessment '-drive G -seq -read'
> Run Time 00:00:08.41
> Running: Storage Assessment '-drive G -seq -write'
> Run Time 00:00:33.24
> Running: Storage Assessment '-drive G -flush -seq'
> Run Time 00:00:00.72
> Running: Storage Assessment '-drive G -flush -ran'
> Run Time 00:00:00.45
> Dshow Video Encode Time                      0.00000 s
> Dshow Video Decode Time                      0.00000 s
> Media Foundation Decode Time                 0.00000 s
> Disk  Random 16.0 Read                       870.43 MB/s          8.7
> Disk  Sequential 64.0 Read                   4390.17 MB/s          9.5
> Disk  Sequential 64.0 Write                  2799.89 MB/s          9.2
> Average Read Time with Sequential Writes     0.084 ms          8.8
> Latency: 95th Percentile                     0.123 ms          8.9
> Latency: Maximum                             0.561 ms          8.9
> Average Read Time with Random Writes         0.090 ms          8.9
> Total Run Time 00:00:43.22


32K BLOCKSIZE
=============

PS C:\WINDOWS\system32> winsat disk -drive H
Windows System Assessment Tool
> Running: Feature Enumeration ''
> Run Time 00:00:00.00
> Running: Storage Assessment '-drive H -ran -read'
> Run Time 00:00:00.11
> Running: Storage Assessment '-drive H -seq -read'
> Run Time 00:00:01.16
> Running: Storage Assessment '-drive H -seq -write'
> Run Time 00:00:00.76
> Running: Storage Assessment '-drive H -flush -seq'
> Run Time 00:00:00.42
> Running: Storage Assessment '-drive H -flush -ran'
> Run Time 00:00:00.41
> Dshow Video Encode Time                      0.00000 s
> Dshow Video Decode Time                      0.00000 s
> Media Foundation Decode Time                 0.00000 s
> Disk  Random 16.0 Read                       917.66 MB/s          8.7
> Disk  Sequential 64.0 Read                   4330.72 MB/s          9.5
> Disk  Sequential 64.0 Write                  2933.63 MB/s          9.2
> Average Read Time with Sequential Writes     0.084 ms          8.8
> Latency: 95th Percentile                     0.144 ms          8.9
> Latency: Maximum                             0.278 ms          8.9
> Average Read Time with Random Writes         0.087 ms          8.9
> Total Run Time 00:00:02.97


64K BLOCKSIZE
=============

PS C:\WINDOWS\system32> winsat disk -drive I
Windows System Assessment Tool
> Running: Feature Enumeration ''
> Run Time 00:00:00.00
> Running: Storage Assessment '-drive I -ran -read'
> Run Time 00:00:00.11
> Running: Storage Assessment '-drive I -seq -read'
> Run Time 00:00:01.14
> Running: Storage Assessment '-drive I -seq -write'
> Run Time 00:00:00.75
> Running: Storage Assessment '-drive I -flush -seq'
> Run Time 00:00:00.45
> Running: Storage Assessment '-drive I -flush -ran'
> Run Time 00:00:00.42
> Dshow Video Encode Time                      0.00000 s
> Dshow Video Decode Time                      0.00000 s
> Media Foundation Decode Time                 0.00000 s
> Disk  Random 16.0 Read                       918.52 MB/s          8.7
> Disk  Sequential 64.0 Read                   4442.18 MB/s          9.5
> Disk  Sequential 64.0 Write                  2990.89 MB/s          9.2
> Average Read Time with Sequential Writes     0.086 ms          8.8
> Latency: 95th Percentile                     0.115 ms          8.9
> Latency: Maximum                             0.375 ms          8.9
> Average Read Time with Random Writes         0.081 ms          8.9
> Total Run Time 00:00:03.00

128K BLOCKSIZE
==============

PS C:\WINDOWS\system32> winsat disk -drive J
Windows System Assessment Tool
> Running: Feature Enumeration ''
> Run Time 00:00:00.00
> Running: Storage Assessment '-drive J -ran -read'
> Run Time 00:00:00.11
> Running: Storage Assessment '-drive J -seq -read'
> Run Time 00:00:01.22
> Running: Storage Assessment '-drive J -seq -write'
> Run Time 00:00:00.74
> Running: Storage Assessment '-drive J -flush -seq'
> Run Time 00:00:00.76
> Running: Storage Assessment '-drive J -flush -ran'
> Run Time 00:00:00.47
> Dshow Video Encode Time                      0.00000 s
> Dshow Video Decode Time                      0.00000 s
> Media Foundation Decode Time                 0.00000 s
> Disk  Random 16.0 Read                       873.73 MB/s          8.7
> Disk  Sequential 64.0 Read                   2704.19 MB/s          9.2
> Disk  Sequential 64.0 Write                  2791.04 MB/s          9.2
> Average Read Time with Sequential Writes     0.099 ms          8.8
> Latency: 95th Percentile                     0.137 ms          8.9
> Latency: Maximum                             0.605 ms          8.9
> Average Read Time with Random Writes         0.096 ms          8.9
> Total Run Time 00:00:03.41

As a final test:

  • compare zvol performance with virtio-blk versus virtio-scsi at 64K volblocksize (NB: in my case the underlying NTFS filesystem is using the default 64K cluster size)
  • TLDR: virtio-blk is still faster, by about 9-10%:
VIRTIO BLK
==========

PS C:\WINDOWS\system32> winsat disk -drive G
Windows System Assessment Tool
> Running: Feature Enumeration ''
> Run Time 00:00:00.00
> Running: Storage Assessment '-drive G -ran -read'
> Run Time 00:00:00.26
> Running: Storage Assessment '-drive G -seq -read'
> Run Time 00:00:01.11
> Running: Storage Assessment '-drive G -seq -write'
> Run Time 00:00:00.66
> Running: Storage Assessment '-drive G -flush -seq'
> Run Time 00:00:00.28
> Running: Storage Assessment '-drive G -flush -ran'
> Run Time 00:00:00.28
> Dshow Video Encode Time                      0.00000 s
> Dshow Video Decode Time                      0.00000 s
> Media Foundation Decode Time                 0.00000 s
> Disk  Random 16.0 Read                       1155.78 MB/s          8.9
> Disk  Sequential 64.0 Read                   8193.54 MB/s          9.9
> Disk  Sequential 64.0 Write                  5185.20 MB/s          9.7
> Average Read Time with Sequential Writes     0.048 ms          8.9
> Latency: 95th Percentile                     0.116 ms          8.9
> Latency: Maximum                             0.324 ms          8.9
> Average Read Time with Random Writes         0.046 ms          8.9
> Total Run Time 00:00:02.75
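For anyone wanting to try the virtio-blk attachment without hand-editing the XML, a zvol can be attached as a plain virtio block device with virsh, something like the following (the domain name is a placeholder; I actually set mine up through virt-manager):

virsh attach-disk win11 /dev/zvol/ssd1/windows vda \
    --targetbus virtio --driver qemu --subdriver raw --cache none --persistent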


VIRTIO SCSI
===========

PS C:\WINDOWS\system32> winsat disk -drive H
Windows System Assessment Tool
> Running: Feature Enumeration ''
> Run Time 00:00:00.00
> Running: Storage Assessment '-drive H -ran -read'
> Run Time 00:00:00.11
> Running: Storage Assessment '-drive H -seq -read'
> Run Time 00:00:01.11
> Running: Storage Assessment '-drive H -seq -write'
> Run Time 00:00:00.74
> Running: Storage Assessment '-drive H -flush -seq'
> Run Time 00:00:00.41
> Running: Storage Assessment '-drive H -flush -ran'
> Run Time 00:00:00.39
> Dshow Video Encode Time                      0.00000 s
> Dshow Video Decode Time                      0.00000 s
> Media Foundation Decode Time                 0.00000 s
> Disk  Random 16.0 Read                       1048.02 MB/s          8.9
> Disk  Sequential 64.0 Read                   4305.40 MB/s          9.5
> Disk  Sequential 64.0 Write                  2939.35 MB/s          9.2
> Average Read Time with Sequential Writes     0.078 ms          8.8
> Latency: 95th Percentile                     0.132 ms          8.9
> Latency: Maximum                             1.088 ms          8.9
> Average Read Time with Random Writes         0.083 ms          8.9
> Total Run Time 00:00:02.86
  • Hopefully these results are useful to others; for details of the special devices see Level1
  • NB: special devices cannot be removed from a RAIDZ pool - yet another reason to always use mirrored vdevs
  • During my research I saw on the TrueNAS forum that a SLOG helps with iSCSI performance (I have hit my link limit here)
  • I’m also going to test moving my SLOG from SSD => zram, since my system runs on a UPS (rough sketch below)
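The zram SLOG experiment would look roughly like this (a sketch of the plan; the device name depends on what zramctl hands back, and a zram device does not survive a reboot):

modprobe zram
zramctl --find --size 5G        # allocates the first free device and prints its name, e.g. /dev/zram0
zpool add ssd1 log /dev/zram0   # volatile: after a reboot the pool needs zpool import -m (missing log)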

I love this kind of stuff. It’s exactly what I would be doing if only I had more motivation.

Unfortunately I could follow only about 10% because I can’t deduce your setup.

What’s running on the bare metal?
Windows is a VM on this metal?
Windows had a nvme device passed thru but now it connects to storage via ??? A passed-thru KVM block device? This block device is a zvol?
LVM is sitting underneath your ZFS? You tried a QCOW2 on this ZFS filesystem with LVM underneath it?
How is the nvme device wired-up now?

The longer I look at your post the more I don’t understand. 🙁

For about 4 years I’ve been running a Windows Gaming vm under Arch Linux with a pci passed through nvme + 3.5tb of a 4tb Toshiba Enterprise SATA whole disk passed through via virtio-scsi with 6 or now 8 queues configured. On the Hitman 3 Dubai benchmark I would see 200fps with a RTX 2070 super & no stutters (since I upgraded from a Ryzen 5900x => Ryzen 5950x)

I’m running out of room on the 3.5tb SATA so I bought 2 x 1.92tb SSD (intel d3 s4610 / samsung sm883) + another 4tb Toshiba sata - the plan was a 3tb SSD stripe + 7tb SATA stripe (with 1tb SATA left for Linux) - & using 2 x 100g partitions from each SSD as a special device (not configured yet) - for my main SATA mirror & the new SATA stripe

I partitioned the SSDs to separately try:

  • plain lvm as a partition

  • qcow2 on a zfs dataset

  • plain zvol (which seemed to be the fastest using the builtin winsat benchmark tool)

  • the partitioning for testing:

Partitioning (LVM + ZFS testing):
---------------------------------
SSD = 1920 GB (type 148 = zfs / type 44 = LVM)
----------------------------------------------------
part1 = 20 GB  => ZFS SLOG (Raid0)
part2 = 100 GB => ZFS SPECIAL Device ( 3 way mirror)
----------------------------------------------------
part3 = 270 GB => LVM cache (data & metadata Raid0)
part4 = 699 GB => LVM STRIPE (Raid0)
part5 = 699 GB => ZFS STRIPE (Raid0) / ZVOL & dataset
----------------------------------------------------

Last night I started copying my Steam library from SATA => SSD and saw terrible speeds - I suspect because the zvol was sparse? The Hitman 3 benchmark worsened from 200 fps => 100 fps, with stutters.

So I did some more testing below (TLDR: don’t create a sparse zvol with -s if performance is important)
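A quick way to check whether an existing zvol is sparse is to look at its refreservation (a zvol created with -s has refreservation=none), e.g.:

zfs get volsize,refreservation,used ssd1/windows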

After a bit more testing I’ve got acceptable performance again (200 fps in Hitman 3)


Steam downloads

  • virtio-blk = 60mb avg / 100mb max
  • virtio-scsi (with 8 iothread queues) = the same as virtio-blk, except that after a while 125mb avg / 250mb max WRITE (from the queues?) / 430mb READ (when Hitman 3 was finishing installing)

ZVOL configuration

  • When Windows creates the NTFS partition it does not align the partition start the way Linux tools do by default (sector 2048):
fdisk /dev/zvol/ssd1/windows

Disk /dev/zvol/ssd1/windows: 2.98 TiB, 3276544671744 bytes, 6399501312 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 65536 bytes
I/O size (minimum/optimal): 65536 bytes / 65536 bytes
Disklabel type: gpt
Disk identifier: 95A45633-38E3-4236-A04D-41EE98435A40

Device                  Start        End    Sectors Size Type
/dev/zvol/ssd1/windows1    34      32767      32734  16M Microsoft reserved
/dev/zvol/ssd1/windows2 32768 6399498239 6399465472   3T Microsoft basic data

Partition 1 does not start on physical sector boundary.
  • I created the zvol as a normal device (not sparse):
  • zfs create -o volblocksize=64K -V 2.98TB ssd1/windows (the default option nowadays in ZFS is compression=on, which gives you lz4 compression)
  • I partitioned the zvol manually with a single Microsoft basic data partition (i.e. without the 16 MB Microsoft reserved partition, which I don’t need as I’m not creating dynamic volumes in Windows):
$ fdisk /dev/zvol/ssd1/windows

Disk /dev/zvol/ssd1/windows: 2.98 TiB, 3276544671744 bytes, 6399501312 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 65536 bytes
I/O size (minimum/optimal): 65536 bytes / 65536 bytes
Disklabel type: gpt
Disk identifier: 95A45633-38E3-4236-A04D-41EE98435A40

Device                  Start        End    Sectors Size Type
/dev/zvol/ssd1/windows1  2048 6399500287 6399498240   3T Microsoft basic data
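For reference, the same single aligned partition can also be created non-interactively with sgdisk (not the exact commands I ran; 0700 is the sgdisk type code for Microsoft basic data):

sgdisk --zap-all /dev/zvol/ssd1/windows
sgdisk -n 1:2048:0 -t 1:0700 /dev/zvol/ssd1/windows   # start at sector 2048 (1 MiB aligned), use the rest of the device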

virtio-scsi configuration

  • attached the zvol to virtio-scsi (add a VirtIO SCSI controller in virt-manager)
  • add the disk as type scsi
  • present the zvol to the guest as an SSD (rotation_rate='1'):
<target dev='sda' bus='scsi' rotation_rate='1'/>
  • configure queues on the virtio-scsi
<controller type="scsi" index="0" model="virtio-scsi">
  <driver queues="8" iothread="1"/>
  <address type="pci" domain="0x0000" bus="0x08" slot="0x00" function="0x0"/>
</controller>

iothread configuration

  • my CPU configuration (as it relates to the iothread)
<vcpu placement="static">16</vcpu>
  <iothreads>1</iothreads>
  <cputune>
    <vcpupin vcpu="0" cpuset="8"/>
    <vcpupin vcpu="1" cpuset="24"/>
    <vcpupin vcpu="2" cpuset="9"/>
    <vcpupin vcpu="3" cpuset="25"/>
    <vcpupin vcpu="4" cpuset="10"/>
    <vcpupin vcpu="5" cpuset="26"/>
    <vcpupin vcpu="6" cpuset="11"/>
    <vcpupin vcpu="7" cpuset="27"/>
    <vcpupin vcpu="8" cpuset="12"/>
    <vcpupin vcpu="9" cpuset="28"/>
    <vcpupin vcpu="10" cpuset="13"/>
    <vcpupin vcpu="11" cpuset="29"/>
    <vcpupin vcpu="12" cpuset="14"/>
    <vcpupin vcpu="13" cpuset="30"/>
    <vcpupin vcpu="14" cpuset="15"/>
    <vcpupin vcpu="15" cpuset="31"/>
    <emulatorpin cpuset="0-3"/>
    <iothreadpin iothread="1" cpuset="4,20"/>
    <vcpusched vcpus="0" scheduler="rr" priority="1"/>
    <vcpusched vcpus="1" scheduler="rr" priority="1"/>
    <vcpusched vcpus="2" scheduler="rr" priority="1"/>
    <vcpusched vcpus="3" scheduler="rr" priority="1"/>
    <vcpusched vcpus="4" scheduler="rr" priority="1"/>
    <vcpusched vcpus="5" scheduler="rr" priority="1"/>
    <vcpusched vcpus="6" scheduler="rr" priority="1"/>
    <vcpusched vcpus="7" scheduler="rr" priority="1"/>
    <vcpusched vcpus="8" scheduler="rr" priority="1"/>
    <vcpusched vcpus="9" scheduler="rr" priority="1"/>
    <vcpusched vcpus="10" scheduler="rr" priority="1"/>
    <vcpusched vcpus="11" scheduler="rr" priority="1"/>
    <vcpusched vcpus="12" scheduler="rr" priority="1"/>
    <vcpusched vcpus="13" scheduler="rr" priority="1"/>
    <vcpusched vcpus="14" scheduler="rr" priority="1"/>
    <vcpusched vcpus="15" scheduler="rr" priority="1"/>
    <iothreadsched iothreads="1" scheduler="fifo" priority="98"/>
  </cputune>

  • I experimented with a 5 GB zram device as a shared SLOG (I run on a UPS), but the pool doesn’t import automatically after a reboot because the zram log device is gone - so I am creating an SSD stripe for the SLOG instead (to be used for my main SATA pool & the new SATA stripe)
  • performance is now good enough & I’ve had no problems moving everything to the new striped SSD pool:
VIRTIO-SCSI 8 queues 64K volblocksize
=====================================

PS C:\WINDOWS\system32> winsat disk -drive G
Windows System Assessment Tool
....
> Disk  Random 16.0 Read                       1187.85 MB/s          9.0
> Disk  Sequential 64.0 Read                   8900.57 MB/s          9.9
> Disk  Sequential 64.0 Write                  4473.51 MB/s          9.5
> Average Read Time with Sequential Writes     0.048 ms          8.9
> Latency: 95th Percentile                     0.061 ms          8.9
> Latency: Maximum                             12.453 ms          7.9
> Average Read Time with Random Writes         0.042 ms          8.9
> Total Run Time 00:00:02.69

  • TODO:

    • Move the Windows vm from nvme passthrough => zvol. The zvol performance is good enough / I don’t log in to bare-metal Windows
    • SLOG / special devices on nvme then become possible for the new SSD pool
    • Running Linux as btrfs raid1 / Windows vm as mirrored zvol on nvme also becomes possible

In my tests I noticed that if you let Windows create the NTFS partition, the partition start ends up misaligned.

I also used a volblocksize of 64K (to match the default NTFS cluster size) - but I seem to be getting a stable 97% of bare-metal speeds.

In situations like this the only option is to remove layers & verify each one. If you can prove the zvol is getting 97% of bare-metal speeds, you know the problem is in a higher layer.

512b NVMe block size: ~46k IOPS, ~1700 MB/s bandwidth
4k NVMe block size: ~75k IOPS, ~1800 MB/s bandwidth
  • setting 4K sectors seems like a free 5-10% performance improvement & gives better wear leveling (NB: this is a destructive operation & can be changed with nvme-cli from a live USB)
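The nvme-cli side looks roughly like this (the LBA format index varies per drive, so check the id-ns output first; this wipes the namespace):

# list supported LBA formats and their relative performance
nvme id-ns /dev/nvme0n1 --human-readable
# reformat the namespace to the 4K LBA format (index 1 here is an example - DESTRUCTIVE)
nvme format /dev/nvme0n1 --lbaf=1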

Years ago I had a dtrace script that would sniff out a high number of “torn” reads/writes across cluster boundaries. It made it easy to see what was truly happening across the various layers. I haven’t been able to find it again.

Also, when I was managing NetApps years ago I recall a command-line utility on the filers that would watch a volume for a few seconds then print a simple histogram of I/O sizes and whether or not they were aligned. I suspect something like that would be easy to implement for someone who knows dtrace well.
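These days something similar can be thrown together with bpftrace; a rough sketch (not the old dtrace script) that histograms block-layer I/O sizes and counts requests whose start sector is not 4K-aligned:

bpftrace -e '
tracepoint:block:block_rq_issue {
    // request size distribution
    @size_bytes = hist(args->bytes);
    // sector is in 512-byte units, so anything not divisible by 8 does not start on a 4K boundary
    if (args->sector % 8) { @unaligned_4k = count(); }
}'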