Is It Worth Using ZVOLs [Performance]

Hello everyone,

Is it viable to use ZVOLs in production environments?

In my virtual lab tests, I’ve observed significantly lower performance compared to datasheet expectations, regardless of pool configuration or block size. While overall tuning does improve performance, it still doesn’t come close to what the datasheets suggest.
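
To illustrate the kind of test I mean, a typical run looked something like the following (fio is only an example tool here, and the pool/zvol names are placeholders):

fio --name=zvoltest --filename=/dev/zvol/tank/testvol --ioengine=posixaio --rw=randwrite --bs=16k --iodepth=32 --numjobs=4 --runtime=60 --time_based --group_reporting

Pointing the same job at a plain file on a dataset (adding --size=20g so fio can create the file) gives a useful baseline for comparison.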

All tests were conducted on FreeBSD 15 with OpenZFS (zfs-2.4.0-rc4-FreeBSD, same version in kmod). From what I’ve researched, this seems to be a widespread issue in OpenZFS, regardless of version or operating system. I’ve also come across mentions that performance was slightly better in older OpenZFS versions.

The threads I’ve reviewed on this topic include:

Up to this point, everything relates to OpenZFS. To broaden the comparison, I decided to test Oracle ZFS on Solaris. I set up a lab using Solaris 11.4 on x86 (although I understand SPARC would be the ideal platform, I unfortunately don’t have access to that hardware). To my surprise, ZVOL performance on Solaris is very similar to what I observed with OpenZFS. This leads me to think the issue may not be specific to the implementation (OpenZFS vs. Oracle ZFS), but rather something inherent to ZVOLs themselves.

Allow me a brief aside: although Solaris is a declining operating system, I think it’s still useful as a point of comparison. I also found it interesting that it supports high availability for ZFS through Oracle Solaris Cluster 4.4, which seems relevant for production environments.

Additionally, Oracle offers a storage solution called ZFS Appliance. I’ve tested its OVA, and it appears to be a highly sophisticated system with LUN support. It’s hard to imagine that a product at that level would suffer from the same ZVOL performance limitations seen in Solaris or OpenZFS.

For this reason, I’d be very interested to hear from anyone who has worked with a real SPARC-based setup and can share their experience especially regarding ZVOL performance.

Thanks in advance.

Zvols have always sucked. They never performed worth a damn, and nobody ever wants to believe me when I tell them that. Welcome to the party, pal! :beer_mug::cowboy_hat_face::+1:

2 Likes

Hi @mercenary_sysadmin

Wow, it seems that if I want to work with blocks and LUNs, it's probably better not to use ZFS, and instead go with more conventional commercial SAN solutions, limiting ZFS to SMB or NFS on plain datasets.

Or just use raw files as the backing store. It's all blocks regardless; zvols would appear to promise a simplification, but if that apparent simplification doesn't result in real performance gains, then it's not really worthwhile.

But using raw files on a dataset gets you the missing performance, and you're still working with blocks: the raw file presents a block (or character) storage device once it's either loopback-attached directly on the host or fed to a container or VM.
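
On FreeBSD, for instance, the loopback attachment is done with mdconfig (paths are just examples):

mdconfig -a -t vnode -f /tank/blocks/disk0.raw    # attaches the file as /dev/md0 (or the next free md unit)
mdconfig -d -u 0                                  # detaches it again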

IMHO, block storage via FreeBSD’s ctld backed by a raw file on ZFS is solid.
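
A minimal sketch of that setup, with a placeholder dataset name and IQN (not a hardened production config):

zfs create tank/blockstore
truncate -s 500G /tank/blockstore/disk0.raw

# /etc/ctl.conf
portal-group pg0 {
    discovery-auth-group no-authentication
    listen 0.0.0.0
}

target iqn.2024-01.org.example:disk0 {
    auth-group no-authentication
    portal-group pg0
    lun 0 {
        path /tank/blockstore/disk0.raw
        size 500G
        blocksize 512
    }
}

# sysrc ctld_enable=YES && service ctld start

The initiator just sees an ordinary iSCSI LUN; whether it's backed by a raw file or a zvol is invisible to it.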

I did find knobs that would improve zvol performance on both Linux and BSD. But I could never match raw file performance on either platform.
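
Roughly the sort of thing I mean, with example names and values rather than a recipe:

# volblocksize has to be chosen at creation time; match it to the workload
zfs create -s -V 200G -o volblocksize=16K tank/vols/testvol
# per-volume properties that are commonly adjusted
zfs set sync=standard logbias=latency primarycache=all tank/vols/testvol
# expose the volume as a plain device node, skipping the GEOM/partition layer (may need a pool re-import to take effect)
zfs set volmode=dev tank/vols/testvol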

1 Like

Why are zvols slower? I don't see a technical reason why this should necessarily be the case.

No idea. I'm reporting from empirical testing spanning decades, not from theory derived from reading the code. I'm not a strong enough low-level programmer to feel confident diving into filesystem source code.

Hi @mercenary_sysadmin

I understand. I could generate a RAW disk with the following command:

truncate -s 100G disk.raw

And expose it directly with the FreeBSD ctld daemon. I do have some questions about managing this disk.raw, though; how to expand it, for example.

I don’t need these disks for VMs, but rather for remote operating systems like Linux or Windows to use file systems such as ReFS or EXT4. Here, I have some doubts about how to manage the block pool and block sizes, for example.

For example, I would have to check whether the record size on the dataset is appropriate, and how stable the system would be with a RAW file several terabytes in size.

Hi @adaptive_chance

I’ve been reading the ctl.conf(5) man page from FreeBSD, and apparently I can directly use a RAW disk as you mentioned, even adjusting the block size, etc. This raises some questions for me, such as whether to adjust the block size in ctld or directly on the dataset that will hold the RAW disk.

Hi @Gerald

You can read about it in the threads I posted initially; it seems to be a problem inherent to the zvol design, though I don't have the expertise to evaluate that claim directly in the ZFS code. You can see how one user improved zvol performance with some tuning, but the results are still lacking, and it doesn't seem like a good idea to rely on them.

1 Like

Okay, reading the truncate manual has answered my question about expanding the RAW disk; please correct me if I'm wrong.

truncate -s +10G disk.raw

I assume the + operator is enough to grow the RAW file; is that how you do it?

I’m going to read up on ctld and build a lab. What I don't know is how I'm going to push network transfers beyond 1 Gbit; any idea how to do performance testing in this situation?

You can expand a raw file simply by truncating it again:

root@box:# truncate -s 1G backingfile.raw

When you need to expand it, you disconnect anything mounted from the raw file (e.g. shut down any attached VM), then truncate again with a larger value:

root@box:# truncate -s 5G backingfile.raw

Then reattach your mount point. You will still need to expand the actual filesystem it's formatted with. For a Windows VM, for example, you'd fire up Computer Management, go to Disk Management, select the existing 1G NTFS partition along with the 4GiB of free space you just added, and click "Extend Volume." In Linux or FreeBSD guests, you'd typically use gdisk to grow the partition, then resize2fs or similar to expand the actual filesystem into the larger partition once the partition itself has grown.
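
For a Linux guest where the grown disk shows up as /dev/sdb with a single ext4 partition, that's roughly (device names are examples):

growpart /dev/sdb 1      # from cloud-utils; or delete/recreate the partition with gdisk, keeping the same start sector
resize2fs /dev/sdb1      # grow the ext4 filesystem into the enlarged partition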

You might also choose simply to add a new partition and filesystem to the expanded .raw, or, instead of expanding the .raw you already have, create an entirely new .raw to attach to your project.

There are essentially no real limitations here.

Raw files are very stable in multiple terabyte sizes, so long as the actual filesystem beneath them is itself stable. This is not going to be a problem with ZFS, so long as the hardware itself is up to standards.

I used it mainly as a VMFS datastore (VMware ESXi). Also had a handful of Windows hosts connecting via the Windows built-in iSCSI initiator.

After some test-and-tune with various ctl.conf settings I landed on:

blocksize 512
option pblocksize 512

…in all scenarios. I don’t recall why I chose 512/512 (512n basically) over 512/4096 (512e). I remember testing thoroughly and I have a copy of my old ctl.conf here but no notes, unfortunately. System was dismantled late last year.
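
For reference, the 512e variant would simply have been:

blocksize 512
option pblocksize 4096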

I briefly tried 4096/4096 (4kn) and it didn't go well. With that configuration I saw bizarre ZFS write allocator behavior. It would favor one mirror vdev over the other, fully saturating that disk pair for 1-2 seconds at a time while mostly ignoring the other vdev. Then it would "stumble" and back off nearly all write activity for another second or so. Then repeat.

Whatever went wrong might be fixed now. I worked with this setup ~1 year ago.

Come to think of it… I've witnessed something similar on my current [Linux] desktop while experimenting with a 16M recordsize. No iSCSI there; these are local SATA SSDs.


Regarding truncate: I seem to recall changing the size in ctl.conf and then restarting (or perhaps reloading?) ctld to make it take effect. I don't think I had to touch the file itself; it resized automagically. IIRC I didn't need truncate at all; I could touch the file into existence and ctld would handle it from then on.
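
In other words (if memory serves), the workflow was something like this, with example path and size:

touch /tank/iscsi/disk0.raw

# in ctl.conf:
lun 0 {
    path /tank/iscsi/disk0.raw
    size 1500G
    blocksize 512
    option pblocksize 512
}

# service ctld reload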

Let me know if you plan to use these ZFS-based iSCSI volumes with VMware ESXi. I have some additional ideas for ctl.conf that might be useful.

I did a bit of testing a while ago on (and am still using) a zvol for a Windows VM.

I used 2 x Intel enterprise SSDs in a stripe with a zvol on top for a Steam library. Performance has been good enough, and I've been using it for about 18 months.

1 Like

Hi @adaptive_chance

Yes, the ctld daemon in FreeBSD has a reload option to avoid interrupting the service, or at least that's what I can see in its rc.d script. My plan is not to use it with VMware ESXi, but if you can share those ideas anyway, I would appreciate it; maybe I can use them for something else in the future. I just want it for storage, as I mentioned: ReFS, ext4…

Based on the comments, I'll rule out zvols in favor of RAW files. I'll have to study how to implement this a bit; for example, I also have doubts about the best way to configure the pool. For instance, with RAW files it might not be as important to build the pool exclusively from mirror vdevs, whereas zvols seemed to require them to gain some extra performance. But anyway, I think this isn't the right thread to discuss that; it would be better in another one.

ESXi is very touchy about a “familiar” volume showing up via a new path or with a new serial number, LUN ID, etc. By familiar I mean ESXi has mounted the volume in the past and recorded the VMFS signature.

ESXi treats these familiar strangers as snapshot LUNs and mounts them read-only out of an abundance of caution. The idea being that someone is performing a data restore from a storage snapshot therefore the “duplicate” volume needs special handling.

You can promote this volume to a normal read-write VMFS datastore (replacing the original) but it requires some CLI work and I’ve had a lot of trouble getting the promotion to reliably “stick” after restarts.
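
The CLI work is roughly the esxcli storage vmfs snapshot family of commands, e.g. (the volume label is an example):

esxcli storage vmfs snapshot list
esxcli storage vmfs snapshot mount -l "datastore1"          # mount as-is, keeping the existing signature
esxcli storage vmfs snapshot resignature -l "datastore1"    # or write a new signature, making ESXi treat it as a new datastore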

Why is this relevant?

In the absence of a declaration in ctl.conf, ctld will assign a volume serial (and possibly other identifiers) along the lines of BSDDISK001, BSDDISK002, BSDDISK003 (I forget the exact appearance). We need these to be stable and predictable, lest ESXi someday regard one as a snapshot after storage moves/adds/changes.


I don’t know exactly what ESXi looks at when detecting snapshot LUNs so I like to set a handful of identifiers to known values:

lun 1 {
     path <pathspec to file/vol>
     option vendor "FreeBSD"
     option product "CTLDISK"
     option revision "1"
     option naa 0x3<15 hex digits>
     serial "<16 hex digits>" #I'd set this to match 'option naa', personally
# ctld might construct device-id automatically using the above -- need to test
     device-id "<vendorID(pad/trunc to 8 chars)><concatenated product+serial>"
     size 1500G
     blocksize 512
     option pblocksize 512
}

I see conflicting information: one source says naa must agree with device-id and the other says naa is separate. Regardless, the above should get someone started.

1 Like

I have tried reproducing the supposed performance benefits of using files on a dataset instead of zvols, but have failed; in fact, my benchmarks showed the opposite.

I created a new zfs dataset named: zfs-3.6TB-mirror/vm-101-qcow2-16k
with the properties recordsize=16K, compression (zstd, inherited), atime=off
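
In other words, roughly (compression was inherited from the pool, so it isn't set explicitly here):

zfs create -o recordsize=16K -o atime=off zfs-3.6TB-mirror/vm-101-qcow2-16k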

I then created an image inside the (empty) images/101 directory in this dataset with: qemu-img create -f qcow2 -o cluster_size=16k /zfs-3.6TB-mirror/vm-101-qcow2-16k/images/101/vm-101-disk-0.qcow2 320G

qemu-img info /zfs-3.6TB-mirror/vm-101-qcow2-16k/images/101/vm-101-disk-0.qcow2

image: /zfs-3.6TB-mirror/vm-101-qcow2-16k/images/101/vm-101-disk-0.qcow2
file format: qcow2
virtual size: 320 GiB (343597383680 bytes)
disk size: 114 GiB
cluster_size: 16384
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    refcount bits: 16
    corrupt: false
    extended l2: false
Child node '/file':
    filename: /zfs-3.6TB-mirror/vm-101-qcow2-16k/images/101/vm-101-disk-0.qcow2
    protocol type: file
    file length: 147 GiB (158240636928 bytes)
    disk size: 114 GiB

I then attached this disk to the VM with: qm set 101 --scsi1 vm-101-qcow2-16k:101/vm-101-disk-0.qcow2

Inside the VM is a Windows Server 2022 install on an NTFS drive with a 4K cluster size (~50% full). I copied the drive with an Acronis True Image 2026 boot ISO to the new empty disk and started the Windows VM from either the zvol or the qcow2 to run CrystalDiskMark.

Performance with the zvol:


Performance with the qcow2:


Did I screw something up (apart from not having a 16K sector size on the NTFS C: drive)?
That is quite a significant reduction in performance where more performance was promised. While writing this up, though, I noticed that the qcow2 reports "zlib" compression. Do I need to try again without zlib in the qcow2, since I would have had duplicate compression? According to the qemu man page, this compression setting has no effect on live guest writes and is only used for maintenance/convert tasks, so I guess it was okay?

Getting a Proxmox VM to work with a qcow2 file was quite an ordeal… this really doesn't seem to be something Proxmox intends you to use. The 'normal' zvol I am using is configured this way:


The SCSI controller is ‘VirtIO SCSI single’

The underlying zpool is a mirror between a Western Digital Ultrastar DC SN655 ( WUS5EA176ESP7E1, U.3 PCIe) and a WD Red SN700 4000GB (M.2 PCIe).

Qcow2 files are not raw files.

truncate -s 1G onegib.raw will produce a 1GiB raw file.

1 Like

Apparently you are using a virtualization system with a format that is not RAW, as my colleague @mercenary_sysadmin mentioned. It would be interesting if you could show the performance difference between a zvol and a RAW file. My requirement is only to use the ctld daemon. I suppose that a virtualization layer, drivers, guest operating systems, etc. can make it more complex to achieve good performance.

1 Like