Proxmox's VolBlockSize Default is 16K for QEMU VM Disks. Why?

Hello,

I apologize if this has been asked before, but I couldn’t find it with search.

Proxmox’s default volblocksize for ZFS pools meant to store VM disks is 16k. (It used to be 8k, but they decided to double it at some point because 8k was judged to be a poor default.)
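
For context, this is where that default seems to live on a PVE host ("local-zfs" is the stock ZFS storage name on my install; the blocksize line only shows up in the config once it has been set explicitly):

# check the storage definition that holds the default block size for new VM disks
grep -A5 '^zfspool: local-zfs' /etc/pve/storage.cfg
# I believe the CLI equivalent for changing it is:
pvesm set local-zfs --blocksize 16k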

This has confused me for years. Proxmox uses QEMU/KVM, and my understanding from discussions here and reading elsewhere is that the recommended general-purpose/starting point volblocksize for those is 64k.

In fact, back when Proxmox’s default volblocksize was 8k, I always set mine to 64k (and still do).

I’m not afraid I’m doing something wrong here; my VMs are fast enough so far. But I’m genuinely confused by the disconnect between the PVE default (16k) and the general QEMU/KVM recommendation for ZFS … literally everywhere else (64k).

If it matters, new PVE virtual disks default to VirtIO SCSI. I use thin provisioning for my storage and store my VM disks on a mirror pool.
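
In case it’s useful, this is how I’d check one of those disks (vm-100-disk-0 is just a placeholder name, not necessarily one of mine):

zfs get -o property,value volblocksize,volsize,refreservation,compression rpool/data/vm-100-disk-0
# refreservation = none is, as I understand it, what "thin provisioning" amounts to for a zvol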

What am I missing?

1 Like

The dev team there just plain doesn’t believe that they have any performance issues to address.

I had proxmox users complaining for years about “ZFS being too slow” and showing me horrible benchmarks. I’d tell them “it’s proxmox, not ZFS,” put together a similarly constructed pool on Ubuntu or FreeBSD, run the same benchmark, and show those folks tremendously faster results.

After years of that, I finally set up a proxmox system myself to see what the hell was going on. I discovered that it was using ZVOLs with VBS=8K, went A-HA, created a sparse file in a dataset with recordsize=64K on the same pool, ran the benchmarks and got the results I would normally expect.
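
If anyone wants to reproduce that comparison, the shape of it was roughly this (pool and path names are placeholders, not the actual test rig):

# a zvol the way proxmox was creating them at the time:
zfs create -s -V 100G -o volblocksize=8K tank/vm-test-zvol
# versus a sparse raw file in a dataset tuned for VM images:
zfs create -o recordsize=64K tank/vmimages
truncate -s 100G /tank/vmimages/vm-test.raw
# then run the same benchmark against each backing store in turn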

I shared this with my audience on Twitter. They immediately asked me to file a bug with proxmox, so I did.

Like I said: the devs don’t believe they have a problem to solve.

2 Likes

Liberating the Twitter thread in question, since it can no longer be viewed in a standard browser without a logged in “X” account:

[screenshots of the Twitter thread; not transcribed here]

Sorry about the thread now being trapped in jpegs, visually struggling folks. I’m doing the best I can, but as free as the information WANTS to be, people keep on trying to freaking trap it. :enraged_face:

3 Likes

Thanks for taking the time to explain what’s going on–and for also reliving potentially unpleasant memories of your past interactions with the Proxmox devs … and Twitter people. (Aside: It’s amazing how much less useful The Former Twitter is for finding information now that I have to sign in to see most of it.)

Aside: I’ve started to put more effort into getting data I care about out of the virtual disk images I’m using. I started out experimenting, then at some point I was actually using the VMs to do things, and suddenly there was production data in there. Oops. Real life keeps intervening, but my first priority is to finally, actually get my database storage onto an actual dataset in TrueNAS.

I really want to move them to datasets on TrueNAS so I can manage snapshots and everything else there, and because recordsize is so much easier to deal with (and has far fewer surprises) than volblocksize.

So, my end goal is to have VMs with relatively small zVol-based boot disks that use NFS, iSCSI, and SMB to pull their data from a properly configured ZFS-based storage server.
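
Something like this is what I have in mind on the storage side (dataset names and recordsizes are just my working guesses, not settled choices):

zfs create -o recordsize=1M tank/media        # big sequential files served over SMB/NFS
zfs create -o recordsize=16K tank/databases   # closer to typical database page sizes
zfs create -o recordsize=128K tank/general    # the default is fine for mixed small-file use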

I’m going to have to read your messages and the twitter thread again a few more times to fully get my head around it, but I think it’s possible that someone at Proxmox quietly decided you were right and didn’t want to admit it out loud.

Maybe that contributed to them moving to 16k for the default volblocksize sometime late in PVE 7 or early in PVE 8. There wasn’t really an explanation of why except that it would improve performance. Quite a few forum users were pleased at the change, though the consensus seemed to be that it was a better default, but not a good one.

The docs at Proxmox VE Administration Guide have this to say about image formats:

Image Format

On each controller you attach a number of emulated hard disks, which are backed by a file or a block device residing in the configured storage. The choice of a storage type will determine the format of the hard disk image. Storages which present block devices (LVM, ZFS, Ceph) will require the raw disk image format, whereas files based storages (Ext4, NFS, CIFS, GlusterFS) will let you to choose either the raw disk image format or the QEMU image format.

  • the QEMU image format is a copy on write format which allows snapshots, and thin provisioning of the disk image.
  • the raw disk image is a bit-to-bit image of a hard disk, similar to what you would get when executing the dd command on a block device in Linux. This format does not support thin provisioning or snapshots by itself, requiring cooperation from the storage layer for these tasks. It may, however, be up to 10% faster than the QEMU image format. [35]
  • the VMware image format only makes sense if you intend to import/export the disk image to other hypervisors.

My VM storage is a ZFS pool, so the only option I’m allowed to pick is raw. (If I were doing storage on NFS, I think I’d get to choose between raw and qcow2.)

I’ve always understood that a raw “disk image” on ZFS storage means a zVol, so I’ve focused on setting volblocksize to 64k. Before I read your messages above, I didn’t even realize that storing raw VM files directly on ZFS datasets was a possibility. I don’t think I’ve ever seen official documentation about that.
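
For anyone else who was as confused as I was, this is how I eventually convinced myself which of my disks are zvols (as opposed to plain files sitting in a dataset):

zfs list -t volume -o name,volblocksize,volsize,used
# raw files stored in a dataset would instead show up as ordinary files under the dataset's mountpoint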

I discovered that it was using ZVOLs with VBS=8K, went A-HA, created a sparse file in a dataset with recordsize=64K on the same pool, ran the benchmarks and got the results I would normally expect.

I think that’s what I’m doing now? The recordsize on my pools is still the default 128k, because I never knew what I’d even change it to, or whether it would be worth it.

I added a dataset to my rpool called vmStore64k and imported it into Proxmox as a ZFS storage “pool” (yes, this is weird; Proxmox exposes the dataset as if it were a pool in the PVE interface and lets me set a default volblocksize for any virtual disk stored there). I set that default to 64k and tried to forget about the whole thing, since I’d already spent weeks flailing at how it all fit together, thanks to Proxmox’s strange defaults and terminology.
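
If I remember right, the CLI equivalent of what I clicked through in the GUI was roughly this (treat it as a sketch; I did the real thing in the web UI):

zfs create rpool/data/vmStore64k
pvesm add zfspool vmStore64k --pool rpool/data/vmStore64k --blocksize 64k --sparse 1 --content images,rootdir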

All my VM virtual disks look like this when I poke at them with ZFS commands:

root@andromeda2:~# zfs get volblocksize | grep -i '64k'
rpool/data/vmStore64k/vm-99001-disk-2                                           volblocksize  64K       -

Even though some of your test results were better with different/smaller sizes, is 64k still the sanest/least-bad volblocksize to use for generic Linux/Windows/BSD VMs? I think it is, but I still need to re-read your messages a half-dozen times or so. :slight_smile:

I’m putting aside for the moment how Linux guests treat the virtual disks by default. They’re recognized as SSDs with a 512-byte sector size, and I always end up going with ext4 because it’s simple and I have no reason not to. I have no idea whether it would be worth it (or even possible) to change the sector size the guest sees, but I suspect that’s diminishing-returns territory for home/SOHO server use anyway. I hope.
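
For the record, this is how I checked what the guests think they’re talking to (sda is whatever the virtual disk shows up as in your guest):

lsblk -o NAME,LOG-SEC,PHY-SEC,ROTA /dev/sda   # ROTA=0 means the guest treats it as an SSD
cat /sys/block/sda/queue/logical_block_size   # 512 on mine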

I honestly got so exhausted trying to understand volblocksize for zVols/virtual disks and what counted as a good-enough default setting that I never got around to trying to optimize recordsize for LXCs, or even figuring out if I should bother. All my LXCs are stored in datasets using recordsize = 128k.

My understanding is that proxmox simply does whatever ZFS does when you don’t specify a volblocksize directly–and that default in OpenZFS itself is what changed, with proxmox merely following suit.

I do not know if that is correct; I never followed up or investigated. But I believe it was Allan who told me that the default VBS had changed in OpenZFS at about the same time it changed for proxmox, and that was the likeliest explanation.
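
A quick way to check what default a given OpenZFS build actually uses, if anyone wants to verify (tank/defaulttest is a throwaway name):

zfs create -s -V 1G tank/defaulttest
zfs get volblocksize tank/defaulttest   # 8K on older releases, 16K on current ones, as I understand it
zfs destroy tank/defaulttest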

1 Like

If it’s not the best, it’s close. This will vary a bit depending on what you’re really doing in that “generic VM” but 64K is the default recordsize I reach for in my own VMs, and I would expect either 64K or 32K to be the best choice for “generic” VM volblocksize also.

What you’re looking for is a good midway point that doesn’t penalize you too badly at either 4K or 1M random I/O. That’s what the zfs devs were already going for with the default recordsize of 128K, although many modern ZFS senior folks (including me and Allan both) tend to think 64K would have been / would be a better default. It’s certainly a big improvement for VM back ends specifically!
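
If you want to sanity-check your own choice, something like this is the comparison I’d run against the actual backing store (file path and sizes are placeholders; adjust to taste):

fio --name=rand4k --filename=/tank/vmimages/fio.dat --size=4G --rw=randrw --bs=4k \
    --ioengine=libaio --iodepth=16 --runtime=60 --time_based --group_reporting
fio --name=rand1m --filename=/tank/vmimages/fio.dat --size=4G --rw=randrw --bs=1M \
    --ioengine=libaio --iodepth=16 --runtime=60 --time_based --group_reporting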

2 Likes