Tuning ZVOL for UFS Guest

Hi everyone,

I've been playing with ZFS concepts for a while, but I'm gradually moving to more real-world usage of it, so I guess I'd still call myself a beginner.

To set the scene, I'm currently running a proof of concept in my homelab using ZFS as the backing storage for an Ubuntu LXD environment, which allows provisioning KVM-based VMs and full OS containers. The default behaviour when spinning up VMs here is to provision a ZVOL, and I don't believe qcow2 images are a supported option.
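(It's easy to confirm that behaviour: launch any VM image and the root disk shows up as a zvol rather than a disk image file. Something like the below, where the image alias and pool name are just placeholders.)

    lxc launch ubuntu:22.04 testvm --vm
    zfs list -t volume    # root disk appears as something like tank/virtual-machines/testvm.block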

My end goal is to have FreeBSD VMs running on this, so I have been trying to come up with some sound tuning options for that setup.

FreeBSD out of the box supports UFS and ZFS. I'm leaning more toward UFS for the guest filesystem due to concerns about how nested ZFS would perform, but maybe this is unfounded, so input is welcome on that point.

Assuming I run with UFS for the guest, from reading the man pages it seems there are two tunables on the guest side: block size and fragment size. The out-of-the-box defaults are a 4k fragment size and a 32k block size, following the recommended 8:1 block-to-fragment ratio (referencing newfs(8)).
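(For reference, if I were setting those explicitly rather than taking the defaults, I believe it would look like the below; the device name is just a placeholder for the guest's virtio disk.)

    # -b = block size, -f = fragment size; 32768/4096 are the newfs(8) defaults anyway
    newfs -U -b 32768 -f 4096 /dev/vtbd0p2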

For the ZVOL, would I be looking to match the volblocksize to the guest fragment size of 4k or the block size of 32k for best results?

Also, I've read in a couple of places that primarycache=metadata can be beneficial on the host, to let the guest OS handle its own caching. Does this still hold?

My test pool is a simple single-disk NVMe stripe with 4k native blocks, so ashift=12. If all goes well, the plan is to move this to compute nodes with 2 x NVMe mirror vdevs by default.
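(Roughly, the pools look like this; device names are placeholders for my disks.)

    # current proof of concept: single NVMe disk, 4k native sectors
    zpool create -o ashift=12 tank /dev/nvme0n1
    # planned compute nodes: two-disk NVMe mirror vdevs
    zpool create -o ashift=12 tank mirror /dev/nvme0n1 /dev/nvme1n1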

Thanks in advance for any feedback or advice on this one.

For the ZVOL, would I be looking to match the volblocksize to the guest fragment size of 4k or the block size of 32k for best results?

I recommend volblocksize=64K or volblocksize=32K for generic workloads. Essentially, you’re going to be stuck with static tuning instead of dynamic tuning, since you’re using a zvol, so you want to hit something that’s middle-of-the-road. 64K is the most common “generic” VM blocksize for exactly that reason.

You might consider 32K if you are still targeting a generic workload (not something specific, like MySQL) but suspect you’ll be more concerned with latency than throughput. Again, 64K is the more common choice, but 32K might be worth at least considering. (I would not recommend 128K for a generic workload; that’s what ZFS uses by default for recordsize on datasets, and while it’s “good enough” for most use cases, I’d overwhelmingly prefer 64K when there’s no dynamic adjustment available.)
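If you’re creating the zvol by hand, volblocksize is a creation-time property (you can’t change it after the fact), so it’s just something like the below; pool and dataset names are placeholders. If LXD is provisioning the volume for you, I believe the ZFS storage driver exposes the same thing as zfs.blocksize on the volume (or volume.zfs.blocksize as a pool-wide default), but check the docs for your LXD version.

    # volblocksize must be set when the zvol is created; 64K as a middle-of-the-road choice
    zfs create -V 32G -o volblocksize=64K tank/vms/freebsd0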

Also, I've read in a couple of places that primarycache=metadata can be beneficial on the host, to let the guest OS handle its own caching. Does this still hold?

There are two schools of thought on this: either you give the guest a bunch of extra RAM and expect it to do its own filesystem caching, or you give the guest just barely enough RAM to run its apps, and you handle its filesystem caching from the host level. I generally prefer host-based caching, but here are the pros and cons:

Guest-based caching: since the cache is in the guest’s own RAM, you don’t need a context switch from guest to host in order to fulfill requests from cache. This makes cache hits lower latency.

Host-based caching: since you handle caching at the host level instead of inside the guests, it’s persistent across guest reboots, which can make restarting after a Windows Update or what have you enormously faster than it would be with guest-based caching. Also, the host has a much better idea than any one guest does of which guests need caching the most, and will adapt accordingly. So instead of, e.g., having 1GiB of cache for each of four guests when only one guest has any significant disk activity, a host-based cache system would concentrate the majority of that 4GiB of cache RAM on the one active guest that can actually make good use of it.

Personally, I almost always go for host-based caching. Any cache hit is a huge win even if it is slowed down by context switching, and you get far more cache hits from host-based than guest-based caching when you’ve got multiple guests to choose from. So I’d rather have lots more hits than a smaller number of hits that return a bit quicker.
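To make that concrete: host-based caching just means leaving the default primarycache=all in place and keeping guest RAM modest, while the guest-based approach is where you’d flip the property and hand the RAM to the guest instead. A rough sketch, with placeholder dataset and instance names:

    # host-based caching (my preference): leave the default alone, keep guest RAM modest
    zfs set primarycache=all tank/virtual-machines/freebsd0.block    # 'all' is the default
    lxc config set freebsd0 limits.memory 2GiB

    # guest-based caching: host caches metadata only, guest gets the extra RAM
    zfs set primarycache=metadata tank/virtual-machines/freebsd0.block
    lxc config set freebsd0 limits.memory 8GiB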