ZFS Performance Question: Proxmox (ZFS (KVM (Docker (PostgreSQL))))

Long story short, I want to run a high-IOPS, PostgreSQL-backed database in a Docker container inside a KVM VM on a ZFS pool of NVMe SSDs. All of this would run on Proxmox. What settings should I choose for high performance? I have been using ZFS for a while and know the basics, but this setup confuses me.

The setup: Proxmox(ZFS(KVM(Docker(PostgreSQL))))

Of course I should run benchmarks, but please give me your best guess. This is what I have so far:

ZFS Settings

  • Mirror VDEVs
  • Set ashift=12 (are SSDs better with ashift=13?)
  • Set recordsize and volblocksize to 16k
  • Set compression=lz4
  • Set atime=off
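
Roughly what I have in mind as commands, in case that helps; the pool, device, and dataset names below are just placeholders:

zpool create -o ashift=12 tank mirror /dev/nvme0n1 /dev/nvme1n1
zfs set compression=lz4 tank
zfs set atime=off tank
zfs create -o recordsize=16K tank/pgvm   # dataset that would hold the VM storage
# if Proxmox ends up using a zvol instead, volblocksize is fixed at creation time:
# zfs create -V 100G -o volblocksize=16K tank/vm-100-disk-0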

VM Settings

  • SCSI controller: VirtIO SCSI Single
  • Enable iothread
  • Enable Discard for VM and within VM
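
And as a sketch of how I would apply that on the Proxmox side (the VM ID 100 and the storage/volume names are just examples I made up):

qm set 100 --scsihw virtio-scsi-single
qm set 100 --scsi0 local-zfs:vm-100-disk-0,iothread=1,discard=on,ssd=1
# plus periodic TRIM inside the guest:
# systemctl enable --now fstrim.timer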

My questions:

  1. Are my settings a reasonable choice for performance?

  2. Would ashift=13 be better for modern SSDs? Please explain.

  3. Is my 16k recordsize a good choice? I know PostgreSQL works best with 16k (please correct me if I am wrong), but what about the VM disk image itself? Will Docker affect my choice? This is the most uncertain thing in my setup for me. Please explain.

  4. What would change if I don’t run a dedicated ZFS pool for my PostgreSQL database VM, but store the VM image directly on the Proxmox root pool (zroot)? Of course, the VM would have to share performance with other VMs. But what else?

  1. Yep, reasonable choices. The only thing I might change would be creating a dataset and putting a sparse raw file on it, rather than letting Proxmox use a zvol (see the sketch after this list). Zvols sound like they should outperform raw files on datasets, but in my experience, they do not. 16K is a good choice, though, for either recordsize or volblocksize.

  2. ashift=13 might or might not be better. Given that you’re working with NVMe, it’s a crapshoot whether ashift=12 even outperforms ashift=9. You really need to benchmark this with fio instead of making assumptions, if you have not already done so with your specific NVMe SSD models. You’ll want to test, ideally, with something like 75/25 write/read 8KiB random I/O, because that’s roughly the storage access pattern you’ll probably see in Postgres.

  3. Yes, again, 16KiB is a good choice for recordsize/volblocksize. Postgres actually does 8KiB pages by default (you can verify that for your own build; see the quick check after this list), but doubling that is usually a win: what you lose in minor read amplification, you gain back in a better compression ratio, and it’s not exactly uncommon to wind up needing the next consecutive 8KiB block on any given read anyway.

  4. Hold up, did you say separate pool? As in, your Proxmox lives on one pool, but you also create a second pool, and only your VM goes on that second pool? Or am I misunderstanding? If I’m not misunderstanding, then yes, using an entire separate pool ensures all IOPS on that pool go to your VM. If I am misunderstanding, you don’t really gain or lose anything, performance-wise, by having other things in the Proxmox root dataset instead of in a child dataset or zvol–but I still wouldn’t do that, because that’s not enough logical isolation for things like “I want to roll back my VM without rolling back my host”. Either way, now that we’re asking weird questions about (potentially) multiple pools… you’re not talking about putting a second pool inside the VM itself, are you?

  5. Docker inside the VM really doesn’t change any of this in any way.
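
To make #1 and #3 a bit more concrete, here are two tiny sketches. First, the dataset-plus-sparse-raw-file approach (the dataset path, image name, and size are made-up placeholders):

zfs create -o recordsize=16K tank/vm-100
qemu-img create -f raw /tank/vm-100/disk0.raw 100G   # stays sparse until blocks are written
# point a Proxmox "Directory" storage at the dataset's mountpoint and attach disk0.raw as a raw disk

And second, checking the page size your Postgres build actually uses (8192 bytes by default; it is fixed at compile time):

psql -c "SHOW block_size;"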


First of all, I would like to express my appreciation for such a great response. This is unfortunately not common on many forums these days. Thank you, thank you, thank you!

I think I understood that. But just to double check, you really mean raw and not qcow2 or something along those lines, right?

Coming from an overclocking gaming background, I understand that benchmarks are the only real thing. But at the same time, I also know that there are bad benchmarks out there that have no real meaning. So what would an optimal fio command look like for my setup?

I know your “How fast are your disks? Find out the open source way, with fio” article, but I am not sure how to turn your recommendation into a concrete command.

fio --name=db-test --ioengine=posixaio --rw=randwrite --bs=8k --size=512m --numjobs=16 --iodepth=16 --runtime=60 --time_based --end_fsync=1

Like this?

For clarification, there are two different options.

  • Option 1: VM image on the Proxmox zroot pool.
  • Option 2: Proxmox host, but two separate NVMe SSDs with PCI passthrough directly to the VM. Then a dedicated ZFS installation (mirror) on these two NVMe SSDs.

Now, you are confusing me. The idea was two different pools, like option 2 in my clarification. But a dataset is not a pool. So just to be clear, using a zvol or a dedicated dataset within the zroot Proxmox pool is fine for rollback of the Postgres VM, right?

Yes, that was my option 2 idea. Is this a good idea or a bad idea?

Good to know, thank you!

Yes, raw files is what I meant, but passing raw drives to your VM rather than either a sparse file or a zvol is likely your best option for maximum performance. Though it does raise questions about why you’re virtualizing at all, instead of running on bare metal.


What about the fio command? Can you please help me here as well?

What would be the best bare metal setup in terms of cloud provider availability? Is there a Linux distribution ISO with an out-of-the-box ZFS setup? The OpenZFS docs for ZFS on root are way too complicated for many cloud providers.

In my experience, without an iso and a simple out-of-the-box installer, many cloud providers don’t really handle/know ZFS well.

This is fine, if sixteen parallel processes are what you want to optimize for, but you probably want to mix random read and random write. 75% write, 25% read is the usual rule of thumb for this.

fio --name=db-test --ioengine=posixaio --rw=randrw --rwmixread=25 --bs=8k --size=512m --numjobs=16 --iodepth=16 --runtime=60 --time_based --end_fsync=1

This is the same command you were using, but changing from pure random write to random read/write with a 25(read)/75(write) ratio.

You might also consider messing with compressibility, if you want to benchmark the impact of that. fio will let you specify the compression ratio of the pseudorandom data it uses, for example adding the arguments:

--buffer_compress_percentage=50 --refill_buffers --buffer_pattern=0xdeadbeef

would give you data that’s roughly 50% compressible. (You might not achieve that much, with smaller recordsizes–which is one reason to test this way. You can also get an idea of the impact of compression in terms of both throughput and latency this way, and it’s not always a given what will or won’t favor a database workload.)


Amazing! Thanks, I was not aware of the --rwmixread=25 option. Thank you so much! I think you answered every open question!

Just out of curiosity, is there a Linux distribution that I could throw at a ZFS-unaware cloud provider to easily install a ZFS root system for me? I know that ZFS is available in the Ubuntu 24.04 desktop installer, but it does not seem to be available for their server version. If I had access to an IPMI module, I could follow the OpenZFS ZFS on Linux installation recommendations, but this is not always the case. So how can I make sure I don’t end up with the typical cloud provider MDADM and LVM setup?

If your cloud provider won’t give you a way to do your own distro install from the ground up–which might be IPMI, or KVMoIP, or console, or literally just ask their techs to do it the way you want them to–you might be out of luck.

With that said, just about every provider I’ve ever used has given me SOME way to accomplish install-time goals.

The other alternative is simply to tell them “one 100GiB partition you feed to mdraid, you make that mdraid drive the root file system, and you don’t touch any of the rest of my disks.” You don’t get ZFS on root, but if you’re doing all your real work inside VMs, that doesn’t really matter much, because there isn’t anything significant to the host itself worth backing up in the first place.

A final thought: if you can get the cloud techs to do a custom install for you but don’t think they’ll follow anything as complex as the zfsbootmenu (btw, this is totally the way to go–not just the openzfs docs, but zfsbootmenu specifically) docs, you might try using Ubuntu and having them use this semi-automated installer that “Sithuk” created: GitHub - Sithuk/ubuntu-server-zfsbootmenu: Ubuntu zfsbootmenu install script

I’ve done a couple zfsbootmenu installs that way, and they worked pretty well with significantly fewer console commands to enter. I wouldn’t specifically recommend that you personally use that rather than the standard zfsbootmenu way via their docs, if you’re capable of the latter… but if your only option is hoping random techs you’re not too sure of can make it through (or if you don’t think you can make it through the standard zfsbootmenu docs), it might be worth a shot.


There’s still a lot, and I mean a lot more you can learn about fio if you’re really interested in getting good at this stuff.

I am not about to walk you through the entire syntax for this, but you might, for example, run all of the following in a single fio session, with reporting on each portion of the job as well as the aggregate:

  • 4K random reads, limited to 5MiB/sec
  • 1M random asynchronous writes, limited to 10MiB/sec
  • 16K random synchronous writes, limited to 2MiB/sec
  • 64K random read/write, 25% read, unlimited speed
  • 128K random read/write, 7% read, unlimited speed

Like, all that can be run, I repeat, not just one after another but in a single fio run! Why would you want to do that? Well, if you really, really understand your workload, you can make fio emulate it extremely closely by managing multiple I/O tasks, throttling some and perhaps not throttling others (to keep, eg, 4K random reads from overwhelming every other task due to IOPS starvation if allowed to run unthrottled), fine tuning data compressibility, controlling how often the writes on a task are synchronous vs asynchronous… you get the idea.
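
To give a flavor of it, here is a minimal job-file sketch of that kind of run (the job names, sizes, and rate caps are purely illustrative, not a recommendation), kicked off with fio mixed.fio:

[global]
ioengine=posixaio
time_based=1
runtime=60
size=1g

# 4K random reads throttled to ~5MiB/sec
[4k-reads]
rw=randread
bs=4k
rate=5m

# 1M random asynchronous writes throttled to ~10MiB/sec
[1m-async-writes]
rw=randwrite
bs=1m
rate=10m

# 16K random synchronous (O_SYNC) writes throttled to ~2MiB/sec
[16k-sync-writes]
rw=randwrite
bs=16k
sync=1
rate=2m

# 64K random read/write, 25% read, unthrottled
[64k-mixed]
rw=randrw
rwmixread=25
bs=64k

# 128K random read/write, 7% read, unthrottled
[128k-mixed]
rw=randrw
rwmixread=7
bs=128k

fio reports each job separately by default, plus a run-status aggregate at the end.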

With advanced use, you’ll also learn that while throughput in MiB/sec is a smexy number everybody thinks they understand, those scary latency numbers are where the real information (and potential pain) generally live. For example, a desktop machine that can support 10GiB/sec of random reads and asynchronous writes, but has 64K latencies spiking up into the milliseconds, will be miserable to use, whereas an otherwise identical desktop machine that can’t ever break 20MiB/sec throughput–but never offers worse 64K latency than a few microseconds–will feel much faster.


Yes, I feel the same way. Unfortunately, some cloud providers aren’t really ready for ZFS yet.

I had the same idea: the typical MDADM + LVM mirror host OS installation, plus two or more disks for the actual data on ZFS. But this solution requires two more disks that I do not really need. Or, looking at it from the other side, an additional mirrored ZFS vdev that could make up for the performance lost due to the split.

I’m not familiar with that Ubuntu installer script. Thanks for the hint, I will check it out. However, I do know ZFSBootMenu (I read the OpenZFS docs) and I like the project too. But as you said, I have the exact same feeling that you cannot expect cloud techs to set up ZFSBootMenu.

So why isn’t ZFS already a one-click solution? Is it licensing? I would understand the lack of ZFS support if Linux had something comparable by default; then ZFS would just be a flavor choice that cloud and enterprise software would not have to specifically support. But what else is there besides ZFS? Complicated LVM + MDADM + LUKS setups or buggy BTRFS. But neither is really comparable to ZFS. I would love it if there were just one ISO that ran the Ubuntu, Debian, or Fedora (at this point I am really not picky) installer, but with a default ZFS mirror setup.

Thanks again for your time and information! I have learned a lot and best of all you have pointed me to good projects to research myself.

I understand the power of fio in terms of actually modeling a realistic workload. Even though I am just a home lab enthusiast and not an enterprise datacenter operator, I will definitely look into it and dig deeper.

As an additional question, do you think good default NVMe support will be a feature of ZFS in the near future? I understand that tunables on enterprise systems like ZFS are necessary and great, but reasonable defaults would be great too. With that, I would expect ZFS to be much more widely adopted.

Not necessarily! Notice I said partitions, not disks. If your provider gives you a pair of 1TB SSDs, you can ask the provider to do an mdraid1 install on 100GiB partitions, leaving the rest of the drives untouched. In my experience, that’s not a difficult ask at any of the cloud providers, one way or another.

So you’ve then got a 100GiB root on mdraid1/ext4, and you can create a 900GiB partition (or whatever) on each of the underlying drives, to then create a 900GiB (or whatever) zpool on a mirrored vdev.
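
To make that concrete, a rough sketch with placeholder device names, partition numbers, and offsets (adjust to whatever layout the installer actually left behind):

# carve the leftover space on each disk into a new partition
parted -s /dev/nvme0n1 mkpart zfsdata 100GiB 100%
parted -s /dev/nvme1n1 mkpart zfsdata 100GiB 100%
# then build the mirrored pool on the two new partitions
zpool create -o ashift=12 tank mirror /dev/nvme0n1p4 /dev/nvme1n1p4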

So why isn’t ZFS already a one-click solution? Is it licensing?

Largely, yes. It took a while before ZFS got the degree of support you’re hoping for even on FreeBSD, where there is no licensing conflict–but FreeBSD has been there for years now, while over on the Linux side only Canonical is even giving it a shot, very much due to licensing. Canonical made the bold decision–which I applaud them for–to say “we think this is legally quite defensible, and if you disagree, we’ll see you in court” and nobody has challenged them in court for the several years they’ve taken that stance. But nobody’s followed them, either, and there are several reasons not to (the one which makes the largest difference to me being that any attempt to sue would necessarily weaken the GPL even if it fails).

Yes. NVMe is still pretty new, and essentially what you’re seeing is that ext4 and other conventional filesystems don’t need as much adjustment to optimize for NVMe, not because of any inherent superiority, but simply because they don’t do as much on the CPU, so they don’t contribute to CPU bottlenecking as much.

OpenZFS does quite a bit more computationally than legacy filesystems do. Before NVMe, that didn’t matter much–the increase in CPU utilization per GiB read or written wasn’t enough to impact storage performance at all, or compute performance much. But decent NVMe devices can easily saturate a CPU thread even with very simple I/O, so the additional computation ZFS does has a significant impact.

There’s quite a bit of work going on right now to optimize ZFS for NVMe, and I think it will get considerably better. It may never be quite as fast as legacy filesystems on fast NVMe, though, for the simple reason that the benefits ZFS brings aren’t entirely free, computationally speaking.


This is an interesting idea! I had not thought of it because I thought ZFS should not be used on partitions, but on the whole disk. However, to be honest, this is a belief for which I have no technical justification. Just something I picked up from the Internet. Are there any valid concerns about this?

Interesting, seems we are back to the days when Linux ISOs did not include proprietary drivers for fear of lawsuits.

That makes sense. I hope ZFS catches up on NVMe soon. I agree with you that ZFS does more than just store files, so it is not a fair comparison to ext4, etc.

Not really. You’d tank performance if you created partitions with improper offsets, but you’d practically have to write your own partitioning tools to even manage that; the standard tools automatically fix this for you, and have for decades.

I’ve deployed hundreds of machines set up exactly as I described, using 100GiB mdraid1 for root and the rest for ZFS.


Thanks! And then you would run a VM that stores its raw image in a dataset on the ZFS side of the drives? Theoretically, it should also be possible to mount the Docker dirs directly, but I would like to have the option to replicate the entire environment (the OS, configs, data).

Correct, I create one or more datasets for each VM beneath that pool. Specifically, each virtual drive gets its own dataset, so they can be independently snapshotted, rolled back, etc.
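
For instance (pool and dataset names here are placeholders), that per-drive layout lets you snapshot and roll back one virtual disk without touching the others:

zfs create -o recordsize=16K tank/vms/pg-disk0
zfs snapshot tank/vms/pg-disk0@pre-upgrade
# and later, if the upgrade goes sideways:
zfs rollback tank/vms/pg-disk0@pre-upgrade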


And how would you handle encryption? My first choice would be native ZFS encryption, even if it leaks some metadata, rather than LUKS encryption inside the VM. What are your thoughts?

Whichever works best for you, really. I personally would probably go ZFS native in order to get the advantage of compression, but that does come with a small amount of additional risk because it’s newer and less well-tested code than LUKS.
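
If you do go the native route, a minimal sketch (the dataset name and key handling are placeholders; a key file works just as well as a passphrase prompt):

zfs create -o encryption=on -o keyformat=passphrase -o keylocation=prompt tank/secure
# after a reboot, load the key and mount before starting the VM:
zfs load-key tank/secure
zfs mount tank/secure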


Thank you for your excellent and detailed answers!
