PSA: RAIDz2 + Proxmox + efficiency + performance

Hi folks. This technically applies to everybody, but Proxmox creates the perfect storm where it keeps biting its users in the ass, so I'm going to talk about it here:

I keep seeing Proxmox users wondering why they don’t see the storage efficiency they expect out of RAIDz2 pools. You’re never going to get the storage efficiency you’re expecting from striped-parity topologies if you run Proxmox with default settings! Here’s why:

  1. Proxmox uses ZVOLs
  2. Proxmox defaults to VERY small volblocksize (8K until the most recent version, 16K now)
  3. When storing a block on a RAIDz vdev, OpenZFS divvies it up into equally sized pieces. On a zvol, those blocks are always volblocksize in size–which, remember, was 8K until the most recent version of Proxmox, and is 16K now.
  4. The majority of drives these days use 4KiB sectors, and you can’t have an individual piece of a block smaller than one sector

With me so far? Okay, so what happens when you ask Proxmox to store an 8K block on a six-wide RAIDz2 vdev? Well, it splits that 8KiB block into two 4KiB sectors, which it stores on drives 0 and 1… and then it creates two sectors of parity for that stripe, which it puts on disks 2 and 3, adding up to four sectors needed to store two sectors of data. Disks 4 and 5 aren’t written to at all for this block!

In other words: 50% storage efficiency, which is literally what you'd expect from a pool of mirrors, but with much lower performance.

With a volblocksize of 16KiB, you can achieve the 67% storage efficiency you'd expect out of a six-wide Z2, because one 16KiB block divvies up into four 4KiB data sectors plus two 4KiB parity sectors, so 4/6 == 2/3 == the 67% you were expecting… but it still won't get you the 80% you'd expect out of a 10-wide Z2, because a 16KiB block still only produces four data sectors and two parity sectors no matter how wide the vdev is.
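If it helps to see the arithmetic in one place, here's a minimal Python sketch of the simplified model used in this post. It assumes fixed 4KiB sectors, ignores compression and RAIDz allocation padding, and all the names are made up for illustration, so treat the output as back-of-the-envelope numbers rather than exact on-disk accounting:

```python
# Back-of-the-envelope RAIDz storage efficiency for a single block, following
# the simplified model in this post: fixed 4 KiB sectors, no compression, and
# no allocation padding. Names are illustrative only.
SECTOR = 4096  # 4 KiB physical sectors (ashift=12)

def raidz_efficiency(volblocksize, width, parity):
    """Fraction of written sectors that actually hold data for one block."""
    data_sectors = max(1, volblocksize // SECTOR)
    data_drives = width - parity
    # Each stripe holds up to `data_drives` data sectors plus `parity` parity sectors.
    stripes = -(-data_sectors // data_drives)  # ceiling division
    parity_sectors = stripes * parity
    return data_sectors / (data_sectors + parity_sectors)

for vbs, width in [(8 * 1024, 6), (16 * 1024, 6), (16 * 1024, 10), (32 * 1024, 10)]:
    eff = raidz_efficiency(vbs, width, parity=2)
    print(f"{vbs // 1024:>3}K volblocksize on {width}-wide Z2: {eff:.0%}")
# ->  8K on 6-wide: 50%, 16K on 6-wide: 67%, 16K on 10-wide: 67%, 32K on 10-wide: 80%
```

Running it reproduces the 50% and 67% figures above, and shows that (under this simplified model) a 10-wide Z2 only reaches its nominal 80% once blocks are at least 32KiB.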

Keep in mind, you also see performance problems with these poorly chosen topologies. A six-wide Z2 writing 8KiB blocks to a zvol is doing essentially the same workload as a simple pool of mirrors would, but much less optimally, resulting in far lower performance. (It does still enjoy dual parity, but you'd be better off just using a 4-wide Z2 if you wanted the dual parity and are willing to tolerate 50% SE.)

And this is without going into the performance advantages the pool of mirrors has over a properly configured and optimally-used RAIDz pool with the same number of drives! If you’re setting up for VMs, I strongly recommend using mirrors instead of RAIDz–even if you’re using SSDs, but especially on rust. You just don’t have the IOPS to spare.

6 Likes

Incidentally, if anybody is wondering why you get much better performance out of a six-wide pool of mirrors than a six-wide Z2 running an 8K zvol, it's because the mirror vdevs are independent: a read served by vdev 0 never ties up disks in vdev 1, so the pool never binds on disks 2, 3, 4, or 5 while waiting to produce data from disks 0 or 1. The Z2, by contrast, spreads every block across several disks, so it will sometimes need the same disk for two separate blocks.

On top of that, the Z2 still has to read from two disks before it can produce 8KiB of data, whereas the mirror only has to read from one. That means lower latency per IOP for the mirror, and it also means double the total read IOPS, since two blocks coming from vdev 0 may be served simultaneously (one from drive 0, and the other from drive 1).

All of this adds up to a MAJOR difference in performance that you can feel in the seat of your pants, let alone with benchmarks, when you move from RAIDz to mirrors.

Does this mean there is no place for RAIDz? No, of course not. There are quite a few workloads that can tolerate lower performance and benefit from higher storage efficiency (if properly architected) and dual redundancy all in one package… but you should know what you're giving up when you choose that architecture, so that you're selecting it for the right workloads when you do choose it.

5 Likes

It's more that Proxmox could use ZVOLs; you're free to decide for yourself. These are also interesting to read again on the subject:
https://jrs-s.net/2018/03/13/zvol-vs-qcow2-with-kvm/
https://jrs-s.net/2016/06/16/psa-snapshots-are-better-than-zvols/
https://jrs-s.net/2017/03/15/zfs-clones-probably-not-what-you-really-want/
I (/we) prefer NFS storage for our Proxmox nodes' VMs and containers, and going that route doesn't mean you can't still use ZFS on the storage side (with all the recordsize, snapshot, etc. goodness). :slight_smile:

Should Proxmox use ZVOLs?

Also if drives are using 4K sectors, what block size should we use for mirror arrays?

Does this 6-disk RAIDZ2 problem still exist if we don’t use ZVOLs?

Should Proxmox use ZVOLs?

No, I really don't think it should–zvols seem like the ideal answer for VM storage on paper. But when you test them, their performance is disappointing as hell.

Also if drives are using 4K sectors, what block size should we use for mirror arrays?

Depends on your workload. I’d typically recommend 64K for general-purpose workloads with flat-file (raw or qcow2) storage; for zvols I’m a little less confident, but I expect if 64K isn’t the best answer, 32K will be.

For specialty workloads, you may want to go smaller; for example, MySQL should generally have a 16K volblocksize. For massive workloads, you want to do advanced tuning with multiple virtual drives, e.g. a MySQL server with its root filesystem on 64K blocksize, a separate virtual drive with 16K volblocksize for the InnoDB stores, and possibly even a THIRD virtual drive with a very large volblocksize (maybe 256K or 512K) for the streaming logs.

Does this 6-disk RAIDZ2 problem still exist if we don’t use ZVOLs?

This largely depends on whether you're still using very small block sizes. With 4KiB sectors, you still can't split a 16KiB block into more than four pieces (each individually very small, which means terrible performance), or an 8KiB block into more than two.

Even if all of your data can be written cleanly to the six-drive Z2 (for example, a streaming workload going onto a zvol with volblocksize=256K, so you've got 64K of each block per data drive), the same number of drives in mirrors will absolutely smoke the Z2 performance-wise.
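To put quick numbers on that piece-count limit and the per-drive chunk size, here's another tiny sketch under the same simplifying assumptions as before (4KiB sectors, no allocation padding; the helper name is purely illustrative):

```python
# How many pieces a single block can split into on a RAIDz vdev, and how big
# each per-drive data chunk ends up, under the simplified model in this thread.
SECTOR = 4096  # 4 KiB sectors

def split_block(block_bytes, width, parity):
    data_drives = width - parity
    sectors = block_bytes // SECTOR
    pieces = min(sectors, data_drives)    # can't split finer than one sector per piece
    return pieces, block_bytes // pieces  # (number of data pieces, bytes per piece)

for kib in (8, 16, 256):
    pieces, chunk = split_block(kib * 1024, width=6, parity=2)
    print(f"{kib:>3}K block on 6-wide Z2: {pieces} data pieces of {chunk // 1024}K each")
# ->  8K: 2 pieces of 4K, 16K: 4 pieces of 4K, 256K: 4 pieces of 64K
```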

There are a very, VERY few workloads that allow the total disk count in a Z2 vdev to produce significantly increased performance compared to a single drive, and a massive array of workloads that produce worse performance in a single Z2 vdev than they would in a single disk of the same type.

This is because performance generally scales with IOPS, and the IOPS of a Z1/Z2/Z3 vdev–or of conventional RAID5/RAID6–are actually less than the IOPS of a single disk, because the array binds on the slowest disk to complete any given operation, even if it’s not the same disk with every op. By contrast, a mirror vdev has nearly the same write IOPS as a single drive and nearly double the read IOPS of a single drive.

If you (over)simplify to get a general idea, you come out with theoretical IOPS for a six-wide Z2 of roughly 1.3n read / 1n write (where n is the IOPS of a single drive), compared to 6n read / 3n write for three two-wide mirror vdevs.
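For anyone who wants to poke at that oversimplification directly, here's a minimal sketch of it. The RAIDz read/write factors are just the rough rules of thumb quoted above, not measurements, and all of the names are made up for illustration:

```python
# Toy model of the oversimplified IOPS comparison above, with n = the IOPS of
# a single drive. The RAIDz factors (~1.3n read, 1n write per vdev) are the
# rough figures from this thread, not benchmark results.
def mirror_pool_iops(total_disks, mirror_width=2, n=1.0):
    vdevs = total_disks // mirror_width
    reads = n * total_disks   # every disk in every mirror can serve reads independently
    writes = n * vdevs        # each write must hit every disk in one mirror vdev
    return reads, writes

def raidz_pool_iops(vdevs=1, n=1.0, read_factor=1.3, write_factor=1.0):
    # A RAIDz vdev binds on its slowest member for each op, so it behaves
    # roughly like a single (slightly faster for reads) drive.
    return n * read_factor * vdevs, n * write_factor * vdevs

print(mirror_pool_iops(6))  # (6.0, 3.0) -> 6n read / 3n write for three 2-wide mirrors
print(raidz_pool_iops(1))   # (1.3, 1.0) -> ~1.3n read / 1n write for one six-wide Z2
```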

With that said, not everybody needs the performance of mirrors: if you're mostly pulling storage across a 1Gbps LAN, you are probably bottlenecking on the network, not the storage, for a lot of workloads. But if you do need the performance, the difference between six disks in a single Z2 vs six disks in 3 mirrors is not subtle; it's a very obvious kick in the pants that you don't need benchmarking utilities and careful measurement to see.

Thank you for the information.

1 Like