Oh, this is gonna be a fun one. Apologies, but let me paraphrase your issues for brevity and clarity:
I have a Proxmox pool of three four-wide RAIDz1 vdevs. Whenever my VM is running, all twelve drives chatter audibly at least once per second, all day long. How come, and can I fix that?
OK, first off: your VM is probably just trickling a stream of writes to disk, for example by way of logging. And that stream of writes is getting sync’d once per second to the underlying metal.
But, why is it happening so often, and is there anything to be done about it? Yes, no, and “maybe not, but there are other problems here that you won’t be happy about,” so let’s look at those.
So, first up: this is a mild problem, and one that most people these days will tell you isn’t really a problem at all, even when they’re aware of it. When you set up a four-wide Z1, what you’ve done is create a vdev that needs padding for every single write.
Every block (aka “record” for datasets or “volblock” for zvols) written to a RAIDz vdev must be split up evenly amongst n-p drives, where n is the number of drives in the vdev and p is the parity level.
If you’ve got a four-wide Z1, this means blocks must be split into thirds–and that won’t divide evenly, because recordsize and volblocksize alike are always a power of two. So, every single write needs padding in order to make it come out evenly, wasting both space and performance.
OpenZFS founding developer Matt Ahrens famously–and correctly, so far as it goes–argues that this isn’t as big a deal as most people make it out to be, since OpenZFS already ends up staggering things unevenly due to compression (which changes the number of sectors necessary to store a block) and metadata blocks (which generally only require a single sector, plus parity or redundancy as appropriate). So while there can be an impact, it’s not often big enough to care much about.
I generally argue the other side of this: yes, compression and metadata blocks make “perfect” vdev widths less important than they might naively be considered to be… which is not the same thing as saying it makes no difference whatsoever, and it’s generally not that hard to build a pool out of ideal-width vdevs anyway, so why be wasteful?
But the impact is rather worse here, because of how you’re using your pool. Let’s continue.
- Proxmox stores virtual machines in ZVOLs with volblocksize=8k
Matt’s advice is best applied to reasonably compressible files stored in datasets, which is not how Proxmox works. Your virtual machine is stored in a ZVOL, not a dataset–the individual block sizes are not dynamic. Compression can still cause blocks to require fewer sectors to write than they normally would, which does still mean even “ideal width” vdevs will often come out “uneven” in terms of on-disk layout… but there’s a caveat there as well, thanks to another Proxmox design decision.
Proxmox, by default, uses volblocksize=8K on all of your VMs unless you force it to do otherwise at zvol creation time. You didn’t specify what type of rust drive you’re using, or whether you manually specified ashift, but the odds are overwhelmingly good that you’ve got drives with 4KiB sector size, and therefore ashift=12.
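If you’d rather check than guess, it only takes a minute. The pool and zvol names below are placeholders for whatever yours are actually called:

```sh
# Placeholders: "rpool" and vm-100-disk-0 stand in for your real pool / zvol names.
zfs get volblocksize rpool/data/vm-100-disk-0   # what Proxmox actually created
zdb -C rpool | grep ashift                      # per-vdev ashift from the cached config
lsblk -o NAME,PHY-SEC,LOG-SEC                   # sector sizes the drives report
```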
So: every single write your VM makes will only be two sectors wide, at maximum. And they won’t often be less than two sectors wide, because compression can’t use “partial sectors”–so if your block can compress to <=50% of its logical size, then you can store it in one data sector (plus parity), otherwise it’ll be stored in two data sectors (plus parity). But you won’t get much in the way of compression ratios, because every single block that “only” compresses to 51% of its original size will still require two sectors plus parity.
This has several unfortunate implications. First, you aren’t getting the storage efficiency you think you’re getting out of those four-wide Z1s, because with 8KiB blocks you never, ever, ever actually manage to perform a full stripe write. Instead, every 8KiB write gets split into two 4KiB data sectors (not three!) plus a single 4KiB parity sector to match–and RAIDz then pads the allocation out to an even number of sectors, so each 8KiB block actually occupies four 4KiB sectors on disk. That’s 50% storage efficiency, not the 75% you were planning on.
This means you’re both wasting drive space (compared to your original plan) and generating a lot more individual drive writes (with corresponding platter chatter): every write is non-optimal, you get very little compression, and you’re committing everything in teeny tiny bite-sized chunks.
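If you’d like to see that overhead with your own eyes before believing a word of it, you can reproduce it in a throwaway sandbox built from file-backed vdevs. Nothing below touches your real pool, and every name and path in it is made up:

```sh
# Four 2GiB sparse files standing in for disks, assembled into a 4-wide Z1 at ashift=12.
truncate -s 2G /tmp/zd0 /tmp/zd1 /tmp/zd2 /tmp/zd3
zpool create -o ashift=12 padtest raidz1 /tmp/zd0 /tmp/zd1 /tmp/zd2 /tmp/zd3

# An 8K-volblocksize zvol, compression off, filled with incompressible data.
zfs create -V 1G -o volblocksize=8k -o compression=off padtest/vol8k
dd if=/dev/urandom of=/dev/zvol/padtest/vol8k bs=1M count=1024 conv=fsync

# Compare the 1GiB you just wrote with what the pool actually allocated, parity
# and padding included. On this layout, expect raw ALLOC to land near 2GiB --
# nowhere near the ~1.33GiB a genuinely 75%-efficient layout would use.
zpool list -v padtest
zfs get logicalreferenced,referenced padtest/vol8k

# Tear the sandbox down when you're done.
zpool destroy padtest && rm /tmp/zd0 /tmp/zd1 /tmp/zd2 /tmp/zd3
```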
- Well, shit. Now what? How can I make this less worse?
First of all, I would strongly recommend rethinking your pool layout. If you put those same 12 disks into four three-wide Z1 vdevs instead of three four-wide ones, you’ll get more efficient reads and writes, because now you’ll actually be writing three-wide stripes to three-wide vdevs, and you’ll be able to handle roughly a third more writes per second, since they’re essentially the same three-wide writes, just spread across four vdevs instead of three.
Normally, the argument against this would be “but I want 75% storage efficiency instead of 67% storage efficiency”–but this is Proxmox, so you’re working with zvols, not datasets, and with tiny 8KiB blocks. Between the two-sector data writes and the padding, those four-wide vdevs were only ever giving you roughly 50% storage efficiency anyway; you just didn’t realize it, and narrowing the vdevs costs you nothing you were actually getting.
Narrowing your vdevs from four-wide to three-wide also makes the pool more rugged: each of those vdevs can only survive a single disk failure, and now each one has only three disks that can fail instead of four. Win, win, win.
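For reference, the layout I’m suggesting gets built something like this. Treat it as a sketch rather than a copy-paste command: the device names are placeholders, ashift=12 assumes 4KiB-sector drives, and since you can’t reshape existing vdevs from four-wide to three-wide in place, getting here means backing up, destroying the old pool, and restoring.

```sh
# Twelve disks as four three-wide RAIDz1 vdevs instead of three four-wide ones.
zpool create -o ashift=12 tank \
  raidz1 /dev/disk/by-id/ata-DISK1  /dev/disk/by-id/ata-DISK2  /dev/disk/by-id/ata-DISK3 \
  raidz1 /dev/disk/by-id/ata-DISK4  /dev/disk/by-id/ata-DISK5  /dev/disk/by-id/ata-DISK6 \
  raidz1 /dev/disk/by-id/ata-DISK7  /dev/disk/by-id/ata-DISK8  /dev/disk/by-id/ata-DISK9 \
  raidz1 /dev/disk/by-id/ata-DISK10 /dev/disk/by-id/ata-DISK11 /dev/disk/by-id/ata-DISK12
```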
- That doesn’t sound like it will make my drives chatter less, though…?
No, probably not. And you may never be entirely happy with this, if you don’t like hearing your drives chatter. But you can do a couple of things to minimize it.
Proxmox’s default 8K volblocksize is, frankly, wildly inappropriate for nearly all of the use cases Proxmox is commonly put to. It would make sense for a dedicated PostgreSQL database engine, because PostgreSQL uses an 8KiB page size in its db storage–so you’d be matching like to like, for the optimal read/write pattern for that workload.
But it’s a much smaller than optimal blocksize for nearly everything else–even MySQL and MSSQL databases generally shouldn’t have a volblocksize that low, since MySQL (InnoDB) defaults to 16KiB pages, and MSSQL defaults to 64KiB extents (each extent is eight 8KiB pages, and I/O in MSSQL is overwhelmingly by extent, not by page). And that’s just databases–volblocksize=8K is pants-on-head stupid for general-purpose applications like file servers!
Now, for what you’re really asking for–quieting your noisy rust platters–storing your VMs as QCOW2 or raw files on datasets, rather than as zvols, is the best way to go. But Proxmox makes that pathologically difficult in its own UI, so I can’t really recommend it for you here.
- Uh, that was all things that suck. Where’s the thing that might NOT suck?
If your workload is typically larger than 8K per I/O, you can relatively trivially convince Proxmox to create the zvols for your VM with a larger volblocksize–although you need to do that at creation time; it’s immutable for the zvol once set (so you can’t just convert your existing VMs in place).
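The knob for this lives on the Proxmox storage definition rather than on the VM itself. Here’s a sketch of what I mean, with an assumed storage name of local-zfs and an assumed pool of rpool/data; check your own storage.cfg before copying anything:

```sh
# /etc/pve/storage.cfg -- the zfspool storage type takes a "blocksize" option,
# which Proxmox uses as the volblocksize for any NEW zvol it creates there:
#
#   zfspool: local-zfs
#       pool rpool/data
#       content images,rootdir
#       blocksize 64k
#
# The same thing from the CLI:
pvesm set local-zfs --blocksize 64k

# Existing zvols keep their old volblocksize; verify on the next disk you create:
zfs get volblocksize rpool/data/vm-100-disk-1
```

Because volblocksize is fixed at creation, the usual workaround for existing VMs is to move each disk to a storage entry that carries the blocksize you want; the move writes a brand-new zvol, which picks up the new setting.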
If you’ve got a general-purpose Windows desktop machine for a VM, for example, I’d typically recommend volblocksize=64K for it. This allows for much greater compression (assuming 4KiB sectors, even a 64KiB block that only compresses down to 56KiB can now save whole sectors on disk!) as well as fewer of those noisy seeks, since 64KiB of data only needs to get written as a single stripe, rather than as eight separate stripes.
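Once a disk is on a bigger volblocksize, it’s easy to check whether compression is actually buying you anything (substitute your own zvol name):

```sh
# Check whether blocks are actually compressing now that they have room to,
# and how logical data compares to what the zvol references on disk.
zfs get volblocksize,compression,compressratio rpool/data/vm-100-disk-1
zfs get logicalreferenced,referenced rpool/data/vm-100-disk-1
```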
You also want to adjust your pool topology to match. Remember that recommendation for three-wide Z1? If your vdevs are three wide, and your blocks are 64KiB wide, that means each 64KiB write gets split nice and evenly into 32KiB of data on one drive plus 32KiB of data on another drive plus 32KiB of parity on the final drive in that stripe, rather than a bunch of higgledy-piggledy nonsense.
- Can I minimize seeks even further, with an even more aggressive topology change?
Absolutely! Ditch the Z1 and go to mirrors, and now you’re only forcing a seek on two drives per write instead of three drives per write. You’ll also get significantly higher performance and faster resilvers out of the deal.
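With the same twelve disks, a mirror pool would look something like this (again just a sketch, with placeholder device names):

```sh
# Six two-way mirrors: every block touches exactly two disks, reads can be served
# by either side of a mirror, and a resilver only copies real data from one disk.
zpool create -o ashift=12 tank \
  mirror /dev/disk/by-id/ata-DISK1  /dev/disk/by-id/ata-DISK2 \
  mirror /dev/disk/by-id/ata-DISK3  /dev/disk/by-id/ata-DISK4 \
  mirror /dev/disk/by-id/ata-DISK5  /dev/disk/by-id/ata-DISK6 \
  mirror /dev/disk/by-id/ata-DISK7  /dev/disk/by-id/ata-DISK8 \
  mirror /dev/disk/by-id/ata-DISK9  /dev/disk/by-id/ata-DISK10 \
  mirror /dev/disk/by-id/ata-DISK11 /dev/disk/by-id/ata-DISK12
```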
- I WANT EVEN MORE OPTIMIZATION!
Okay, so now we want to look at not just doing “one big C: drive” on your VMs, but actually giving them multiple drives and storing different data on different drives inside the VM.
Let’s say that I had it right earlier, and your one VM is a general-purpose Windows desktop. Let’s also say you want to store several TiB of bulk data–mostly movies, music, and photos–on your Windows VM.
So in this case, you leave the C: drive of your VM at volblocksize=64K as I recommended earlier, to get improved performance and efficiency with fewer seeks, and without introducing too much latency on small I/O. But now you add a D: drive for your music, movies, and photos–and on that D: drive, you set, say, volblocksize=256K. That cuts the number of seeks needed for your large bulk data by another factor of four!
You probably wouldn’t want to do that on your C: drive, because your C: drive needs to manage a lot of small block I/O operations ranging from log streaming to database and database-like operations (SQLite, the Windows registry, you name it), and large block size with small I/O means poor latency–the last thing you want is to make your desktop feel unresponsive because you increased the effective application latency on its C: drive.
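One tidy way to get per-drive volblocksize out of Proxmox is to point two storage entries at different datasets on the same pool, each with its own blocksize, and then create each virtual disk on the matching storage. This is a sketch only; the pool, dataset, and storage names are invented:

```sh
# Two parent datasets on the same pool, one per "class" of virtual disk.
zfs create tank/vms
zfs create tank/bulk

# /etc/pve/storage.cfg -- each zfspool storage entry carries its own blocksize:
#
#   zfspool: vm-disks
#       pool tank/vms
#       content images
#       blocksize 64k
#
#   zfspool: bulk-disks
#       pool tank/bulk
#       content images
#       blocksize 256k
#
# Put the VM's C: drive on "vm-disks" and its D: drive on "bulk-disks"; each new
# zvol then picks up the right volblocksize at creation time.
```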
So, ideally, you want to do a little experimenting and find the sweet spot for your own workload. You might discover that volblocksize=64K isn’t quite low-latency enough for your tastes on the C: drive, so you set it to 32K instead–still a 4x reduction in the seeks you’re seeing now. And maybe you discover that you don’t have any perceptible latency issues on the D: drive, so you pump its volblocksize all the way up to 1MiB–which will both improve throughput and essentially eliminate platter chatter, since every single I/O to D: will now lay out 256 contiguous data sectors across your drives, instead of only two. (Assuming three-wide Z1, that means each disk reads and writes contiguous 128-sector chunks, rather than a single sector per disk per I/O operation.)
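As for finding that sweet spot experimentally: one crude but effective approach is to spin up a scratch zvol at each candidate volblocksize and hit it with fio, using an I/O size that resembles your real workload. The names and numbers here are only an example of the shape of the test, not a recommendation:

```sh
# Scratch zvol at the candidate volblocksize (destroy it afterwards).
zfs create -V 20G -o volblocksize=32k tank/vms/bstest

# Random 32K writes at modest queue depth, roughly what a busy desktop C: drive does.
fio --name=bstest --filename=/dev/zvol/tank/vms/bstest \
    --rw=randwrite --bs=32k --ioengine=libaio --iodepth=8 --direct=1 \
    --runtime=60 --time_based --group_reporting

# Repeat with other volblocksize / --bs combinations, compare the IOPS and latency
# percentiles fio reports, and watch (and listen to) the disks with:
#   zpool iostat -v tank 1
zfs destroy tank/vms/bstest
```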
- Well. Um. Crap. I think I need a nap now!
Me too, friend, me too. But I hope this helps you plan the next stages of your journey. =)