Hard Drives in zfs pool constantly seeking every second

Just finished building a Proxmox server with my first HDD array. Added a Debian VM with access to a mount point on the pool.
While the VM is running, I can hear the read/write arms of all 12 HDDs in sync doing a seek every second, all day long. This stops when the VM is stopped. As far as I can see the system itself is idle.
Is there any reason for this continual seeking? Any way to safely stop it from happening?
Is this perhaps an issue in the way I created my pool?

I created via command line:
zpool create tank raidz1 /disk/SN1 /disk/SN2 /disk/SN3 /disk/SN4 raidz1 /disk/SN5 /disk/SN6 /disk/SN7 /disk/SN8 raidz1 /disk/SN9 /disk/SN10 /disk/SN11 /disk/SN12

Oh, this is gonna be a fun one. Apologies, but let me paraphrase your issues for brevity and clarity:

I have a Proxmox pool of three four-wide RAIDz1 vdevs. Whenever my VM is running, all twelve drives chatter audibly at least once per second, all day long. How come, and can I fix that?

OK, first off: your VM is probably just trickling a stream of writes to disk, for example by way of logging. And that stream of writes is getting sync’d once per second to the underlying metal.

But, why is it happening so often, and is there anything to be done about it? Yes, no, and “maybe not, but there are other problems here that you won’t be happy about,” so let’s look at those.

  • four-wide RAIDz1 vdevs

So, this is a mild problem, but one that these days most people think isn’t one, even if they’re aware of it. When you set up a four-wide Z1, what you’ve done is create a vdev that must use padding for every single write.

Every block (aka “record” for dataset or “volblock” for zvols) written to a RAIDz vdev must be split up evenly amongst n-p drives, where n is the number of drives in the vdev and p is the parity level.

If you’ve got a four-wide Z1, this means blocks must be split into thirds–and that won’t divide evenly, because recordsize and volblocksize alike are always a power of two. So, every single write needs padding in order to make it come out evenly, wasting both space and performance.
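To make the "thirds" problem concrete, here is a minimal Python sketch. It assumes 4KiB sectors (ashift=12), and the function name is made up for illustration; it only checks whether a block's data sectors can be dealt out evenly across the data drives, nothing more.

```python
SECTOR = 4096  # assumption: 4 KiB sectors (ashift=12)

def leftover_sectors(block_bytes, width, parity):
    """Sectors left over after dealing a block's data sectors evenly
    across the (width - parity) data drives of one RAIDz vdev."""
    data_sectors = block_bytes // SECTOR
    return data_sectors % (width - parity)

if __name__ == "__main__":
    for kib in (8, 16, 32, 64, 128, 256, 512, 1024):
        four = leftover_sectors(kib * 1024, width=4, parity=1)
        three = leftover_sectors(kib * 1024, width=3, parity=1)
        print(f"{kib:>5} KiB  4-wide Z1 leftover: {four}  3-wide Z1 leftover: {three}")
```

No power-of-two block size comes out even on three data drives, while every one of them comes out even on two.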

OpenZFS founding developer Matt Ahrens famously–and correctly, so far as it goes–argues that this isn’t as big a deal as most people make it out to be, since OpenZFS already ends up staggering things unevenly due to compression (which changes the number of sectors necessary to store a block) and metadata blocks (which generally only require a single sector, plus parity or redundancy as appropriate). So while there can be an impact, it’s not often big enough to care much about.

I generally argue the other side of this: yes, compression and metadata blocks make “perfect” vdev widths less important than they might naively be considered to be… which is not the same thing as saying it makes no difference whatsoever, and it’s generally not that hard to build a pool out of ideal-width vdevs anyway, so why be wasteful?

But the impact is rather worse here, because of how you’re using your pool. Let’s continue.

  • Proxmox stores virtual machines in ZVOLs with volblocksize=8k

Matt’s advice is best applied to reasonably compressible files stored in datasets, which is not how Proxmox works. Your virtual machine is stored in a ZVOL, not a dataset–the individual block sizes are not dynamic. Compression can still cause blocks to require fewer sectors to write than normally, which does still mean even “ideal width” vdevs will often come out “uneven” in terms of on-disk layout… but there’s a caveat there as well, thanks to another Proxmox design quirk / design decision.

Proxmox, by default, uses volblocksize=8K on all of your VMs unless you force it to do otherwise at zvol creation time. You didn’t specify what type of rust drive you’re using, or whether you manually specified ashift, but the odds are overwhelmingly good that you’ve got drives with 4KiB sector size, and therefore ashift=12.

So: every single write your VM makes will only be two sectors wide, at maximum. And they won’t often be less than two sectors wide, because compression can’t use “partial sectors”–so if your block can compress to <=50% of its logical size, then you can store it in one data sector (plus parity), otherwise it’ll be stored in two data sectors (plus parity). But you won’t get much in the way of compression ratios, because every single block that “only” compresses to 51% of its original size will still require two sectors plus parity.
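A rough Python sketch of that whole-sector rounding, assuming 4KiB sectors and an 8KiB volblocksize. The helper name and the percentages are illustrative, not anything ZFS itself exposes.

```python
SECTOR = 4096          # assumption: 4 KiB sectors (ashift=12)
VOLBLOCKSIZE = 8192    # the 8K volblocksize discussed above

def data_sectors(block_bytes, compressed_percent):
    """Whole data sectors needed once a block has been compressed to
    compressed_percent of its logical size (parity not counted)."""
    compressed = (block_bytes * compressed_percent + 99) // 100  # round bytes up
    return max(1, -(-compressed // SECTOR))                      # round sectors up

if __name__ == "__main__":
    for pct in (25, 50, 51, 75, 100):
        n = data_sectors(VOLBLOCKSIZE, pct)
        print(f"compresses to {pct:>3}% -> {n} data sector(s)")
```

Anything that compresses to 50% or better fits in one data sector; 51% costs exactly as much as no compression at all.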

This has several unfortunate implications: first, you aren’t getting the storage efficiency you think you’re getting out of those four-wide Z1s, because you can only write 8K blocks, so you never, ever, ever actually manage to perform a full stripe write. Instead, you wind up with three-wide stripes on your four-wide vdevs, because every 8KiB write gets split into two 4KiB sectors (not three!) which get a single 4KiB parity sector to match.

This means that you’re both wasting drive space (compared to your original plan) and introducing a lot more individual drive writes (with corresponding platter chatter), since every write is non-optimal, you get very little compression, and you’re committing all writes in teeny tiny bite size chunks.

  • Well, shit. Now what? How can I make this less worse?

First of all, I would strongly recommend rethinking your pool layout. If you put those same 12 disks into four 3-wide Z1 instead of three 4-wide Z1, you’ll get more efficient reads and writes, because now you’ll actually be writing three-wide stripes to three-wide vdevs, and will be able to perform roughly 33% more writes per second (since they’re essentially the same three-wide writes, but this time onto four vdevs total instead of three).

Normally, the argument against this would be “but I want 75% storage efficiency instead of 67% storage efficiency”–but this is Proxmox, so you’re working with zvols, not datasets, and you’re working with tiny individual blocks, so you actually were already at only 67% (maximum) storage efficiency, and just didn’t realize it.
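Here is a back-of-the-envelope Python sketch of that efficiency math. It is a simplified model that assumes 4KiB sectors and one parity sector per stripe row, and it ignores any extra allocation padding RAIDz may add, so read the numbers as an upper bound. The function names are hypothetical.

```python
SECTOR = 4096  # assumption: 4 KiB sectors (ashift=12)

def stripe_layout(block_bytes, width, parity=1):
    """Return (data_sectors, parity_sectors) for one block on one RAIDz vdev,
    using a simplified model: parity sectors per row of (width - parity) data sectors."""
    data = -(-block_bytes // SECTOR)        # ceil division
    rows = -(-data // (width - parity))     # stripe rows needed
    return data, rows * parity

def efficiency(block_bytes, width, parity=1):
    """Fraction of the written sectors that hold data rather than parity."""
    data, par = stripe_layout(block_bytes, width, parity)
    return data / (data + par)

if __name__ == "__main__":
    for label, size in (("8K", 8 * 1024), ("128K", 128 * 1024)):
        for width in (3, 4):
            print(f"{label:>4} block on {width}-wide Z1: {efficiency(size, width):.1%}")
```

Under this model an 8K block lands at the same two-thirds efficiency on a four-wide Z1 as on a three-wide one; only much larger blocks get close to the nominal 75%.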

Narrowing your vdevs from four-wide to three-wide also improves your system’s ruggedness, because now there are only three points of failure in each of your single-failure-tolerant vdevs, instead of four points of failure in each. Win, win, win.

  • That doesn’t sound like it will make my drives chatter less, though…?

No, probably not. And you may never be entirely happy with this, if you don’t like hearing your drives chatter. But you can do a couple of things to minimize it.

Proxmox’s default to 8K volblocksize is, frankly, wildly inappropriate for nearly all the use-cases Proxmox is commonly put to. It would make sense for a dedicated PostgreSQL database engine, because PostgreSQL uses 8KiB page size in its db storage–so you’re matching like to like, for the most optimal read/write patterns for that workload.

But it’s a much smaller than optimal blocksize for nearly everything else–even MySQL and MSSQL databases generally shouldn’t have a volblocksize that low, since MySQL (InnoDB) defaults to 16KiB pages, and MSSQL defaults to 64KiB extents (each extent is eight 8KiB pages, and I/O is overwhelmingly by extent, not by page, in MSSQL). And that’s just databases–volblocksize=8K is, frankly, pants-on-head stupid for general purpose applications like file servers!

Now, for what you’re really asking for–quieting your noisy rust platters–datasets and QCOW2 or raw storage files are the best way to go. But Proxmox will make that pathologically difficult in its own UI, so I can’t really recommend that for you here.

  • Uh, that was all things that suck. Where’s the thing that might NOT suck?

If your workload is typically larger than 8K per I/O, you can relatively trivially convince Proxmox to create the zvols for your VM with a larger volblocksize–although you need to do that at creation time; it’s immutable for the zvol once set (so you can’t just convert your existing VMs in place).

If you’ve got a general-purpose Windows desktop machine for a VM, for example, I’d typically recommend volblocksize=64K for it. This allows for much greater compression (assuming 4KiB sectors, you can potentially realize actual compression on-disk even if a 64KiB block only compresses down to 60KiB!) as well as fewer of those noisy seeks, since 64KiB of data only needs to get written to a single stripe, rather than to eight stripes.
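A quick Python sketch of both claims, assuming 4KiB sectors and block-aligned I/O. The helper names are made up for illustration.

```python
SECTOR = 4096  # assumption: 4 KiB sectors (ashift=12)

def blocks_touched(io_bytes, volblocksize):
    """How many volblocks (and so how many separate stripes) an aligned I/O spans."""
    return -(-io_bytes // volblocksize)  # ceil division

def sectors_needed(nbytes):
    """Whole data sectors needed to store nbytes."""
    return -(-nbytes // SECTOR)

if __name__ == "__main__":
    io = 64 * 1024
    print("64 KiB of data at volblocksize=8K :", blocks_touched(io, 8 * 1024), "blocks")
    print("64 KiB of data at volblocksize=64K:", blocks_touched(io, 64 * 1024), "block")
    # A 64 KiB block that only compresses to 60 KiB still saves a whole sector.
    print("uncompressed 64 KiB:", sectors_needed(64 * 1024), "sectors")
    print("compressed to 60 KiB:", sectors_needed(60 * 1024), "sectors")
```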

You also want to adjust your pool topology to match. Remember that recommendation for three-wide Z1? If your vdevs are three wide, and your blocks are 64KiB wide, that means each 64KiB write gets split nice and evenly into 32KiB of data on one drive plus 32KiB of data on another drive plus 32KiB of parity on the final drive in that stripe, rather than a bunch of higgledy-piggledy nonsense.

  • Can I minimize seeks even further, with an even more aggressive topology change?

Absolutely! Ditch the Z1 and go to mirrors, and now you’re only forcing a seek on two drives per write instead of three drives per write. You’ll also get significantly higher performance and faster resilvers out of the deal.

  • I WANT EVEN MORE OPTIMIZATION!

Okay, so now we want to look at not just doing “one big C: drive” on your VMs, but actually giving them multiple drives and storing different data on different drives inside the VM.

Let’s say that I had it right earlier, and your one VM is a general-purpose Windows desktop. Let’s also say you want to store several TiB of bulk data–mostly movies, music, and photos–on your Windows VM.

So in this case, you leave the C: drive of your VM at volblocksize=64K as I recommended earlier, to get you improved performance and efficiency with fewer seeks, and without introducing too much latency on smaller I/O. But now you add a D: drive for your music, movies, and photos–and on your D: drive, you set, say, volblocksize=256K. This reduces the number of seeks necessary for each I/O operation by another factor of four on your large bulk data!

You probably wouldn’t want to do that on your C: drive, because your C: drive needs to manage a lot of small block I/O operations ranging from log streaming to database and database-like operations (SQLite, the Windows registry, you name it), and large block size with small I/O means poor latency–the last thing you want is to make your desktop feel unresponsive because you increased the effective application latency on its C: drive.

So, you want to do a little experimenting, ideally, and find the sweet spot for your own workload. You might discover that volblocksize=64K isn’t quite low latency enough for your tastes on the C: drive, so you set it to 32K instead–still giving you a 4x factor for relief from the seeks you’re seeing now. And maybe you discover that you don’t have any perceptible latency issues on your D: drive, so you pump its volblocksize all the way up to 1MiB–which will both improve throughput and essentially eliminate platter chatter, since every single I/O to D: will now split 256 total contiguous sectors between your drives, instead of only two. (Assuming three-wide Z1, this means each disk reads and writes in contiguous 128 sector chunks, rather than each disk reading and writing individual sectors with each individual I/O operation).
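If you want to eyeball those sector counts for other block sizes, here is a small Python sketch. It assumes 4KiB sectors, three-wide Z1 and uncompressed blocks, and the function name is hypothetical.

```python
SECTOR = 4096  # assumption: 4 KiB sectors (ashift=12)

def per_disk_run(volblocksize, width=3, parity=1):
    """Return (total data sectors per block, contiguous data sectors per data disk)
    for one uncompressed block on one RAIDz vdev."""
    total = volblocksize // SECTOR
    return total, total // (width - parity)

if __name__ == "__main__":
    for label, size in (("8K", 8 << 10), ("32K", 32 << 10), ("64K", 64 << 10),
                        ("256K", 256 << 10), ("1M", 1 << 20)):
        total, run = per_disk_run(size)
        print(f"volblocksize={label:>4}: {total:>3} data sectors, {run:>3} per data disk")
```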

  • Well. Um. Crap. I think I need a nap now!

Me too, friend, me too. But I hope this helps you plan the next stages of your journey. =)

I’m sorry that was such a freaking novel. Feels like you asked me how to wash your car and I insisted on stepping you through Johnny Cash’s Cadillac before I’d be willing to talk about how to wash it, but sometimes there just ain’t any other way. :)

That was very informative and I love it!
Definitely gave me a bit to think about especially considering the lack of efficiency utilizing the space and compression with the 4 wide Z1.

Oh, I just thought of another way you could reduce the platter chatter–add a special vdev, which will store all those pesky metadata blocks. Just make sure to use high-quality, low-latency, high write endurance SSDs with at least as much fault tolerance as your vdevs–which in the case of either Z1 vdevs or 2-way mirror vdevs, would mean a two-wide mirrored special.

That will keep the metadata operations, which necessarily are pretty much always going to be single-sector, off the noisy bits–and potentially get you a performance boost as well, although honestly I wouldn’t expect too much of that at this scale.

You should be aware, however, that a special vdev is another point of failure for the entire pool. If you lose a special, you lose the entire pool with it, no ifs ands or buts. So, choose wisely based on your own needs. :)

This was an absolutely fascinating read, so thanks for that! It does raise the question: if volblocksize=8K is such a problematic default, what would be a more sane default?

I was inspired to do a little testing. The default volblocksize (which you can set when defining a storage pool in PVE) used to be 8K, but is now 16K. It seems like it would make a lot of sense to define different storage pools with different volblocksizes and just pick the appropriate one for the use case.

I don’t use zvols very often myself, because they’ve never tested out well as compared to flat files and I find several advantages in using flat files rather than zvols (dynamic recordsize being one of them, not having refreservation bite me in the ass being another).

With that said, I have tested various blocksizes when hosting VMs on ZFS datasets pretty extensively, and from what I’ve found, the 64KiB cluster_size that qcow2 uses by default is a pretty solid starting point that works well for most general-purpose workloads.

I would recommend testing different volblocksizes if you’ve got the time and energy to do so, but I suspect most folks will find either 32K or 64K to be the sweet spot for generic latency-sensitive workloads, with potential advantages coming from tightening the belt a little bit on heavily database-centric workloads.

OpenZFS itself defaults to 128KiB blocksize (recordsize) on datasets, and that’s certainly tolerable, but I find it’s a bit “fat” for my taste on generic VM workloads that need to be latency-sensitive (read: especially desktop VMs I want to pull graphical consoles on).

Awesome. This kind of discussion is really what I love about this forum.

You’ve inspired me to do some testing. I run almost every service in Proxmox using LXC rather than KVM, which does use datasets rather than zvols, but I did set up a storage pool with volblocksize=32k and moved a VM there. I’ll move a couple more over once the work day is done and see how they do.

It’s too early to have any meaningful results, but here’s some logistical info for anyone who wants to try. You can specify the volblocksize when you create a storage pool in PVE and it turns out you can specify multiple storage pools with different volblocksizes pointing to the same dataset. As far as PVE is concerned, it treats the underlying storage the same, it just seems to use the relevant volblocksize when it makes new zvols. Also, if you migrate a VM’s disk from one storage pool to another (let’s say I’m moving it from zstore-8k to zstore-32k), it will adopt the volblocksize of the new storage pool.

Pending some more doodling around, that seems to point to a fairly low bar to test these VMs under different conditions.
