Hard Drives in zfs pool constantly seeking every second

Just finished building a Proxmox server with my first HDD array. Added a Debian VM with access to a mount point on the pool.
While the VM is running, I can hear the read/write arms of all 12 HDDs in sync doing a seek every second, all day long. This stops when the VM is stopped. As far as I can see the system itself is idle.
Is there any reason for this continual seeking? Any way to safely stop it from happening?
Is this perhaps an issue in the way I created my pool?

I created via command line:
zpool create tank raidz1 /disk/SN1 /disk/SN2 /disk/SN3 /disk/SN4 raidz1 /disk/SN5 /disk/SN6 /disk/SN7 /disk/SN8 raidz1 /disk/SN9 /disk/SN10 /disk/SN11 /disk/SN12

2 Likes

Oh, this is gonna be a fun one. Apologies, but let me paraphrase your issues for brevity and clarity:

I have a Proxmox pool of three four-wide RAIDz1 vdevs. Whenever my VM is running, all twelve drives chatter audibly at least once per second, all day long. How come, and can I fix that?

OK, first off: your VM is probably just trickling a stream of writes to disk, for example by way of logging. And that stream of writes is getting sync’d once per second to the underlying metal.
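If you want to confirm that theory before changing anything, it’s easy enough to watch the writes land in real time. A quick sketch, assuming your pool is still named tank as in your create command:

# watch per-vdev read/write activity, refreshed every second
zpool iostat -v tank 1

# from inside the Debian guest (if iotop is installed): which processes are issuing I/O
iotop -ao

If the once-per-second chatter lines up with a steady trickle of writes in that output, you’ve found your culprit.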

But why is it happening so often, can you safely stop it, and is it down to how you created the pool? Yes, no, and “maybe not, but there are other problems here that you won’t be happy about,” so let’s look at those.

  • four-wide RAIDz1 vdevs

So, this is a mild problem, but one that these days most people think isn’t one, even if they’re aware of it. When you set up a four-wide Z1, what you’ve done is create a vdev that must use padding for every single write.

Every block (aka “record” for dataset or “volblock” for zvols) written to a RAIDz vdev must be split up evenly amongst n-p drives, where n is the number of drives in the vdev and p is the parity level.

If you’ve got a four-wide Z1, this means blocks must be split into thirds–and that won’t divide evenly, because recordsize and volblocksize alike are always a power of two. So, every single write needs padding in order to make it come out evenly, wasting both space and performance.
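To put rough numbers on that, here’s a back-of-the-envelope sketch, assuming ashift=12, no compression, a full 128KiB record, and that I’ve got the allocator’s rounding rule right:

# 128KiB record at 4KiB sectors                = 32 data sectors
# parity on a 4-wide Z1 (1 per 3 data sectors) = ceil(32/3) = 11 parity sectors
# RAIDz pads each allocation to a multiple of (parity + 1) = 2 sectors
# 32 data + 11 parity = 43 sectors  ->  rounded up to 44 sectors allocated
# effective efficiency: 32/44 = ~73%, versus the 75% the topology advertises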

OpenZFS founding developer Matt Ahrens famously–and correctly, so far as it goes–argues that this isn’t as big a deal as most people make it out to be, since OpenZFS already ends up staggering things unevenly due to compression (which changes the number of sectors necessary to store a block) and metadata blocks (which generally only require a single sector, plus parity or redundancy as appropriate). So while there can be an impact, it’s not often big enough to care much about.

I generally argue the other side of this: yes, compression and metadata blocks make “perfect” vdev widths less important than they might naively be considered to be… which is not the same thing as saying it makes no difference whatsoever, and it’s generally not that hard to build a pool out of ideal-width vdevs anyway, so why be wasteful?

But the impact is rather worse here, because of how you’re using your pool. Let’s continue.

  • Proxmox stores virtual machines in ZVOLs with volblocksize=8k

Matt’s advice is best applied to reasonably compressible files stored in datasets, which is not how Proxmox works. Your virtual machine is stored in a ZVOL, not a dataset–the individual block sizes are not dynamic. Compression can still cause blocks to require fewer sectors to write than normally, which does still mean even “ideal width” vdevs will often come out “uneven” in terms of on-disk layout… but there’s a caveat there as well, thanks to another Proxmox design quirk / design decision.

Proxmox, by default, uses volblocksize=8K on all of your VMs unless you force it to do otherwise at zvol creation time. You didn’t specify what type of rust drive you’re using, or whether you manually specified ashift, but the odds are overwhelmingly good that you’ve got drives with 4KiB sector size, and therefore ashift=12.
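If you’d rather verify than assume, both numbers are easy to check. A quick sketch, again using your pool name from the create command and assuming zdb can see the imported pool:

# physical vs logical sector size of each drive
lsblk -o NAME,MODEL,PHY-SEC,LOG-SEC

# ashift actually in use on each vdev of the pool
zdb -C tank | grep ashift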

So: every single write your VM makes will only be two sectors wide, at maximum. And they won’t often be less than two sectors wide, because compression can’t use “partial sectors”–so if your block can compress to <=50% of its logical size, then you can store it in one data sector (plus parity), otherwise it’ll be stored in two data sectors (plus parity). But you won’t get much in the way of compression ratios, because every single block that “only” compresses to 51% of its original size will still require two sectors plus parity.

This has several unfortunate implications: first, you aren’t getting the storage efficiency you think you’re getting out of those four-wide Z1s, because with only 8K blocks to write, you never, ever actually manage to perform a full-stripe write. Instead, you wind up with three-wide stripes on your four-wide vdevs, because every 8KiB write gets split into two 4KiB data sectors (not three!), which get a single 4KiB parity sector to match.

This means that you’re both wasting drive space (compared to your original plan) and introducing a lot more individual drive writes (with corresponding platter chatter), since every write is non-optimal, you get very little compression, and you’re committing all writes in teeny tiny bite size chunks.

  • Well, shit. Now what? How can I make this less worse?

First of all, I would strongly recommend rethinking your pool layout. If you put those same 12 disks into four three-wide Z1 vdevs instead of three four-wide Z1 vdevs, you’ll get more efficient reads and writes, because now you’ll actually be writing three-wide stripes to three-wide vdevs, and you’ll be able to perform roughly a third more writes per second (since they’re essentially the same three-wide writes, but spread across four vdevs instead of three).

Normally, the argument against this would be “but I want 75% storage efficiency instead of 67% storage efficiency”–but this is Proxmox, so you’re working with zvols, not datasets, and you’re working with tiny individual blocks, so you actually were already at only 67% (maximum) storage efficiency, and just didn’t realize it.

Narrowing your vdevs from four-wide to three-wide also improves your system’s ruggedness, because now there are only three points of failure in each of your single-SPOF-tolerant vdevs, instead of four points of failure in each. Win, win, win.
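If you do decide to rebuild along those lines, the create command is just a reshuffle of your original one (same placeholder disk names as your post; note that this means destroying the existing pool and restoring from backup):

zpool create tank \
  raidz1 /disk/SN1  /disk/SN2  /disk/SN3  \
  raidz1 /disk/SN4  /disk/SN5  /disk/SN6  \
  raidz1 /disk/SN7  /disk/SN8  /disk/SN9  \
  raidz1 /disk/SN10 /disk/SN11 /disk/SN12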

  • That doesn’t sound like it will make my drives chatter less, though…?

No, probably not. And you may never be entirely happy with this, if you don’t like hearing your drives chatter. But you can do a couple of things to minimize it.

Proxmox’s default of volblocksize=8K is, frankly, wildly inappropriate for nearly all the use cases Proxmox is commonly put to. It would make sense for a dedicated PostgreSQL database engine, because PostgreSQL uses an 8KiB page size in its db storage–so you’re matching like to like, for optimal read/write patterns for that workload.

But it’s a much smaller than optimal blocksize for nearly everything else–even MySQL and MSSQL databases generally shouldn’t have a volblocksize that low, since MySQL (InnoDB) defaults to 16KiB pages, and MSSQL defaults to 64KiB extents (each extent is eight 8KiB pages, and I/O is overwhelmingly by extent, not by page, in MSSQL). And that’s just databases–volblocksize=8K is, frankly, pants-on-head stupid for general purpose applications like file servers!

Now, for what you’re really asking for–quieting your noisy rust platters–datasets and QCOW2 or raw storage files are the best way to go. But Proxmox will make that pathologically difficult in its own UI, so I can’t really recommend that for you here.

  • Uh, that was all things that suck. Where’s the thing that might NOT suck?

If your workload is typically larger than 8K per I/O, you can relatively trivially convince Proxmox to create the zvols for your VM with a larger volblocksize–although you need to do that at creation time; it’s immutable for the zvol once set (so you can’t just convert your existing VMs in place).

If you’ve got a general-purpose Windows desktop machine for a VM, for example, I’d typically recommend volblocksize=64K for it. This allows for much greater compression (assuming 4KiB sectors, you can potentially realize actual compression on-disk even if a 64KiB block only compresses down to 60KiB!) as well as fewer of those noisy seeks, since 64KiB of data only needs to get written to a single stripe, rather than to eight stripes.
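For reference, if you were creating a zvol by hand rather than letting the Proxmox UI do it, volblocksize is just a creation-time property. The name and size below are made up for illustration:

# sparse 80G zvol with 64KiB blocks; volblocksize cannot be changed after creation
zfs create -s -V 80G -o volblocksize=64K tank/vm-100-disk-0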

You also want to adjust your pool topology to match. Remember that recommendation for three-wide Z1? If your vdevs are three wide, and your blocks are 64KiB wide, that means each 64KiB write gets split nice and evenly into 32KiB of data on one drive plus 32KiB of data on another drive plus 32KiB of parity on the final drive in that stripe, rather than a bunch of higgledy-piggledy nonsense.

  • Can I minimize seeks even further, with an even more aggressive topology change?

Absolutely! Ditch the Z1 and go to mirrors, and now you’re only forcing a seek on two drives per write instead of three drives per write. You’ll also get significantly higher performance and faster resilvers out of the deal.

  • I WANT EVEN MORE OPTIMIZATION!

Okay, so now we want to look at not just doing “one big C: drive” on your VMs, but actually giving them multiple drives and storing different data on different drives inside the VM.

Let’s say that I had it right earlier, and your one VM is a general-purpose Windows desktop. Let’s also say you want to store several TiB of bulk data–mostly movies, music, and photos–on your Windows VM.

So in this case, you leave the C: drive of your VM at volblocksize=64K as I recommended earlier, to get you improved performance and efficiency with fewer seeks, and without introducing too much latency on smaller I/O. But now you add a D: drive for your music, movies, and photos–and on your D: drive, you set, say, volblocksize=256K. That cuts the number of seeks necessary for each I/O operation on your large bulk data by another factor of four!

You probably wouldn’t want to do that on your C: drive, because your C: drive needs to manage a lot of small block I/O operations ranging from log streaming to database and database-like operations (SQLite, the Windows registry, you name it), and large block size with small I/O means poor latency–the last thing you want is to make your desktop feel unresponsive because you increased the effective application latency on its C: drive.

So, you want to do a little experimenting, ideally, and find the sweet spot for your own workload. You might discover that volblocksize=64K isn’t quite low latency enough for your tastes on the C: drive, so you set it to 32K instead–still a 4x reduction in the seeks you’re seeing now. And maybe you discover that you don’t have any perceptible latency issues on your D: drive, so you pump its volblocksize all the way up to 1MiB–which will both improve throughput and essentially eliminate platter chatter, since every single I/O to D: will now split 256 contiguous sectors between your drives, instead of only two. (Assuming three-wide Z1, this means each disk reads and writes in contiguous 128-sector chunks, rather than each disk reading and writing individual sectors for each individual I/O operation.)
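If you’d rather put numbers behind that experimenting than go purely by feel, fio run from inside the guest (against a scratch file you can delete afterward) is the usual tool. A starting-point sketch, not gospel; the file path and sizes are made up:

# mixed 70/30 random read/write at a 64KiB I/O size, 4 minutes, direct I/O
fio --name=blocksize-test --filename=/var/tmp/fio.test --size=8G \
    --rw=randrw --rwmixread=70 --bs=64k --ioengine=libaio --iodepth=16 \
    --direct=1 --runtime=240 --time_based --group_reporting

Run the same job with the guest’s disk on a few different volblocksize storages and compare both the throughput and the completion-latency percentiles fio reports.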

  • Well. Um. Crap. I think I need a nap now!

Me too, friend, me too. But I hope this helps you plan the next stages of your journey. =)

9 Likes

I’m sorry that was such a freaking novel. Feels like you asked me how to wash your car and I insisted on stepping you through Johnny Cash’s Cadillac before I’d be willing to talk about how to wash it, but sometimes there just ain’t any other way. :slight_smile:

1 Like

That was very informative and I love it!
Definitely gave me a bit to think about, especially considering the loss of space and compression efficiency with the 4-wide Z1.

1 Like

Oh, I just thought of another way you could reduce the platter chatter–add a special vdev, which will store all those pesky metadata blocks. Just make sure to use high-quality, low-latency, high write endurance SSDs with at least as much fault tolerance as your vdevs–which in the case of either Z1 vdevs or 2-way mirror vdevs, would mean a two-wide mirrored special.
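Roughly like so; the device paths are hypothetical stand-ins for your actual SSDs:

# add a two-wide mirrored special allocation class vdev for metadata
zpool add tank special mirror /dev/disk/by-id/ssd-serial-A /dev/disk/by-id/ssd-serial-B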

That will keep the metadata operations, which necessarily are pretty much always going to be single-sector, off the noisy bits–and potentially get you a performance boost as well, although honestly I wouldn’t expect too much of that at this scale.

You should be aware, however, that a special vdev is another point of failure for the entire pool. If you lose a special, you lose the entire pool with it, no ifs ands or buts. So, choose wisely based on your own needs. :slight_smile:

This was an absolutely fascinating read, so thanks for that! It does beg the question, if volblocksize=8K is such a problematic default, what would be a more sane default?

I was inspired to do a little testing. The default volblocksize (which you can set when defining a storage pool in PVE) used to be 8K, but is now 16K. It seems like it would make a lot of sense to define different storage pools with different volblocksizes and just pick the appropriate one for the use case.

I don’t use zvols very often myself, because they’ve never tested out well as compared to flat files and I find several advantages in using flat files rather than zvols (dynamic recordsize being one of them, not having refreservation bite me in the ass being another).

With that said, I have tested various blocksizes when hosting VMs on ZFS datasets pretty extensively, and from what I’ve found, the 64KiB cluster_size that qcow2 uses by default is a pretty solid starting point that works well for most general-purpose workloads.

I would recommend testing different volblocksizes if you’ve got the time and energy to do so, but I suspect most folks will find either 32K or 64K to be the sweet spot for generic latency-sensitive workloads, with potential advantages coming from tightening the belt a little bit on heavily database-centric workloads.

OpenZFS itself defaults to 128KiB blocksize (recordsize) on datasets, and that’s certainly tolerable, but I find it’s a bit “fat” for my taste on generic VM workloads that need to be latency-sensitive (especially read: desktop VMs I want to pull graphical consoles on).

Awesome. This kind of discussion is really what I love about this forum.

You’ve inspired me to do some testing. I run almost every service in Proxmox using LXC rather than KVM, which does use datasets rather than zvols, but I did set up a storage pool with volblocksize=32k and moved a VM there. I’ll move a couple more over once the work day is done and see how they do.

It’s too early to have any meaningful results, but here’s some logistical info for anyone who wants to try. You can specify the volblocksize when you create a storage pool in PVE, and it turns out you can define multiple storage pools with different volblocksizes pointing to the same dataset. As far as PVE is concerned, it treats the underlying storage the same; it just uses the relevant volblocksize when it makes new zvols. Also, if you migrate a VM’s disk from one storage pool to another (let’s say I’m moving it from zstore-8k to zstore-32k), it will adopt the volblocksize of the new storage pool.
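For anyone following along at home, the resulting /etc/pve/storage.cfg entries end up looking something like this. The pool path is made up, and double-check the option name against your own config, but I believe it’s blocksize:

zfspool: zstore-8k
        pool tank/vmdata
        content images,rootdir
        sparse 1
        blocksize 8k

zfspool: zstore-32k
        pool tank/vmdata
        content images,rootdir
        sparse 1
        blocksize 32k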

Pending some more doodling around, that seems to point to a fairly low bar to test these VMs under different conditions.

2 Likes

So how did this turn out? Did you see any improvement?

I am a little concerned that you might not actually get the results you expect from this, depending on how Proxmox is migrating those VMs under the hood–if it’s using OpenZFS replication for the task, your block size won’t change despite the volblocksize setting being different; volblocksize is immutable once set, and replication does not resize blocks, it just moves them as-is.

That’s interesting. Life has been too chaotic to do any before/after comparisons, but performance feels a little better. I did just run zfs get volblocksize on a couple of the zvols that I moved, and they show the appropriate value set in the storage.

I did just run zfs get volblocksize on a couple of the zvols that I moved, and they show the appropriate value set in the storage.

That tells you the volblocksize property of the zvol, which might not necessarily correspond to the actual block size present in the zvol. Again, I really hope that the Proxmox authors accounted for this, but the property on the dataset won’t always match the size of the blocks in a dataset.

This is usually more of a concern on filesystem datasets, which have variable blocksize. Zvols don’t have variable blocksize; their blocksize is immutable once set. HOWEVER, if you’re using replication, the properties on the target are irrelevant; your existing blocks are sent as-is and written as-is. On a filesystem dataset, new files written will honor the new recordsize value, but the existing ones will not change, either on the original or on a replicated target with a different value.

Now, volblocksize is a bit funkier–there is no dynamic block size in a zvol, so all blocks must actually be the same size. The thing I’m not so sure of is whether it’s still possible to end up with actual block sizes that don’t match the volblocksize property, the way you can end up with GiB of files in a “recordsize=1M” dataset that are actually written in 128KiB blocks because you started out with that setting.

Essentially, since replication won’t rewrite blocks, what you really need to do is quiesce the zvol (or work from a snapshot instead of the original), then use a tool like dd to move every single sector of its data into a new zvol with a different volblocksize. That may very well be how disk moves triggered via Proxmox’s UI already handle this; I just don’t know off the top of my head, and it’s a question worth answering!
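A minimal sketch of that manual approach, with made-up zvol names, and assuming the VM is shut down first so the source isn’t changing underneath you:

# create the destination zvol with the volblocksize you actually want
zfs create -s -V 80G -o volblocksize=64K tank/vm-100-disk-0-new

# copy every sector of the old zvol into the new one
dd if=/dev/zvol/tank/vm-100-disk-0 of=/dev/zvol/tank/vm-100-disk-0-new \
   bs=1M status=progress conv=fsync

After that you’d point the VM at the new zvol and, once you’re satisfied, destroy the old one.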

That’s fascinating. If the volblocksize property doesn’t necessarily reflect the blocks actually on disk, is there a reliable way to tell what’s really going on?

I haven’t noticed that new VMs created with the larger block size behave differently than those which were moved from one storage to another. Take that with a grain of salt, of course. We’re at the far limits of my storage knowledge.

Yes and no. The only way I know of to examine the size of blocks as-stored on disk is pool-wide, so it may not be helpful if you’ve got a lot of data that really should have different block sizes.

root@elden:/# zdb -bbb rpool

Traversing all blocks to verify nothing leaked ...

loading concrete vdev 0, metaslab 115 of 116 ...
1.36T completed (13767MB/s) estimated time remaining: 0hr 00min 00sec        
	No leaks (block sum matches space maps exactly)

[output elided]
Block Size Histogram

  block   psize                lsize                asize
   size   Count   Size   Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:   213K   106M   106M   213K   106M   106M      0      0      0
     1K:   215K   259M   365M   215K   259M   365M      0      0      0
     2K:   322K   812M  1.15G   322K   812M  1.15G      0      0      0
     4K:  3.94M  16.1G  17.3G   309K  1.75G  2.90G  3.60M  14.4G  14.4G
     8K:  5.34M  51.3G  68.6G   245K  2.64G  5.54G  6.32M  59.1G  73.5G
    16K:  58.1M   933G  1002G  65.3M  1.02T  1.03T  58.1M   935G  1008G
    32K:   893K  36.4G  1.01T  33.1K  1.36G  1.03T   839K  34.0G  1.02T
    64K:  4.10M   264G  1.27T  5.72M   366G  1.39T  4.16M   270G  1.28T
   128K:   658K  82.2G  1.35T  1.36M   174G  1.56T   658K  82.3G  1.36T
   256K:      0      0  1.35T      0      0  1.56T    230  62.3M  1.36T
   512K:      0      0  1.35T      0      0  1.56T      0      0  1.36T
     1M:      0      0  1.35T      0      0  1.56T      0      0  1.36T
     2M:      0      0  1.35T      0      0  1.56T      0      0  1.36T
     4M:      0      0  1.35T      0      0  1.56T      0      0  1.36T
     8M:      0      0  1.35T      0      0  1.56T      0      0  1.36T
    16M:      0      0  1.35T      0      0  1.56T      0      0  1.36T

I don’t do any bulk storage to speak of on this system, so I don’t have any recordsize configured at greater than 128KiB. But if your Proxmox system is supposed to be almost entirely VMs at a specific volblocksize, you should be able to use a pool-wide histogram to see whether you’re at the volblocksize you think you are.

It will be normal to have a few smaller blocks in the pool no matter what–metadata blocks, whatever Proxmox has stored in the root filesystem, etc. But if all of your VMs are volblocksize=32K and you see almost all of your blocks are 8K or 16K, well, you know what’s up.

I’ve been talking this over with Allan Jude, and you can test a specific zvol, but the output is a bit hairy and ugly to parse. Let’s examine two zvols, one created with volblocksize=64K, the other created with volblocksize=256K, both given a quick sprinkling of data via mkfs.ext4 before we poke at them:

root@elden:/# zdb -dddddd rpool/zvol2 | grep -A1 dblk
    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         0    6   128K    16K    48K     512    16K    6.25  DMU dnode (K=inherit) (Z=inherit=zstd-unknown)
--
    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         1    3   128K    64K   112K     512     1G    0.10  zvol object (K=inherit) (Z=inherit=zstd-unknown)
--
    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         2    1   128K    512      0     512    512  100.00  zvol prop (K=inherit) (Z=inherit=zstd-unknown)
root@elden:/# zdb -dddddd rpool/zvol3 | grep -A1 dblk
    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         0    6   128K    16K    48K     512    16K    6.25  DMU dnode (K=inherit) (Z=inherit=zstd-unknown)
--
    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         1    3   128K   256K    72K     512     1G    0.22  zvol object (K=inherit) (Z=inherit=zstd-unknown)
--
    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         2    1   128K    512      0     512    512  100.00  zvol prop (K=inherit) (Z=inherit=zstd-unknown)

What we’re looking for here is the dblk (data block) column. Thing is, as you can see, there are three entries for both zvols–there’s always one entry for 512B and another for 16K data blocks.

But we can do a better job of highlighting only the data we’re looking for! Notice that the object type for the blocks we care about is the only one called “zvol object”, so instead of piping through grep -A1 dblk, we can pipe through grep -B1 zvol\ object:

root@elden:/# zdb -dddddd rpool/zvol2 | grep -B1 zvol\ object
    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         1    3   128K    64K   112K     512     1G    0.10  zvol object (K=inherit) (Z=inherit=zstd-unknown)

root@elden:/# zdb -dddddd rpool/zvol3 | grep -B1 zvol\ object
    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         1    3   128K   256K    72K     512     1G    0.22  zvol object (K=inherit) (Z=inherit=zstd-unknown)

There you have it: zvol2 is composed of 64KiB blocks, and zvol3 is composed of 256KiB blocks.

2 Likes