Should recordsize be divisible by number of vdevs?

I know striping in ZFS is dynamic, and will distribute the data as optimally as possible across the pool.

I’m wondering if setting the recordsize according to the number of vdevs in the pool would be better for performance or avoiding fragmentation?

I remember seeing an example on r/zfs where ZFS breaks up a 128K record into 32K chunks when there are 4 vdevs in the pool. But what if the record can’t be divided evenly?

Example: A fresh pool on 3 equal vdevs. For a dataset with large files, would a recordsize of 768K be better than 1M since 768 is divisible by 3?

And how does ZFS distribute the data if the dataset is sent to another pool with 2 or maybe 4 vdevs? 768 is also divisible by those numbers.


Recordsize must always be a power of 2, so this doesn’t come up. It’s the other way around: you size your vdevs so that recordsize divides evenly across them, which in practice means the number of drives minus the parity level should itself be a power of 2.

For example, consider a ten-wide RAIDz2 vdev: each stripe is data across eight drives and parity across two.

Every possible recordsize will divide evenly by eight (although some may wind up requiring so few sectors that they get an undersized stripe). If we assume ashift=12 and therefore 4KiB sectors:

Recordsize=4K: saved in three-wide stripe (one data, two parity)
Recordsize=8K: saved in four-wide stripe (two data, two parity)
Recordsize=16K: saved in six-wide stripe (four data, two parity)
Recordsize=32K: saved in ten-wide stripe (eight data, two parity)
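If it helps to see that arithmetic spelled out, here’s a rough back-of-the-envelope sketch in Python. This is just my own illustration of the division above, not anything ZFS actually runs, and it ignores padding and allocation details:

```python
# Rough illustration: how a small block lands on a ten-wide RAIDz2
# with ashift=12 (4 KiB sectors). Not real ZFS allocation logic.

SECTOR = 4 * 1024   # ashift=12 means 4 KiB sectors
DATA_DRIVES = 8     # ten-wide Z2: eight data drives + two parity
PARITY = 2

def stripe_width(recordsize: int) -> int:
    data_sectors = recordsize // SECTOR           # sectors of actual data
    drives_used = min(data_sectors, DATA_DRIVES)  # small blocks can't span all eight
    return drives_used + PARITY                   # total drives touched per stripe

for rs_kib in (4, 8, 16, 32):
    print(f"recordsize={rs_kib}K -> {stripe_width(rs_kib * 1024)}-wide stripe")
```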

So far, we’re literally only writing a single sector per drive for each of these very small block sizes on this relatively wide Z2 vdev. So our performance is going to suck pretty bad on this kind of workload.

Things begin to get better from here, although ten-wide Z2 is never going to be a high performance choice of vdev:

Recordsize=64K: saved in two-sector chunks on ten-wide stripe
Recordsize=128K: saved in four-sector chunks on ten-wide stripe
Recordsize=256K: saved in eight-sector chunks on ten-wide stripe
Recordsize=512K: saved in sixteen-sector chunks on ten-wide stripe
Recordsize=1M: saved in thirty-two-sector chunks on ten-wide stripe

So, for that last setting, recordsize=1MiB, we’re splitting 1,024KiB into eight pieces–each one 128KiB (thirty-two 4KiB sectors) wide. And we have parity-one and parity-two calculated in two more 128KiB pieces which go onto the remaining two drives of the vdev, in that full stripe.
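Same arithmetic for the larger recordsizes, again purely as an illustration rather than real OpenZFS logic:

```python
# Per-drive chunk size for full-width stripes on a ten-wide RAIDz2,
# ashift=12 (4 KiB sectors), eight data drives. Illustration only.

SECTOR_KIB = 4
DATA_DRIVES = 8

for rs_kib in (64, 128, 256, 512, 1024):
    chunk_kib = rs_kib // DATA_DRIVES   # data written to each data drive
    print(f"recordsize={rs_kib}K -> {chunk_kib // SECTOR_KIB} sectors "
          f"({chunk_kib}KiB) per data drive, plus two parity chunks the same size")
```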

With very wide vdevs–and yes, ten-wide Z2 definitely counts–you may want to consider going even higher than recordsize=1M, now that it’s possible to do so without messing around with kernel tunables. Essentially, the closer you can get your individual drives to doing 1MiB I/O operations, the better your throughput can get.
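For example, on a recent enough OpenZFS release this is just an ordinary property set (the pool/dataset name here is only a placeholder):

```sh
zfs set recordsize=4M tank/media
zfs get recordsize tank/media
```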

The only downside is, you’d better be certain you’re not putting any database-style storage on these wide vdevs and wide recordsizes! You don’t have to worry about small files–that 4KiB file is going to get stored in a single 4KiB data sector with however much parity or redundancy, regardless of any setting you choose–but if you wind up with something like SQLite flat-file “databases” or even real database engines like MySQL or PostgreSQL and very large recordsizes and/or wide vdevs, you will have a very bad time indeed.


I understand the math above, but I guess it’s still not clear to me why everything should always be in powers of 2. This is my thought process. Hopefully it makes sense.

For a pool with 3 vdevs, a recordsize of 768K or 960K should stripe evenly across the pool and still play nice with 4K sectors. And if I later move the dataset(s) to a pool with 2, 4, 6 or 8 vdevs, those recordsizes should still split evenly. Though I’m not sure how the data is redistributed when a dataset is received on another pool with a different vdev layout.

If I choose a recordsize of 960K…
2 vdevs: stripe is 480K, writes 120 4K sectors (per block device)
3 vdevs: stripe is 320K, writes 80 4K sectors
4 vdevs: stripe is 240K, writes 60 4K sectors
6 vdevs: stripe is 160K, writes 40 4K sectors
8 vdevs: stripe is 120K, writes 30 4K sectors
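Just to double-check my own arithmetic, a quick sketch (this only checks the division, and it assumes my premise that a record is striped evenly across vdevs holds):

```python
# Does a 960K record split into whole 4 KiB sectors across various vdev
# counts? Only checks the division -- says nothing about what ZFS does.

RECORDSIZE_KIB = 960
SECTOR_KIB = 4

for vdevs in (2, 3, 4, 6, 8):
    per_vdev_kib = RECORDSIZE_KIB // vdevs
    print(f"{vdevs} vdevs: {per_vdev_kib}K per vdev = "
          f"{per_vdev_kib // SECTOR_KIB} sectors of 4K")
```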

If it makes a difference, the vdevs are mirrors and not raidz. Typical files would be ISO images, HD movies, lossless audio, etc… No active VMs or databases.

Hopefully I’m not way off with this thought process.

Sure, and a sky of lilac waves with little green polka dots and pink paisley on top might look pretty, but such a thing does not exist. :cowboy_hat_face:

Recordsize is always a power of two, period. If it were not, it would be considerably easier to produce really maladaptive, inefficient configurations. Remember, you can play all sorts of monkeyshines far enough up the stack, but once you get down to the bare metal, the bare metal operates in binary increments, always.

This is why we went from 512-byte (2^9) sectors to 4KiB (2^12) sectors, instead of having 1.5KiB sectors or some such. And before anyone gets excited about 520-byte sectors in enterprise drives… those still only carry 512B of actual data; the remaining eight bytes are used for a checksum, not for actual data!

And they do so because moving data through the actual processor and RAM is done efficiently in precise powers of two, and precise powers of two only. These are binary systems, and that’s how binary systems must function at the bare-metal level if you want to achieve full efficiency.
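If you want a concrete way to see which of those candidate sizes qualify, here’s a quick illustration (my own sketch, nothing ZFS-specific):

```python
# A value is a power of two iff it has exactly one bit set,
# i.e. x & (x - 1) == 0 for x > 0. 768K and 960K fail this test.

def is_power_of_two(x: int) -> bool:
    return x > 0 and x & (x - 1) == 0

for kib in (128, 512, 768, 960, 1024):
    verdict = "power of two" if is_power_of_two(kib * 1024) else "NOT a power of two"
    print(f"{kib}K: {verdict}")
```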