Should recordsize be divisible by number of vdevs?

I know striping in ZFS is dynamic, and will distribute the data as optimally as possible across the pool.

I’m wondering if setting the recordsize according to the number of vdevs in the pool would be better for performance or avoiding fragmentation?

I remember seeing an example on r/zfs where ZFS breaks up a 128K record into 32K chunks when there are 4 vdevs in the pool. But what if the record can’t be divided evenly?

Example: A fresh pool on 3 equal vdevs. For a dataset with large files, would a recordsize of 768K be better than 1M since 768 is divisible by 3?

And how does ZFS distribute the data if the dataset is sent to another pool with 2 or maybe 4 vdevs? 768 is also divisible by those numbers.


Recordsize must always be a power of 2, so this doesn’t come up. It’s the other way around: you size your vdevs so that recordsize divides evenly across them, which in practice means the number of drives minus the parity level should itself be a power of 2.

For example, consider a ten-wide RAIDz2 vdev: each stripe is data across eight drives and parity across two.

Every possible recordsize will divide evenly by eight (although some may wind up requiring so few sectors that they get an undersized stripe). If we assume ashift=12 and therefore 4KiB sectors:

Recordsize=4K: saved in three-wide stripe (one data, two parity)
Recordsize=8K: saved in four-wide stripe (two data, two parity)
Recordsize=16K: saved in six-wide stripe (four data, two parity)
Recordsize=32K: saved in ten-wide stripe (eight data, two parity)
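If it helps to see that arithmetic spelled out, here’s a rough back-of-the-envelope sketch in Python. This is just my own illustration of the division above, not anything ZFS actually runs, and it ignores padding and allocation details:

```python
# Rough illustration: how a small block lands on a ten-wide RAIDz2
# with ashift=12 (4 KiB sectors). Not real ZFS allocation logic.

SECTOR = 4 * 1024   # ashift=12 means 4 KiB sectors
DATA_DRIVES = 8     # ten-wide Z2: eight data drives + two parity
PARITY = 2

def stripe_width(recordsize: int) -> int:
    data_sectors = recordsize // SECTOR           # sectors of actual data
    drives_used = min(data_sectors, DATA_DRIVES)  # small blocks can't span all eight
    return drives_used + PARITY                   # total drives touched per stripe

for rs_kib in (4, 8, 16, 32):
    print(f"recordsize={rs_kib}K -> {stripe_width(rs_kib * 1024)}-wide stripe")
```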

So far, we’re literally only writing a single sector per drive for each of these very small block sizes on this relatively wide Z2 vdev. So our performance is going to suck pretty bad on this kind of workload.

Things begin to get better from here, although ten-wide Z2 is never going to be a high performance choice of vdev:

Recordsize=64K: saved in two-sector chunks on ten-wide stripe
Recordsize=128K: saved in four-sector chunks on ten-wide stripe
Recordsize=256K: saved in eight-sector chunks on ten-wide stripe
Recordsize=512K: saved in sixteen-sector chunks on ten-wide stripe
Recordsize=1M: saved in thirty-two-sector chunks on ten-wide stripe

So, for that last setting, recordsize=1MiB, we’re splitting 1,024KiB into eight pieces–each one 128KiB (thirty-two 4KiB sectors) wide. And we have parity-one and parity-two calculated in two more 128KiB pieces which go onto the remaining two drives of the vdev, in that full stripe.
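Same arithmetic for the larger recordsizes, again purely as an illustration rather than real OpenZFS logic:

```python
# Per-drive chunk size for full-width stripes on a ten-wide RAIDz2,
# ashift=12 (4 KiB sectors), eight data drives. Illustration only.

SECTOR_KIB = 4
DATA_DRIVES = 8

for rs_kib in (64, 128, 256, 512, 1024):
    chunk_kib = rs_kib // DATA_DRIVES   # data written to each data drive
    print(f"recordsize={rs_kib}K -> {chunk_kib // SECTOR_KIB} sectors "
          f"({chunk_kib}KiB) per data drive, plus two parity chunks the same size")
```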

With very wide vdevs–and yes, ten-wide Z2 definitely counts–you may want to consider going even higher than recordsize=1M, now that it’s possible to do so without messing around with kernel tunables. Essentially, the closer you can get your individual drives to doing 1MiB I/O operations, the better your throughput can get.
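For example, on a recent enough OpenZFS release this is just an ordinary property set (the pool/dataset name here is only a placeholder):

```sh
zfs set recordsize=4M tank/media
zfs get recordsize tank/media
```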

The only downside is, you’d better be certain you’re not putting any database-style storage on these wide vdevs and wide recordsizes! You don’t have to worry about small files–that 4KiB file is going to get stored in a single 4KiB data sector with however much parity or redundancy, regardless of any setting you choose–but if you wind up with something like SQLite flat-file “databases” or even real database engines like MySQL or PostgreSQL and very large recordsizes and/or wide vdevs, you will have a very bad time indeed.


I understand the math above, but I guess it’s still not clear to me why everything should always be in powers of 2. This is my thought process. Hopefully it makes sense.

For a pool with 3 vdevs, a recordsize of 768K or 960K should stripe evenly across the pool and still play nice with 4K sectors. And if I later move the dataset(s) to a pool with 2, 4, 6 or 8 vdevs, those recordsizes should still split evenly. Though I’m not sure how the data is redistributed when a dataset is received on another pool with a different vdev layout.

If I choose a recordsize of 960K…
2 vdevs: stripe is 480K, writes 120 4K sectors (per block device)
3 vdevs: stripe is 320K, writes 80 4K sectors
4 vdevs: stripe is 240K, writes 60 4K sectors
6 vdevs: stripe is 160K, writes 40 4K sectors
8 vdevs: stripe is 120K, writes 30 4K sectors
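Just to double-check my own arithmetic, a quick sketch (this only checks the division, and it assumes my premise that a record is striped evenly across vdevs holds):

```python
# Does a 960K record split into whole 4 KiB sectors across various vdev
# counts? Only checks the division -- says nothing about what ZFS does.

RECORDSIZE_KIB = 960
SECTOR_KIB = 4

for vdevs in (2, 3, 4, 6, 8):
    per_vdev_kib = RECORDSIZE_KIB // vdevs
    print(f"{vdevs} vdevs: {per_vdev_kib}K per vdev = "
          f"{per_vdev_kib // SECTOR_KIB} sectors of 4K")
```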

If it makes a difference, the vdevs are mirrors and not raidz. Typical files would be ISO images, HD movies, lossless audio, etc… No active VMs or databases.

Hopefully I’m not way off with this thought process.

Sure, and a sky of lilac waves with little green polka dots and pink paisley on top might look pretty, but such a thing does not exist. :cowboy_hat_face:

Recordsize is always a power of two, period. If it were not, it would be considerably easier to produce really maladaptive, inefficient configurations. Remember, you can play all sorts of monkeyshines far enough up the stack, but once you get down to the bare metal, the bare metal operates in binary increments, always.

This is why we went from 512-byte (2^9) sectors to 4KiB (2^12) sectors, instead of having 1.5KiB sectors or some such. And before anyone gets excited about 520-byte sectors in enterprise drives… those still only carry 512B of actual data; the remaining eight bytes are used for a checksum, not for actual data!

And they do so because moving data through the actual processor and RAM is done efficiently in precise powers of two, and precise powers of two only. These are binary systems, and that’s how binary systems must function at the bare-metal level if you want to achieve full efficiency.
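If you want a concrete way to see which of those candidate sizes qualify, here’s a quick illustration (my own sketch, nothing ZFS-specific):

```python
# A value is a power of two iff it has exactly one bit set,
# i.e. x & (x - 1) == 0 for x > 0. 768K and 960K fail this test.

def is_power_of_two(x: int) -> bool:
    return x > 0 and x & (x - 1) == 0

for kib in (128, 512, 768, 960, 1024):
    verdict = "power of two" if is_power_of_two(kib * 1024) else "NOT a power of two"
    print(f"{kib}K: {verdict}")
```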