I’ve bought a brand new Samsung 990 Pro NVMe SSD, only to find out that the device does not support 4K block sizes. By default it is formatted with 512B sectors, and there is no way to switch it to 4K. I tried with the nvme-cli tools and it only lists a 512B LBA format as supported.
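For reference, listing the supported LBA formats looks roughly like this (the device path is just an example, not my exact setup):

```
# Show the LBA formats the namespace advertises; -H gives human-readable output
nvme id-ns /dev/nvme0n1 -H | grep "LBA Format"
```

On this drive the output only shows a single 512-byte format.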
This SSD is meant to host a ZFS pool for various virtual machines, and I now wonder what a reasonable alignment would be. I typically align to 4K block sizes (-o ashift=12) even for block devices that report only 512B sectors, since I assume that is just an emulated size and the native internal block size is still 4K.
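For context, the pool creation I have in mind would basically be this (pool name and device path are placeholders):

```
# Force 4K alignment even though the drive reports 512B sectors
zpool create -o ashift=12 tank /dev/disk/by-id/<nvme-device-id>
```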
Does it ever make sense to go back to ashift=9 for this device or any modern SSD/HDD at all?
Internally, the actual flash block sizes are much larger than 4K, likely 128K and up as manufacturers move to larger flash chips. I’m not sure it makes sense to try to match that either, and in any case the vendors don’t publish those details anyway.
If possible, the best option is to try both block sizes, and perhaps even larger blocks (ashift=13 or 14) if you have reason to believe they better match your workload. I did try ashift=13 on SATA SSDs and could not measure any difference from ashift=12 using synthetic benchmarks. Of course, SATA != NVMe.
I did some other benchmarks, including kernel builds, to compare two pools: a single 1TB Samsung 980 PRO and a pair of mirrored 2TB Intel 670p drives. One of the Intel SSDs was in a PCIe adapter in an x1 slot, so full bandwidth was not available. I was surprised that kernel builds took about the same time on either pool, so storage bandwidth was not a limiting factor. This is on my “new” system with a Ryzen 7 7700X, and the builds were still processor-limited.
Thank you! You’re right, for this workload ashift=16 might be a good choice, given that the default qcow2 cluster size is also 64K. I’m not sure yet whether I’m going to use qcow2 or just raw images, though.
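If I do go with qcow2, pinning the cluster size explicitly would look something like this (image path and size are placeholders):

```
# Create a qcow2 image with an explicit 64K cluster size (which is also the default)
qemu-img create -f qcow2 -o cluster_size=64k /tank/vms/guest0.qcow2 100G
```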
I’m reasonably confident, however, that anything lower than 4K will not give me any meaningful performance penalty for this workload, with a larger block size possibly even being a bit better.
You don’t want ashift to match the qcow2 cluster size. If it does, you won’t be able to get any compression: ZFS can’t allocate anything smaller than one ashift-sized block, so a compressed 64K cluster would still take up a full 64K on disk.
Essentially, the only reason to bump the ashift up here is to make things less potentially problematic when and if you need to replace the disk and want to replace it with something that has 4K sectors. If the device itself is NVMe and reports 512B, in my experience that means performance will not improve with a larger ashift.
Either. In my experience, NVMe SSDs (decent ones, not bargain bin nonsense) that report as 512B will not gain increased performance by setting ashift higher than the 9 that 512B implies.
You may still want to use ashift=12 anyway, for future compatibility reasons. But it’s unlikely, IME, to gain you performance.
TLDR: Stick with the device blocksize if performance is an objective.
I got curious and ran some fio simulations. Overall the performance characteristics of 512B and 4K are fairly similar; however, a write-intensive simulation shows ashift=9 outperforming ashift=12 by about 30%. While those benchmarks are somewhat synthetic and need not translate into the same differences for your actual workload, I do think they are representative of my scenario qualitatively, though not necessarily quantitatively.
In more detail:
See my results in the table in results.txt and the corresponding repository, where I also put the fio simulation files and a description.
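To give a rough idea of the shape of those job files, a stripped-down sketch looks like this (path, sizes, and job mix are simplified placeholders, not my actual simulation files):

```
; Minimal sketch: a few "VM" jobs doing random reads and writes
; against the ZFS dataset under test (placeholder mountpoint below).
[global]
directory=/tank/fiotest
size=2g
bs=16k
ioengine=libaio
iodepth=8
runtime=60
time_based
group_reporting

[vm-readers]
numjobs=4
rw=randread

[vm-writers]
numjobs=2
rw=randwrite
```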
In my simulation, the scenario with mostly reading VMs shows on average 20% more throughput (and 20% more latency) for the 512B configuration, although the standard deviation is also 20% larger. Interestingly, the write throughput is 20% higher for the 4K case here, so the two roughly compensate each other.
Simulation 3 is a scenario with mostly writing VMs and shows some interesting differences. Here the 512B configuration outperforms 4K by about 30% in read bandwidth and 60% in write bandwidth, but again with standard deviations that are also about 30% and 60% larger, respectively. The latency is also about 30% lower for the 512B configuration. Here 512B really does show better performance than my 4K configuration.
Looking at both the bandwidth and the runtime stats of the whole simulation gives a reasonably consistent view:
- simulation1 (baseline): roughly similar for both configurations
- simulation2 (mostly reading VMs): 45.3/24.4 MiB/s at a runtime of 23s for 512B vs. 38.9/20 MiB/s at a runtime of 27s
- simulation3 (mostly writing VMs): 15.7/29.1 MiB/s at 36.5s for 512B vs. 12.2/22.6 MiB/s at 47s
Values are group stats and not directly translatable to nvme stats.
EDIT: Given the larger standard deviations, I find the overall runtime stats the more robust metric to look at. One could look at the histogram distributions that fio provides, but that’s beyond what I want to do right now.
Doesn’t this comment conflict with this statement from your ZFS 101 article for Ars Technica from back in 2020:
In real world terms, this amplification penalty hits a Samsung EVO SSD—which should have ashift=13, but lies about its sector size and therefore defaults to ashift=9 if not overridden by a savvy admin—hard enough to make it appear slower than a conventional rust disk.
From what I understand from some googling, as well as from this discussion, it’s pretty much a safe choice to use ashift=12 in most cases, including for the fake-512B SSDs. But reading the aforementioned article I noticed that statement, which implies some scary consequences. So now I’m wondering which it is.
I’ve got a similar setup as OP: a ZFS mirror of two 990 Pro NVMe drives for VM and container storage in Proxmox with the default configuration: ashift=12, recordsize=128K for containers and volblocksize=16K for VM volumes. I’m new to ZFS and trying to learn the basics to avoid the possible pitfalls. My main concern is not to kill the SSDs prematurely through write amplification due to misconfiguration.
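For reference, checking and reproducing those settings by hand would look roughly like this (the pool/dataset names follow the usual Proxmox layout and are placeholders for my actual ones):

```
# Inspect the properties Proxmox applied to the container dataset and a VM zvol
zfs get recordsize rpool/data
zfs get volblocksize rpool/data/vm-100-disk-0
# Manually creating a VM disk zvol with the same 16K volblocksize
zfs create -V 32G -o volblocksize=16K rpool/data/vm-100-disk-1
```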
I also have the same question about conflicting ashift advice.
For the 2 TB 990 Pro, would setting ashift=9 increase or decrease its write amplification or performance? It’s confusing as well because some SSDs like the SN850X can apparently be switched to a 4K LBA format through nvme-cli. Would that mean ashift=12 would then let ZFS align perfectly with the drive? It became even more confusing when I found a Reddit post where multiple users reported using ashift=14 on an SN850X.
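For drives that do advertise a 4K format (unlike the 990 Pro, which apparently only lists 512B), the switch via nvme-cli looks roughly like this; note that formatting wipes the namespace, and the device path and format index below are placeholders:

```
# List the advertised LBA formats and their indices
nvme id-ns /dev/nvme0n1 -H | grep "LBA Format"
# Select the 4096-byte format by its index (destructive: erases the namespace!)
nvme format /dev/nvme0n1 --lbaf=1
```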
I am currently looking to do something similar to what the poster above has done, except using ZFS on Debian. It would be a ZFS mirror, with the main hope being the drives’ longevity.
The problem with these things is we’ve no idea what internal flash page size is sitting on the far side of the drive’s flash translation layer. It’s almost certain to be at least 4k though.
The second problem with over-optimizing here is the very real possibility of running into issues down the road when combining (or needing to combine) devices with mixed ashift values. I believe most everything in the ZFS world defaults to 12 these days – it works with 512e (512 logical / 4K physical), it works with 4Kn, and ashift=13 outperformance seems to have been mostly limited to a handful of SSDs from 8-10 years ago.
ashift=9 in 2025 sounds like a bug in search of a windshield.
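If you want to check what an existing pool’s vdevs are actually using before adding or replacing anything, something like this works (pool name is a placeholder):

```
# Per-vdev ashift as recorded in the pool configuration
zdb -C tank | grep ashift
# Pool-level ashift property used for newly added vdevs (0 means auto-detect)
zpool get ashift tank
```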
Personally, I’ve benched all of my NVMe drives as both 512e and 4Kn, and the performance gap has always been within 5% for any access pattern, read or write. No real winner here. I’ve yet to see ashift=13 outperform with any of them, old or new. I guess a big performance win might make me reconsider, but I’m still waiting for that day. I reckon the bottom line is: test and see before adding anything to a pool.
As for write amplification, I’m in favor of having a UPS, setting a long txg commit interval, and having a ZIL (SLOG) device present. Better to provide more time to gather/coalesce those choppy, tiny, amplification-prone writes into larger sequential writes. The ZIL/SLOG is there to keep sync writes from spoiling the long and lazy txg soup (everyone says “not needed” since few of us have any sync writes – I emphatically disagree and will die on this hill).
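On Linux, a sketch of what that looks like in practice (pool name, device path, and the 30-second interval are just examples):

```
# Lengthen the txg commit interval from the default 5s to 30s (runtime setting)
echo 30 > /sys/module/zfs/parameters/zfs_txg_timeout
# Make it persistent across reboots
echo "options zfs zfs_txg_timeout=30" >> /etc/modprobe.d/zfs.conf
# Add a dedicated SLOG so the ZIL records for sync writes land on the log device
# instead of the main vdevs
zpool add tank log /dev/disk/by-id/nvme-slog-device
```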