I’ve bought a brand new Samsung 990 Pro NVMe SSD, only to find out that the device does not support 4K block sizes. By default it is formatted with 512B sectors, and there is no way to switch it to 4K. I tried with the nvme-cli tools and it only lists a 512B LBA format as supported.
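For reference, listing the supported LBA formats looks roughly like this (the device path is just an example, not my exact setup):

```
# Show the LBA formats the namespace advertises; -H gives human-readable output
nvme id-ns /dev/nvme0n1 -H | grep "LBA Format"
```

On this drive the output only shows a single 512-byte format.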
This SSD is meant to host a ZFS pool for various virtual machines, and I now wonder what a reasonable alignment would be. I typically align to 4K block sizes (-o ashift=12) even for block devices that report only 512B sectors, since I assume that is just an emulated size and the native internal block size is still 4K.
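For context, the pool creation I have in mind would basically be this (pool name and device path are placeholders):

```
# Force 4K alignment even though the drive reports 512B sectors
zpool create -o ashift=12 tank /dev/disk/by-id/<nvme-device-id>
```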
Does it ever make sense to go back to ashift=9 for this device or any modern SSD/HDD at all?
Internally, the actual flash block sizes are much larger than 4K, likely 128K and up as manufacturers move to larger flash chips. I’m not sure it makes sense to try to match that either, and in any case the vendors don’t publish those details anyway.
If possible, the best option is to try both block sizes, and perhaps even larger blocks (ashift=13 or 14) if you have reason to believe they better match your workload. I did try ashift=13 on SATA SSDs and could not measure any difference from ashift=12 using synthetic benchmarks. Of course, SATA != NVMe.
I did some other benchmarks, including kernel builds, to compare two pools: a single 1TB Samsung 980 PRO and a pair of mirrored 2TB Intel 670p drives. One of the Intel SSDs was in a PCIe adapter in an x1 slot, so full bandwidth was not available. I was surprised that kernel builds took about the same time on either pool, so storage bandwidth was not a limiting factor. This is on my “new” system with a Ryzen 7 7700X, and the builds were still processor-limited.
Thank you! You’re right, for this workload ashift=16 might be a good choice, given that the default qcow2 cluster size is also 64K. I’m not sure yet whether I’m going to use qcow2 or just raw images, though.
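If I do go with qcow2, pinning the cluster size explicitly would look something like this (image path and size are placeholders):

```
# Create a qcow2 image with an explicit 64K cluster size (which is also the default)
qemu-img create -f qcow2 -o cluster_size=64k /tank/vms/guest0.qcow2 100G
```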
I’m reasonably confident, however, that anything lower than 4K will not give me any meaningful performance penalty for this workload, with a larger block size possibly even being a bit better.
You don’t want ashift to match the qcow2 cluster size. If it does, you won’t be able to get any compression: ZFS can’t allocate anything smaller than one ashift-sized block, so a compressed 64K cluster would still take up a full 64K on disk.
Essentially, the only reason to bump the ashift up here is to make things less potentially problematic when and if you need to replace the disk and want to replace it with something that has 4K sectors. If the device itself is NVMe and reports 512B, in my experience that means performance will not improve with a larger ashift.
Either. In my experience, NVMe SSDs (decent ones, not bargain bin nonsense) that report as 512B will not gain increased performance by setting ashift higher than the 9 that 512B implies.
You may still want to use ashift=12 anyway, for future compatibility reasons. But it’s unlikely, IME, to gain you performance.
TLDR: Stick with the device blocksize if performance is an objective.
I got curious and ran some fio simulations. Overall the performance characteristics of 512B and 4K are fairly similar; however, a write-intensive simulation shows ashift=9 outperforming ashift=12 by about 30%. While those benchmarks are somewhat synthetic and need not translate into the same differences for your actual workload, I do think they are representative of my scenario qualitatively, though not necessarily quantitatively.
In more detail:
See my results in the table in results.txt and the corresponding repository, where I also put the fio simulation files and a description.
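To give a rough idea of the shape of those job files, a stripped-down sketch looks like this (path, sizes, and job mix are simplified placeholders, not my actual simulation files):

```
; Minimal sketch: a few "VM" jobs doing random reads and writes
; against the ZFS dataset under test (placeholder mountpoint below).
[global]
directory=/tank/fiotest
size=2g
bs=16k
ioengine=libaio
iodepth=8
runtime=60
time_based
group_reporting

[vm-readers]
numjobs=4
rw=randread

[vm-writers]
numjobs=2
rw=randwrite
```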
In my simulation, the scenario with mostly reading VMs shows on average 20% more throughput (and 20% more latency) for the 512B configuration, although the standard deviation is also 20% larger. Interestingly, the write throughput is 20% higher for the 4K case here, so the two roughly compensate each other.
Simulation 3 is a scenario with mostly writing VMs and shows some interesting differences. Here the 512B configuration outperforms 4K by about 30% in read bandwidth and 60% in write bandwidth, but again with standard deviations that are also about 30% and 60% larger, respectively. The latency is also about 30% lower for the 512B configuration. Here 512B really does show better performance than my 4K configuration.
Looking at both the bandwidth and the runtime stats of the whole simulation gives a reasonably consistent view:
- simulation1 (baseline): roughly similar for both configurations
- simulation2 (mostly reading VMs): 45.3/24.4 MiB/s at a runtime of 23s for 512B vs. 38.9/20 MiB/s at a runtime of 27s
- simulation3 (mostly writing VMs): 15.7/29.1 MiB/s at 36.5s for 512B vs. 12.2/22.6 MiB/s at 47s
Values are group stats and not directly translatable to nvme stats.
EDIT: Given the larger standard deviations, I find the overall runtime stats the more robust metric to look at. One could look at the histogram distributions that fio provides, but that’s beyond what I want to do right now.
Doesn’t this comment conflict with this statement from your ZFS 101 article for Ars Technica from back in 2020:
In real world terms, this amplification penalty hits a Samsung EVO SSD—which should have ashift=13, but lies about its sector size and therefore defaults to ashift=9 if not overridden by a savvy admin—hard enough to make it appear slower than a conventional rust disk.
From what I understand from some googling, as well as from this discussion, it’s pretty much a safe choice to use ashift=12 in most cases, including for the fake-512B SSDs. But reading the aforementioned article I noticed that statement, which implies some scary consequences. So now I’m wondering which it is.
I’ve got a similar setup as OP: a ZFS mirror of two 990 Pro NVMe drives for VM and container storage in Proxmox with the default configuration: ashift=12, recordsize=128K for containers and volblocksize=16K for VM volumes. I’m new to ZFS and trying to learn the basics to avoid the possible pitfalls. My main concern is not to kill the SSDs prematurely through write amplification due to misconfiguration.
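For reference, checking and reproducing those settings by hand would look roughly like this (the pool/dataset names follow the usual Proxmox layout and are placeholders for my actual ones):

```
# Inspect the properties Proxmox applied to the container dataset and a VM zvol
zfs get recordsize rpool/data
zfs get volblocksize rpool/data/vm-100-disk-0
# Manually creating a VM disk zvol with the same 16K volblocksize
zfs create -V 32G -o volblocksize=16K rpool/data/vm-100-disk-1
```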
I also have the same question about conflicting ashift advice.
For the 2 TB 990 Pro, would setting ashift=9 increase or decrease its write amplification or performance? It’s confusing as well because some SSDs like the SN850X can apparently be switched to a 4K LBA format through nvme-cli. Would that mean ashift=12 would then let ZFS align perfectly with the drive? It became even more confusing when I found a Reddit post where multiple users reported using ashift=14 on an SN850X.
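For drives that do advertise a 4K format (unlike the 990 Pro, which apparently only lists 512B), the switch via nvme-cli looks roughly like this; note that formatting wipes the namespace, and the device path and format index below are placeholders:

```
# List the advertised LBA formats and their indices
nvme id-ns /dev/nvme0n1 -H | grep "LBA Format"
# Select the 4096-byte format by its index (destructive: erases the namespace!)
nvme format /dev/nvme0n1 --lbaf=1
```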
I am currently looking to do something similar to what the poster above has done, except using ZFS on Debian. It would be a ZFS mirror, with the main hope being the drives’ longevity.
The problem with these things is we’ve no idea what internal flash page size is sitting on the far side of the drive’s flash translation layer. It’s almost certain to be at least 4k though.
The second problem with over-optimizing here is the very real possibility of running into issues down the road when combining (or needing to combine) devices with mixed ashift values. I believe most everything in the ZFS world defaults to 12 these days – it works with 512e (512 logical / 4K physical), it works with 4Kn, and ashift=13 outperformance seems to have been mostly limited to a handful of SSDs from 8-10 years ago.
ashift=9 in 2025 sounds like a bug in search of a windshield.
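If you want to check what an existing pool’s vdevs are actually using before adding or replacing anything, something like this works (pool name is a placeholder):

```
# Per-vdev ashift as recorded in the pool configuration
zdb -C tank | grep ashift
# Pool-level ashift property used for newly added vdevs (0 means auto-detect)
zpool get ashift tank
```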
Personally, I’ve benched all of my NVMe drives as both 512e and 4Kn, and the performance gap has always been within 5% for any access pattern, read or write. No real winner here. I’ve yet to see ashift=13 outperform with any of them, old or new. I guess a big performance win might make me reconsider, but I’m still waiting for that day. I reckon the bottom line is: test and see before adding anything to a pool.
As for write amplification, I’m in favor of having a UPS, setting a long txg commit interval, and having a ZIL (SLOG) device present. Better to provide more time to gather/coalesce those choppy, tiny, amplification-prone writes into larger sequential writes. The ZIL/SLOG is there to keep sync writes from spoiling the long and lazy txg soup (everyone says “not needed” since few of us have any sync writes – I emphatically disagree and will die on this hill).
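On Linux, a sketch of what that looks like in practice (pool name, device path, and the 30-second interval are just examples):

```
# Lengthen the txg commit interval from the default 5s to 30s (runtime setting)
echo 30 > /sys/module/zfs/parameters/zfs_txg_timeout
# Make it persistent across reboots
echo "options zfs zfs_txg_timeout=30" >> /etc/modprobe.d/zfs.conf
# Add a dedicated SLOG so the ZIL records for sync writes land on the log device
# instead of the main vdevs
zpool add tank log /dev/disk/by-id/nvme-slog-device
```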