Encrypt SLOG or Special Vdevs?

I’m standing up a new storage server on a reasonably beefy machine: Ryzen 7 9700X, 32 GiB DDR5, four 14 TB reconditioned data center SATA drives, and two 1 TB NVMe drives on PCIe Gen4 x4. The ZFS pools will use LUKS-encrypted partitions. I plan to use a SLOG and a special vdev. My question: should the SLOG or special vdevs also be LUKS-encrypted?

Background: the plan is to use the NVMe drives for an 8 GiB mirrored SLOG, a XX GiB mirrored special vdev (no small files stored), and the rest as a fast mirrored pool. I plan to over-provision by leaving 25% of the capacity unpartitioned to reduce the impact of the SLOG being write-heavy.
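For concreteness, the sort of partition layout I have in mind is sketched below. The device name is just whatever one of the drives enumerates as, and every size other than the 8 GiB SLOG is a placeholder (including the special vdev size, which I haven’t settled on):

```
# Rough sketch only -- device name and all sizes except the SLOG are placeholders.
sgdisk -n 1:0:+8G   /dev/nvme0n1   # SLOG slice, mirrored against the matching slice on the other drive
sgdisk -n 2:0:+64G  /dev/nvme0n1   # special vdev slice (the "XX GiB" above; 64G is purely a stand-in)
sgdisk -n 3:0:+620G /dev/nvme0n1   # fast mirrored pool slice, stopping well short of the end
# The remaining ~25% of the drive stays unpartitioned for over-provisioning.
```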

Note: I’m pretty sure it would be better not to partition the NVMe drives for multiple functions like this. But the budget has been exhausted, and this is a consumer motherboard, so it can’t hold very many NVMe drives. As such, sharing the drives across multiple functions seems a reasonable compromise for this decidedly non-enterprise use. It will be holding a media library served via NFS/Samba, and VM images. Please advise if this is still a bad idea.

Still a bad idea, sorry. Especially when you said the magic word “NVMe” without specifying what model of NVMe drive.

I know the ludicrously simple marketing benchmarks make every bottom-dollar, consumer-targeted NVMe drive look like it’s dripping in magic dust, but when you hit them with heavy-duty real-world workloads (and acting as a SLOG and a SPECIAL absolutely counts as a heavy-duty workload; provisioning them when you DON’T have a heavy-duty workload is a complete waste of money), they fall flat on their faces.

A consumer NVMe drive (with no further specifications given) is usually not a great idea for ONLY a LOG or SPECIAL vdev, let alone trying to do both at once.

My advice would be, if you’re sure you’re actually going to have a lot of sync writes in your workload–which the vast majority of workloads do not, the most common exceptions being database engines and NFS servers–then provision one of those NVMe SSDs as a LOG vdev, and nothing else. You don’t need the SPECIAL, I promise, especially at this scale.
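If you do go that route, it’s a single command; the pool name and device path below are placeholders, not prescriptions:

```
# Add one NVMe drive as a single, unmirrored LOG vdev.
zpool add tank log /dev/disk/by-id/nvme-EXAMPLE_DRIVE_1
```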

If you have absolutely no other use for the second NVMe drive whatsoever, you can either press it into service as a CACHE vdev, or you can add it as a second LOG vdev (not adding it to the first as a mirror, adding it to the pool as ANOTHER log vdev) to decrease the workload when dealing with bursts of sync writes.
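Either option is also a one-liner; again, the pool name and device path here are placeholders:

```
# Option A: press the second drive into service as a CACHE (L2ARC) vdev.
zpool add tank cache /dev/disk/by-id/nvme-EXAMPLE_DRIVE_2

# Option B: add it as a second, independent LOG vdev. Plain "zpool add ... log"
# creates another top-level log vdev; "zpool attach" is what would mirror it
# onto the first one instead.
zpool add tank log /dev/disk/by-id/nvme-EXAMPLE_DRIVE_2
```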

Another possibility, and this is more frequently the way I go: just don’t put your heavy-duty workloads on the rust in the first place. Make the rust a bulk storage pool that just gets bulk storage data on it, and relegate the difficult workloads (VM root drives, database engines, etc.) to a second pool, built on nothing but the SSDs.
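In rough strokes (pool names, topology, and device paths below are all assumptions, not prescriptions), that looks like:

```
# Bulk pool on the rust. raidz2 is shown purely as an example; a pair of
# mirrors would also be a reasonable topology for four drives.
zpool create bulk raidz2 /dev/disk/by-id/ata-HDD_1 /dev/disk/by-id/ata-HDD_2 \
                         /dev/disk/by-id/ata-HDD_3 /dev/disk/by-id/ata-HDD_4

# Fast pool for the difficult workloads, on nothing but the two SSDs.
zpool create fast mirror /dev/disk/by-id/nvme-SSD_1 /dev/disk/by-id/nvme-SSD_2
```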

I’m still concerned that you may end up very disappointed by those NVMe drives anyway, but they should at least significantly outperform the rust in the first pool, even if they do fall flat on their faces when their write cache is exhausted (which generally happens within 5-30 seconds of heavy write activity, on most consumer-targeted SSDs, and SOONER on most consumer-targeted NVMe drives).

I am genuinely sorry, I know this is unpleasant to hear. <3

And to be clear, if you have a VERY light workload, you may very well be perfectly happy with your NVMe drives, even if they are inexpensive consumer models. I occasionally press one into service in a desktop machine myself, although if we’re being entirely honest about it, only when I wind up with one on my hands for free for some reason.

But I’ve seen them crap out on demanding workloads way too frequently not to cringe every time I see somebody just say a drive they’re planning to use for a pool is “NVMe,” without describing it any further. :cry:

Thanks for the advice. I will indeed re-evaluate.

BTW, the drives are 1 TB Samsung SSD 990 Pro. Higher end but consumer just the same.

It is no longer as important, but I am still curious about encrypting SLOG or special vdevs. It seems to me that not encrypting them would potentially expose information and negate the encryption of the other vdevs.


You are correct: not encrypting a LOG vdev absolutely exposes every sync write you make, though only briefly.
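If you want the LOG covered by the same LUKS scheme as your data vdevs, the usual pattern is to add the mapped device rather than the raw partition. A minimal sketch, assuming a pool named “tank” and placeholder device/mapping names:

```
# Encrypt the SLOG partition, open it, and add the LUKS mapping as the log vdev.
cryptsetup luksFormat /dev/nvme0n1p1
cryptsetup open /dev/nvme0n1p1 slog_crypt
zpool add tank log /dev/mapper/slog_crypt
```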

Specials are a bit trickier. With default settings, all you expose with an unencrypted special is the metadata. Not encrypting your metadata may expose potentially sensitive information like file names, and could VERY theoretically make it possible for an INCREDIBLY advanced attacker in a really odd scenario to directly edit your metadata, giving them a possible toehold for manipulating your storage in a way that might grant them a pivot to some other layer.

That’s all VERY theoretical though; the only really significant risk there is exposing your metadata to reads. That’s assuming default settings, though. If you set special_small_blocks to 4K, for example, you’ll also expose all your dotfiles, since the DATA blocks of a file that small also hit the special in that case. You get the idea.
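For reference, checking and changing that knob looks like this (the dataset name is a placeholder):

```
# Default is 0: only metadata lands on the special vdev.
zfs get special_small_blocks tank/home
# Route data blocks of 4K and smaller onto the special vdev as well.
zfs set special_small_blocks=4K tank/home
```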

The Samsung 990 Pro is a great example of a theoretically prosumer NVMe drive that is far less capable than it appears. Yes, it supports massively high throughput for the simplest possible workloads (which you don’t have many, or any, of in real life). But unlike the generations of SATA Samsung Pro models that came before it, the 990 “Pro” is all TLC, not MLC like the “slower” older SATA models, which means it REALLY falls off a cliff (e.g., write throughput of 10 MiB/sec or less, with CRIPPLING associated latency) once you exhaust the small SLC cache allocated “at the top” of its physical media.

Meanwhile, the “slower” SATA Samsung Pros were ENTIRELY allocated MLC, and as a result could be relied on to keep providing 250+MiB/sec even for much more difficult workloads that might stay that busy for HOURS.

And even those older “Pros” get their asses handed to them–again, in tougher and more sustained workloads, NOT on simple single process sequential “tests”–by most proper enterprise or datacenter grade drives, like the Kingston DC600Ms I recommend so frequently.

If you’d like to examine the difference between “consumer fast” and “real workload fast” in more detail, check out this shootout I did for Ars a few years ago: https://arstechnica.com/gadgets/2020/12/high-end-sata-ssd-shootout-samsung-860-pro-vs-kingston-dc500m/

Spoiler for those who won’t click through:

This is one of those unrealistically easy workloads I mentioned, and it’s still WAY more difficult than the ones the vendors themselves actually use, which are almost always pure consecutive writes, like you’d issue to a tape drive.

Oh shit, sync writes have entered the picture! This is what you get from database engines, the vast majority of NFS exports, a lot of the stuff hypervisors do with storage, and occasional bursts of “just because” from any given application or library that wants to make damn sure all of its writes have been committed before it does anything else.

A surprising number of workloads demand not just “lots of operations fast”, but “lots of individual storage operations fast, and the actual thing you’re trying to do doesn’t finish until ALL of those ops finish.” This is what we’re modeling here, and despite modeling it gently (with sixteen individual, fully asynchronous processes, allowing parallelization) we can see how frequently we’re going to get a crappy result out of the Samsung.
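If you want to model something similar yourself, a fio invocation along these lines is in the right neighborhood. It is not the article’s exact job file, and every parameter here is an assumption:

```
# Sixteen parallel jobs, each doing 4K random writes with an fsync after
# every write -- a gentle stand-in for "lots of parallel sync writes".
fio --name=syncwrites --directory=/tank/test --rw=randwrite --bs=4k \
    --fsync=1 --numjobs=16 --size=1G --runtime=60 --time_based \
    --group_reporting
```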

If you need ten storage ops to complete before the application you’re using visibly does the thing you want it to, then even with those ops in parallel, your application latency isn’t the median or the average latency; it’s the slowest return out of all ten parallel ops.

Now, look again at that latency chart. By the 95th percentile, the Samsung is taking 52ms to complete a write, while the Kingston is still finishing in a mere two milliseconds.

That’s bad enough when all you need is a single op: but if you need 10 storage ops to complete for each application op, that means a typical application op completes in <2ms for the Kingston, and WELL over 25ms for the Samsung. Brutal!
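Back-of-the-envelope, that’s just the tail doing its thing: with ten ops in flight, the odds that at least one of them lands at or beyond the drive’s 95th-percentile latency come out to roughly 40%.

```
# Probability that at least one of ten parallel ops hits the p95 (or worse):
echo "1 - 0.95^10" | bc -l   # ~= 0.40
```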

Thanks Jim. Looks like I got snookered by their marketing department. I don’t like it when that happens but I’m a more educated purchaser now. :slightly_smiling_face:


@mercenary_sysadmin

Can the same thing be said of the Kingston DC600M drives, as they appear to be the replacement for the DC500M? If you want an M.2-format NVMe drive, would you suggest the DC2000B?

I haven’t done the extensive benchmarking on the DC600M, but I’ve been using them in production, and I haven’t seen anything that makes me feel like they’re a lesser model than the DC500M series were.

I’ve got probably thirty or forty of the DC600M in the field right now.

Probably! I haven’t ever had one of the DC2000 series on the bench, but I did test one of their earlier U.2 series (DC1000M, I think?) on a PCIe U.2 host interface, and it was certainly a baller (and then some).

That might also be something to keep in mind, btw: although chassis with U.2 hotswap bays are still astronomically expensive as far as I’ve been able to shop, you can buy very inexpensive PCIe x4 U.2 adapters (I paid $15 for the one I bought in order to test the Kingston U.2 drive).

Going with an inexpensive U.2 PCIe adapter lets you get more NVMe drives into your box than your mobo gave you M.2 slots–and it also widens your available choice of drives.

You’re also going to tend to get higher reliability and longer endurance out of U.2 drives, because they’re not forced into that horrible little candy-stick M.2 format. Looking at the DC1500M vs the DC2000B specifically, the former is rated for 1.0 DWPD (Drive Writes Per Day), and the latter only at 0.4 DWPD. That’s a massive difference!
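To put that difference in perspective, here’s the rough endurance math, assuming (hypothetically) 960 GB models and the usual five-year warranty window:

```
# Total terabytes written over a five-year warranty, at 960 GB capacity:
echo "1.0 * 0.96 * 365 * 5" | bc -l   # DC1500M at 1.0 DWPD: ~1752 TB
echo "0.4 * 0.96 * 365 * 5" | bc -l   # DC2000B at 0.4 DWPD: ~700 TB
```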

Now, for M.2 NVMe specifically, I do like the looks of the DC2000B. In addition to the power-loss protection and, presumably (I’m guessing here, unfortunately), the same or similar hardware QoS as the DC600M SATA and DC1500M U.2 lines offer, it’s got an integrated aluminum heat spreader/sink. That should hopefully at least mitigate the M.2 form factor’s thermal-dissipation problems a little.