Recommendations for ZFS as NAS for Proxmox cluster

Hi, I plan to deploy an Ubuntu 24.04 server with 6x1TB SAS SSDs and 12x2TB HDDs as a dedicated storage server for 3 or 4 other servers running a Proxmox cluster.
The server has 2x12-core CPUs and 192GB DDR4 RAM and will be used only as a storage server.
I want to store the VM disks on this server as shared storage for the cluster, to be able to move the VMs around the cluster.
The storage server will be connected with 2x10G, and each Proxmox server will have a dedicated 10G NIC for storage-only access.

What is the best current practice for a setup like this?
qcow2 over NFS, or zvols over iSCSI?

What settings should I use for ZFS to get good performance and other tuning tips?
I will probably make a RAID10 of all the HDDs (stripe of mirrors) and use the SSDs as L2ARC, SLOG, and maybe a special vdev if that makes sense.
What other tunables should I change? Recordsize, and what else?
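
Roughly the layout I have in mind, with hypothetical device names:

```
# stripe of mirrors across the twelve HDDs
zpool create tank \
  mirror sda sdb mirror sdc sdd mirror sde sdf \
  mirror sdg sdh mirror sdi sdj mirror sdk sdl

# SSDs as support vdevs, if they turn out to make sense
zpool add tank log mirror sdm sdn      # SLOG
zpool add tank cache sdo               # L2ARC
zpool add tank special mirror sdp sdq  # special vdev for metadata
```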

Thanks for any input!
/Markus

You probably won’t benefit noticeably from L2ARC.

SLOG will only benefit you if you have a heavily synchronous write workload (for example, if you use sync NFS for your storage transport, or if you’re running a lot of database engines in your VMs). Most workloads are not particularly synchronous, so I wouldn’t implement this blindly.
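
If you're not sure how synchronous your workload actually is, ZFS on Linux exposes ZIL commit counters you can watch under load before buying anything:

```
# if these counters barely move while the pool is busy, a SLOG won't help
cat /proc/spl/kstat/zfs/zil
```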

The special vdev generally only becomes significantly useful on an extremely fragmented pool, in my experience.

TL;DR: you may want to consider just building your server with two pools, a fast SSD pool and a slow rust pool. I’d advise mirrors in both cases. The big exception is if you decide to go with sync NFS for your transport, since that will make everything sync.
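
A minimal sketch of that two-pool layout, with hypothetical device names:

```
# fast pool: three SSD mirrors
zpool create fast mirror ssd0 ssd1 mirror ssd2 ssd3 mirror ssd4 ssd5

# slow pool: six HDD mirrors
zpool create slow \
  mirror hdd0 hdd1 mirror hdd2 hdd3 mirror hdd4  hdd5 \
  mirror hdd6 hdd7 mirror hdd8 hdd9 mirror hdd10 hdd11
```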

I’d advise actually testing both NFS targeting qcow2 (or raw) files, and iSCSI targeting zvols. Generally speaking, iSCSI is a higher-performance protocol, IMO… but zvols tend to underperform on the storage side, so I’m not sure which will wind up giving you the best overall results.
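
To sketch what the two candidates look like on the ZFS side (dataset, zvol, and subnet names are all hypothetical; the iSCSI target itself is whatever you already use, e.g. LIO):

```
# NFS transport: a dataset full of qcow2/raw files, exported to the hypervisors
zfs create -o recordsize=64K tank/vmstore
zfs set sharenfs="rw=@10.0.0.0/24" tank/vmstore

# iSCSI transport: one zvol per virtual disk, handed to your iSCSI target
zfs create -V 100G -o volblocksize=64K tank/vm101-disk0
```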

Recordsize / volblocksize settings will be crucial to getting the performance you expect; 64K is usually a good generic default. If that isn’t delivering, you need to start thinking about your actual storage workload: database engines will want blocksize reduced to their page size (or possibly double their page size), while bulk file storage with no random access inside those files will want blocksize ENLARGED significantly.
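
One way to sanity-check a guess, for what it’s worth: zpool iostat can show a histogram of the request sizes the pool is actually seeing (pool name hypothetical):

```
# request-size histograms, sampled every 5 seconds
zpool iostat -r tank 5
```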

For a regular filesystem, I advise recordsize=1M for datasets containing very large files. That works because a ZFS filesystem dataset can and will have a mixture of block sizes inside it: your metadata will still be in 4KiB blocks while your large files can be in, for example, 1MiB blocks if that’s what you set recordsize to. But when you’re dealing with VMs, whether qcow2, raw, or zvol, you can no longer mix and match blocksizes (the entire VM sits inside a single file or volume with a fixed blocksize), so you want to be more cautious. IMO and IME, 256K or 512K is about as large as you want to go for a VM store, and you only want to go that large for a virtual drive that will only be serving large files, generally in their entirety (not random access with small blocks INSIDE those large files).

Also remember that while you cannot have a mix of blocksizes inside the same zvol, qcow2, or raw file, there’s nothing stopping you from giving a VM multiple virtual drives, each with a different blocksize. A webserver that both serves “Linux ISOs” and has a large MySQL InnoDB database, for example, might have two drives: one with recordsize/volblocksize of 16K, for the MySQL database store, and another with recordsize/volblocksize of 512K, for the “Linux ISOs” it serves.
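
As a concrete sketch of that webserver, qcow2-on-datasets style (dataset names hypothetical; with zvols you’d set volblocksize at creation instead):

```
# one dataset per virtual drive, each tuned to its workload
zfs create -o recordsize=16K  tank/vmstore/web-db    # MySQL InnoDB store
zfs create -o recordsize=512K tank/vmstore/web-isos  # large sequential files
```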

Finally, remember that you don’t have to do every last thing I talked about here. I’m giving you bread crumbs you can follow if you need to, not a set of mandated directions you must follow no matter what! :slight_smile:

My only thought is that a 2TB HDD is going to use about the same power as a 10 or 20TB HDD, and the power draw and heat from all of those somewhat small drives is going to add up.

I’ve come to the point of wanting to implement NAS-backed VM storage, and wound up at this thread. :wink:

I had a question about part of your reply:

SLOG will only benefit you if you have a heavily synchronous write workload (for example, if you use sync NFS for your storage transport, or if you’re running a lot of database engines in your VMs). Most workloads are not particularly synchronous, so I wouldn’t implement this blindly.

NFS/QCOW2 seems to be the most recommended way to do shared storage for VMs in Proxmox in particular, if only because while iSCSI is theoretically faster, it’s a lot trickier to implement and maintain on the Proxmox side. So, all the tutorials and best practices guides really push NFS. NFS does not like sync writes on slow storage.
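
(For context, the sync behavior comes from the export itself; a hypothetical /etc/exports line on the storage server:)

```
# "sync" makes the server commit every write before acknowledging it,
# which is exactly the workload a SLOG exists to absorb
/tank/vmstore 10.0.0.0/24(rw,sync,no_subtree_check,no_root_squash)
```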

That got me thinking about my setup. I got a couple of shockingly good deals, so I’m going to have a single mirror of 4TB NVMe drives (on slots downgraded to PCIe 3.0 x2) for my VM and database storage. I could theoretically add a second NVMe mirror (running at PCIe 3.0 x2, again, because this motherboard is Like That®) as a slog for the databases and VMs, but as those drives should all be about the same speed, it’s unclear to me that a setup like that would be of any real benefit.

All the examples I’ve seen assume the slog vdev is considerably faster than the main storage vdev it’s associated with.

My instinct tells me that offloading log activity to a separate vdev would free up the data vdev to do more data vdev things uninterrupted, but I’m not sure if that’s worth burning a PCIe slot that could do something else.

I’m going to benchmark it both ways, but I was curious if this was a common enough question that there’s a generally known correct answer. :slight_smile:
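
(My benchmark plan is something like this fio run, which mimics the sync random writes a slog actually absorbs; the directory is hypothetical:)

```
# sync random writes against the pool, with and without the log vdev
fio --name=synctest --directory=/tank/fiotest --ioengine=psync \
    --rw=randwrite --bs=16k --size=4g --runtime=60 --time_based --fsync=1
```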

Only really worth it if your LOG vdev’s drives have power loss protection, and your storage vdevs’ SSDs do not. Otherwise (at this scale and topology) it’s a waste.

If you were rocking a wide RAIDz vdev, the answer might be a bit more nuanced. But not when you’re already running mirrors in the pool.
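
That said, it’s cheap to test for yourself: log vdevs can be added and removed live, without rebuilding the pool (pool and device names hypothetical):

```
zpool add tank log mirror nvme2n1 nvme3n1
zpool remove tank mirror-1   # use the vdev name shown by 'zpool status'
```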

Thanks! :slight_smile:

The NVMe drives I’d be using for the slog are:

```
Model Number:                       Seagate IronWolf510 ZP480NM30001-2S9301
```

Specs: https://www.seagate.com/www-content/datasheets/pdfs/ironwolf-510-ssd-DS2032-1-2002US-en_US.pdf
The specs don’t mention anything about power loss protection; these are targeted at NAS cache use cases, I think. They’re not enterprise disks.

They’re also the 480 GB model, so they do 2450/290 MB/s sequential read/write at 128KB QD32. Not really sure how to contextualize that or the IOPS; I mostly got them because they were new in box on eBay at fire-sale prices and had really good endurance for the QNAP read/write cache I was doing then.

The “performance specs” on datasheets are essentially worthless. They’re the largest numbers that can be coaxed out of the drive under an idealized, unrealistic workload: bigger numbers here do not necessarily correspond to bigger numbers on realistic workloads.

In terms of being useful as a LOG attached to a pool of solid state mirrors, the only thing that matters is power loss protection: and you’re right, these don’t have it.

I’d find a different use for them. Usually not too hard to find a good home for an SSD or two IME. :slight_smile:

I am shocked, just absolutely shocked, to hear manufacturers doctor their spec sheets. :wink: (Hopefully it’s not as bad for actual enterprise-targeted products?)

In a way, I’m glad to know a slog would be useless in this box. I’ve only got one PCIe slot, and wanted to add a 10 GbE NIC. Now I can be confident I haven’t made a mistake doing that. :slight_smile:

IME, enterprise-targeted storage solutions game metrics into utter uselessness just the same as consumer-targeted varieties do.

The more enterprise-targeted the product, the greater NUMBER of metrics you’re likely to see, mind you. But what you’re seeing from this IronWolf datasheet is about as good as it gets, even in the enterprise space.

If you have no soul, it seems to be the winning late stage capitalism strategy: there are more people who don’t understand the metrics than do, so you tailor your presentation to the ones who don’t know anything, and the ones who do know what they’re doing are going to test for themselves anyway, so why make it easier for them?

Well, this is entirely depressing and I agree completely, so I’m just going to proceed with drinking chocolate tea and tinkering with my Minecraft server. :wink:

Younger me would be so disappointed, but older me needs a nap, to be honest.

(In all seriousness, it’s disappointing to know even enterprise specs are garbage, but at least we’ve got good outlets out there doing reliable benchmarking tests.)

I think the least depressing accurate take is that there are still people who want to actually help. They just aren’t always directly employed by the companies you wish would employ them!

Assuming a general-purpose mixed-use VM workload, I’ve found L2ARC with l2arc_mfuonly=1 to be hugely beneficial, at least in my lab. I shut stuff down quite a bit, and the L2ARC persistence helps.

When I fire up these VMs I get a nice fat string of L2ARC cache hits while they come to life.

I reckon for someone who keeps it all running 24/7 it’d be less helpful…
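
In case it’s useful, my setup boils down to this (cache device name hypothetical; L2ARC persistence itself, l2arc_rebuild_enabled, is already on by default in OpenZFS 2.0+):

```
# add the cache device
zpool add tank cache nvme0n1

# restrict L2ARC to MFU data, at runtime...
echo 1 > /sys/module/zfs/parameters/l2arc_mfuonly

# ...and persistently, via /etc/modprobe.d/zfs.conf:
#   options zfs l2arc_mfuonly=1
```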