Recommendations for ZFS as NAS for Proxmox cluster

Hi, I plan to deploy an Ubuntu 24.04 server with 6x1TB SAS SSDs and 12x2TB HDDs as a dedicated storage server for 3 or 4 other servers running a Proxmox cluster.
The server has 2x12-core CPUs and 192GB DDR4 RAM and will only be used as a storage server.
I want to store the VM disks on this server as shared storage for the cluster, so I can move VMs between the nodes.
The storage server will be connected with 2x10G, and each Proxmox server will have a dedicated 10G NIC for storage-only access.

What is the best current practice for a setup like this?
qcow2 over NFS, or zvols over iSCSI?

What settings should I use for ZFS to get good performance and other tuning tips?
I will probably make a RAID10 of all the HDDs (stripe of mirrors) and use the SSDs as L2ARC, SLOG, and maybe a special vdev if that makes sense (rough sketch below).
What other tunables should I change? Recordsize, and what else?
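Roughly what I have in mind, sketched with placeholder device names (on the real box I'd use /dev/disk/by-id paths):

```
# stripe of six mirrors across the 12x2TB HDDs ("RAID10")
zpool create tank \
  mirror hdd0 hdd1 mirror hdd2 hdd3 mirror hdd4 hdd5 \
  mirror hdd6 hdd7 mirror hdd8 hdd9 mirror hdd10 hdd11

# SSDs split between a mirrored SLOG, L2ARC, and a mirrored special vdev
zpool add tank log mirror ssd0 ssd1
zpool add tank cache ssd2 ssd3
zpool add tank special mirror ssd4 ssd5
```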

Thanks for any input!
/Markus

You probably won’t benefit noticeably from L2ARC.
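With 192GB of RAM your ARC hit rate is likely already high; if you want to confirm that before dedicating SSDs to L2ARC, something like this will show it (a quick sketch; output sections vary a bit between OpenZFS versions):

```
# overall ARC size and hit/miss ratios
arc_summary | less

# live view: reads, misses, hit percentage and ARC size, every 5 seconds
arcstat 5
```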

SLOG will only benefit you if you have a heavily synchronous write workload (for example, if you use sync NFS for your storage transport, or if you’re running a lot of database engines in your VMs). Most workloads are not particularly synchronous, so I wouldn’t implement this blindly.
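If you want to see how synchronous your workload actually is before committing disks to a SLOG, zpool iostat can break writes out by sync vs. async (a sketch; "tank" and the device paths are placeholders):

```
# request-size histograms, split into sync vs. async reads and writes
zpool iostat -r tank 5

# if it turns out you do need one, a mirrored SLOG can be added after the fact
zpool add tank log mirror /dev/disk/by-id/ssd-A /dev/disk/by-id/ssd-B
```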

The special vdev generally only becomes significantly useful on an extremely fragmented pool, in my experience.

TL;DR: you may want to consider just building your server with two pools: a fast SSD pool and a slow rust pool. I'd advise mirrors in both cases. The big exception is if you decide to go with sync NFS for your transport, since that will make everything sync.
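To sketch what I mean by the two-pool layout (device names are placeholders again, and the pool names are just examples):

```
# fast pool: three mirrored pairs of the 1TB SSDs
zpool create ssdpool \
  mirror ssd0 ssd1 mirror ssd2 ssd3 mirror ssd4 ssd5

# slow pool: six mirrored pairs of the 2TB HDDs
zpool create rustpool \
  mirror hdd0 hdd1 mirror hdd2 hdd3 mirror hdd4 hdd5 \
  mirror hdd6 hdd7 mirror hdd8 hdd9 mirror hdd10 hdd11
```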

I'd advise actually testing both NFS targeting qcow2 (or raw) files and iSCSI targeting zvols. Generally speaking, iSCSI is a higher-performance protocol, IMO… but zvols tend to underperform on the storage side, so I'm not sure which will wind up giving you the best overall results.
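When you do that testing, run the same fio job from inside a guest on each transport so the comparison is apples to apples; something like this as a starting point (the path and numbers are just a rough stand-in for a mixed VM workload):

```
# 70/30 random read/write at 4K against a test file on the virtual disk,
# roughly approximating a busy general-purpose VM
fio --name=vmtest --filename=/mnt/test/fio.dat --size=8G \
    --rw=randrw --rwmixread=70 --bs=4k --ioengine=libaio \
    --iodepth=16 --direct=1 --runtime=120 --time_based --group_reporting
```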

Recordsize / volblocksize settings will be crucial to getting the performance you expect. 64K is usually a good generic default. If that doesn't deliver what you need, you'll have to start thinking about your actual storage workload: database engines want blocksize reduced to their page size (or possibly double their page size), while bulk file storage with no random access inside those files wants blocksize ENLARGED significantly.
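Concretely, that's one property on the dataset side and one on the zvol side (dataset and zvol names here are made up):

```
# NFS route: qcow2/raw files live in a filesystem dataset; recordsize can be
# changed later, but only affects newly written blocks
zfs create -o recordsize=64K tank/vmstore

# iSCSI route: one zvol per virtual disk; volblocksize is fixed at creation time
zfs create -o volblocksize=64K -V 100G tank/vm101-disk0
```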

For a regular filesystem, I advise recordsize=1M for datasets containing very large files. A ZFS filesystem dataset can and will have a mixture of block sizes inside it, so your metadata will still be in 4KiB blocks while your large files can be in, for example, 1MiB blocks if that's what you set recordsize to. When you're dealing with VMs, though, whether qcow2, raw, or zvol, you can no longer mix and match blocksizes (the entire VM sits inside a single file or volume with a fixed blocksize), so you want to be more cautious. IMO and IME, 256K or 512K is about as large as you want to go for a VM store, and you only want to go that large for a virtual drive that will be serving large files, generally in their entirety (not random access to small blocks INSIDE those large files).

Also remember that while you cannot mix blocksizes inside the same zvol, qcow2, or raw file, there's nothing stopping you from giving a VM multiple virtual drives, each with a different blocksize. A webserver that both serves "Linux ISOs" and hosts a large MySQL InnoDB database, for example, might have two drives: one with recordsize/volblocksize of 16K for the MySQL database store, and another with recordsize/volblocksize of 512K for the "Linux ISOs" it serves.
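With zvols, that webserver example might look like this (names and sizes are made up; note that volblocksize above 128K requires the large_blocks pool feature, which is enabled by default on newly created pools):

```
# "Linux ISO" disk: big blocks for large files that are mostly read whole
zfs create -o volblocksize=512K -V 500G tank/web01-isos

# MySQL disk: matched to InnoDB's 16K page size
zfs create -o volblocksize=16K -V 100G tank/web01-mysql
```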

Finally, remember that you don’t have to do every last thing I talked about here. I’m giving you bread crumbs you can follow if you need to, not a set of mandated directions you must follow no matter what! :slight_smile:

My only thought is that a 1 or 2TB HDD is going to use about the same power as a 10 or 20TB HDD, and the power usage and heat from a dozen of those relatively small drives will add up.
