Proxmox - ZFS - NFS - SMB hyperconvergence?

Initial upgrade plans: Separate fileserver and Hypervisor(s)

So I got the final bits and bobs for my fileserver upgrade yesterday. As you may recall from my previous post in the TrueNAS category, I was planning to keep the file server and hypervisor separate, as they are now. But while waiting I kept mulling over the hyperconvergence Jim Salter mentioned in my original post about upgrading my TrueNAS machine.

So, new plan: install Proxmox, set up a TrueNAS VM and pass the HBA through to it. With everything having arrived, I had a stab at it yesterday and it failed. I know - I should have waited until I had more time - but I am going to London for the next few days and just wanted to sort it out beforehand. Now everything is on hold (and the server is down :-/) until I return on Sunday.

But! I would like to pick a few of your brains in the meantime. The plan I would like to explore a bit more is to ditch the TrueNAS VM, work directly off the Proxmox ZFS pool(s) and export them via NFS and SMB. The main reasons for abandoning the TrueNAS VM are the "split" of the limited available memory and the intricacy of the mutual dependencies between Proxmox, its VMs and containers, and the TrueNAS VM.

Hardware

So the Hardware is as follows:

  • ASUS Z97-K Motherboard
  • Intel i5 4440
  • 24 GB RAM (2 x 8 + 2 x 4)
  • Boot drive: 256 GB NVMe
  • SAS 3008 HBA
  • Older Nvidia Quadro GfX card
  • a number of drives; mainly an exported ZFS mirror pool fatagnus consisting of 2 x 4 TB WD Red + 2 x 3 TB Toshiba N300

Boot pool

What is the main difference between an LVM boot pool and a ZFS boot pool? Does it make a difference that it is on an NVMe drive? I think it might be easier to back up a ZFS pool, but I really don't know.

Storage pool

The plan is to transition the storage pool from the mentioned 2 + 2 mirror to a 6-disk RAIDZ2. I have to get there in a few steps:

  1. 2+2 mirror to a mirror with an 8 TB and a 4 TB drive. I am going to do that using rsync as I want to reconfigure the datasets. These are the most important data - the less important (mainly media I can recreate) have been offloaded to single HDDs.
  2. Destroy pool fatagnus and recreate it as a 6-disk RAIDZ2 using an additional 1 TB and 2 TB drive I have. Do a ZFS send/receive from the temporary mirror (see the sketch after this list).
  3. Destroy the temporary mirror pool and replace the 1 TB and 2 TB drives with the 4 TB and 8 TB respectively - one at a time.
  4. Set up an SSD mirror for VM/Container storage.
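
Roughly, I expect steps 2 and 3 to look something like this on the CLI (the temporary pool name and device paths are placeholders, not my actual disks):

# Recreate fatagnus as a 6-disk RAIDZ2, then replicate everything from the temporary mirror
> zpool create -o ashift=12 fatagnus raidz2 /dev/disk/by-id/disk{1..6}
> zfs snapshot -r temppool@migrate
> zfs send -R temppool@migrate | zfs receive -Fdu fatagnus

# Then free up the big drives and swap them in for the small ones, one at a time
> zpool destroy temppool
> zpool replace fatagnus /dev/disk/by-id/disk-1tb /dev/disk/by-id/disk-4tb
> zpool replace fatagnus /dev/disk/by-id/disk-2tb /dev/disk/by-id/disk-8tb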

Question:
What difference does an ashift larger than 12 make? Is it necessary? What are the costs?

Question:
I have a mix of NAS and consumer drives - and I might buy refurbished enterprise drives in the future. Is it possible to upgrade the firmware on a drive in order to make it present a 4k block size rather than 512-byte virtual sectors? And does it matter?

Question:
I intend to (continue to) set up the storage pool(s) with dedup. I remember reading that the memory requirement for that is about 1 GB per 1 TB of storage. Does a larger ashift lessen this? I take it that it might also lessen the deduplication ratio?

As adding a metadata vdev incurs more risk (if you lose the vdev, you lose the pool), it is recommended to have as much fault tolerance in the metadata vdev as in the storage pool itself. So for my RAIDZ2 setup that would be a 3-disk mirror.
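
If I go that route, I gather the commands would be something along these lines (device paths are placeholders, and the small-blocks setting is optional):

# Add a 3-way mirrored special vdev for metadata
> zpool add fatagnus special mirror /dev/disk/by-id/ssd-a /dev/disk/by-id/ssd-b /dev/disk/by-id/ssd-c
# Optionally also steer small blocks onto the SSDs
> zfs set special_small_blocks=32K fatagnus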

Question: How does an SSD-based mirrored metadata vdev fail if it fails due to write wear? Would ZFS recognize that the existing metadata and small blocks are still readable, just not updateable, and then put new metadata and small blocks on the main vdevs until the special vdev is resilvered?

As I no longer have the TrueNAS UI to set up shares, is there anything I need to pay special attention to in this regard? Are there any quirks in the way Proxmox handles ZFS that might override my efforts?

So far I have been using container storage over NFS, partly because storage was on one box and the hypervisor on another. As I have other hypervisors, this gives me the option of migrating containers from one box to another.

Question:
Is it possible to run VM storage over NFS for Linux VMs? For Windows VMs?

I hope all these questions are not too overwhelming. You are more than welcome to answer just one or two of them - thank you in advance.

That's a lot of territory. Only one part I feel confident in answering:

What difference does an ashift larger than 12 make? Is it necessary? What are the costs?

Some SSDs use an 8k internal page size but they present themselves to the host as the usual 512e/4kn logical/physical. A 512b or 4k blocksize takes a performance hit on these drives and accelerates flash wearout due to write amplification (excessive read/modify/write activity).

I don't know how many of these drives actually exist. I'm hearing some of the latest SSDs (and QLC drives in particular) may use an even larger internal page size.

My personal preference is ashift=12 at a bare minimum, regardless of HDD/SSD. I benchmark new SSDs to check performance with 13 and 14. I've yet to see any improvement, but one might come along someday...
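
A comparison doesn't need to be fancy - a throwaway pool per ashift value and a quick fio run is enough to get a feel for it (the device path and fio parameters below are placeholders, not a tuned benchmark):

# Create a test pool with the ashift under test, hammer it with fio, then tear it down
> zpool create -o ashift=13 testpool /dev/disk/by-id/ssd-under-test
> fio --name=randwrite --directory=/testpool --rw=randwrite --bs=4k --size=4G --runtime=60 --time_based --group_reporting
> zpool destroy testpool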

As for the downside, I believe one ashift-sized sector is the smallest unit of compression granularity. So a 15k file, for instance, that compresses to 12k would consume three ashift=12 sectors (12k) but two ashift=13 sectors (16k, hence zero effective compression). Hope that made sense.

This is a good plan, and it is how I run my storage at home. Because Proxmox is Debian underneath, all the SMB and NFS support is there, just an apt install away. For instructions on configuring SMB and NFS, you can search for something like "debian how to set up NFS". I don't have links to any specific guides, but there are many online. The only downside is that you have to use the CLI and config files; there's no nice GUI for configuring shares.
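
As a rough sketch of what that ends up looking like (the dataset path, subnet and user below are placeholders):

# Install the NFS and SMB servers
> apt install nfs-kernel-server samba
# NFS: one line per export in /etc/exports, then reload the export table
/fatagnus/media  192.168.1.0/24(rw,sync,no_subtree_check)
> exportfs -ra
# SMB: add a share section to /etc/samba/smb.conf, then restart the service
[media]
    path = /fatagnus/media
    read only = no
    valid users = youruser
> systemctl restart smbd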

ashift sets the sector size ZFS assumes for a disk, as a power of two. So if you set ashift to 9 it treats the disk as having 512-byte sectors; if you set it to 12 it treats the disk as having 4096-byte sectors. As @adaptive_chance mentioned, if your ashift is too low you will get slow writes and potentially wear an SSD faster than needed. If your ashift is too high you will use more disk space than you need, as the smallest written block is one ashift-sized sector, so a 2K file will use 8K of disk space if you set ashift to, say, 13.
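
If you want to see what your drives actually report, something like this works (the drive name is a placeholder; smartctl comes from smartmontools):

> lsblk -o NAME,PHY-SEC,LOG-SEC
> smartctl -i /dev/sda | grep -i 'sector size'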

The setting you want to use to reduce the dedup table (DDT) size is recordsize. This property is the maximum size of a record that ZFS will write. A record may span multiple sectors, but it only gets one entry in the DDT.

If you have mostly large files, setting recordsize to 1M will likely get you some good savings. A good writeup on recordsize is About ZFS recordsize - JRS Systems: the blog
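
Setting it is a one-liner per dataset, but note it only affects newly written data (the dataset name here is just an example):

> zfs set recordsize=1M fatagnus/media
> zfs get recordsize fatagnus/media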

As an example, one of my file servers is mostly used to store family photos and videos, and the family has a habit of copying these images to multiple places on the server, so I get a decent saving by having dedup on. Because images are mostly bigger than 1M in size, they will usually use a few 1M records, then a few smaller ones. When copied, they are copied whole so still use large 1M records mostly. My pool is:

> zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
poolb  2.72T  1.22T  1.50T        -         -     7%    44%  1.52x    ONLINE  -

And my dedup table looks like this:

> zdb -S poolb
Simulated DDT histogram:

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    1.04M    945G    894G    894G    1.04M    945G    894G    894G
     2     355K    316G    276G    276G     776K    689G    598G    598G
     4    95.8K   74.4G   66.0G   66.0G     468K    355G    313G    313G
     8    13.0K   9.33G   8.00G   8.01G     126K   88.9G   76.4G   76.5G
    16    2.35K    611M    472M    474M    45.5K   11.7G   8.91G   8.95G
    32      161   50.3M   36.4M   36.5M    6.62K   2.17G   1.55G   1.55G
    64       25   5.36M   2.73M   2.77M    1.97K    369M    191M    194M
   128        4    988K    178K    188K      725    145M   26.4M   28.3M
   256        6   2.02M   2.02M   2.03M    2.15K    663M    663M    666M
   512        1    512B    512B      4K      526    263K    263K   2.05M
    1K        2      1K      1K      8K    2.10K   1.05M   1.05M   8.38M
    2K        1    512B    512B      4K    2.89K   1.44M   1.44M   11.6M
 Total    1.49M   1.31T   1.22T   1.22T    2.44M   2.04T   1.85T   1.85T

dedup = 1.52, compress = 1.11, copies = 1.00, dedup * compress / copies = 1.68

So in total I have 1.49M records in the DDT. Total memory usage for this (according to How To Size Main Memory for ZFS Deduplication, at roughly 320 bytes per in-core entry) should be 1.49M x 320 bytes ≈ 455 MB of memory.

Be aware that if your data is mostly small blocks, you may have much higher memory usage for your DDT. If my 1.22T of data were exclusively 8K records, I calculate I'd have 163M records, meaning close to 50G of memory for the DDT. This page Deduplication | ZFS Handbook recommends 5G of RAM per 1TB of deduped data.
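
That back-of-the-envelope estimate is just data size divided by record size, times roughly 320 bytes per in-core DDT entry:

# 1.22 TiB of nothing but 8K records, at ~320 bytes per DDT entry
> echo "1.22 * 2^40 / 2^13 * 320 / 2^30" | bc -l   # ~48.8, i.e. close to 50 GiB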

Also, it's worth being aware that even if you can keep the whole DDT in memory, you will still see a write speed penalty from having dedup on. This is because for each record written, ZFS also needs to write an entry in the DDT. For my use case of family photos this is not a problem, but I could see it being annoying when running a VM on a dedup'd volume.

It's also possible to set dedup only on the datasets holding data you expect to get dedup savings on.
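
For example (the dataset name is just illustrative):

> zfs set dedup=on fatagnus/photos
> zfs get -r dedup fatagnus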

I see a lot of folks installing Proxmox, then trying to pass through an HBA to a virtual NAS. Playing devil's advocate for a minute... why?

It's much simpler to install something like UnRAID, which handles the NAS functionality natively and provides a nice clean management interface for VMs (Linux KVM) and containers (Docker). Open Media Vault does the same, but UnRAID's UI, IMHO, seems a bit more polished/intuitive. TrueNAS Scale is supposed to provide the same functionality, with KVM and Docker management added in, but those features are currently in beta.

I have been using Proxmox for VMs and LXC Containers and TrueNAS for storage on separate boxes.

I have no experience with UnRAID, but I do know that it is a paid-for product rather than open source with a paid support option, like both Proxmox and TrueNAS.
I started out with what is now TrueNAS Core, which is BSD-based. Now Core is deprecated and Scale - which is based on Linux - is the new default.

Maybe Docker would be better for me; I do not really know yet, but I will explore that in the future. For now I will be digging into the CLI documentation for ZFS + Proxmox + NFS + SMB and setting up my hyperconverged server. Some of that research will include backup of containers and VMs as well as storage - my first stop will be sanoid and syncoid, but I am also going to look at Proxmox Backup Server and Ceph.

Thank you very much for clearing this up! :slight_smile:

I will have to look into this. Having multiple similar VMs would benefit from dedup, i.e. with multiple Windows VMs the 'base' Windows files would only take up space once.

I too have considered, and still do consider even after paying for an UnRAID license, going the CLI route on top of Debian 12. I used to run KVM and Docker on a previous box and it served me well for many years. Then, and I promise I'm not trying to be an evangelist, I found out about UnRAID. I learned to like that it does not require sacrificing an internal drive for the base install, as it boots from a thumb drive and runs in RAM. It's really easy to back up your config, including container configurations, which makes DR very simple. I was really turned off by the price, but a month in, I have no regrets.

FWIW - I'm not suggesting "my way is better than yours", just trying to pass along something I recently found out myself while looking for a simple hyperconverged solution. Also, trying to see if I missed something regarding the Proxmox/virtual NAS route.

Best of luck.

Not in any way taken as such! :slight_smile:

I have not looked into UnRAID for two reasons - one was the price, but the main one was that it did not offer ZFS, which has now changed.

Proxmox uses LXC containers rather than Docker, which means you can select whether a container runs privileged or not. I prefer the latter, but I have had to make my Jellyfin container run privileged as it needs to mount the media storage over NFS.

I decided to keep this approach even though it adds overhead, as it makes it possible for me to migrate the container off the Proxmox host holding the media storage - I lose the hyperconvergence, but keep the ease of migrating the container.

I've been able to make Jellyfin run in an unprivileged container under LXC by passing through the storage from the host - see:
https://jaytuckey.name/2024/05/09/notes-on-setting-up-jellyfin-in-lxc-container-on-proxmox/

The key command under Proxmox is:

root@smoltuck #/mediacontent[2024-05-11 10:51]
> pct set 100 -mp0 /mediacontent,mp=/mediacontent

Not sure if this would work for your use case. Perhaps you could mount the NFS on the host and pass it through?
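
Untested for your setup, but the idea would be roughly this (the hostname, paths and container ID are placeholders):

# Mount the NFS export on the Proxmox host, then bind-mount it into container 100
> mkdir -p /mnt/media
> mount -t nfs fileserver:/fatagnus/media /mnt/media
> pct set 100 -mp0 /mnt/media,mp=/mediacontent
# For persistence you would also want the NFS mount in /etc/fstab or a systemd mount unit on the host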

I will look into that as well; it is a viable option. The reasons - so far - that I have kept away from it are 1) the storage and the container were on two different boxes, and 2) even on a hyperconverged box I would like to be able to migrate the container to a different Proxmox host. Using NFS (or another network filesystem) allows that.

I am going to plan a little better for this attempt, but as my current fileserver is down, there is also some urgency :slight_smile:

I was also running a Nextcloud server that I need to get back up. I am probably going to restore it first and then look into splitting it, hosting the Nextcloud OS + application on the SSD-based pool and putting the storage part on the spinning-rust pool.

Something I didn't see mentioned while skimming, but I'll recommend it: limit your ZFS ARC to a maximum so it doesn't take too much RAM away from your actual VMs and such.

Steps-

  • vim /etc/modprobe.d/zfs.conf
  • Formula is $sizeInGB * 1024 * 1024 * 1024 to get bytes
options zfs zfs_arc_min=536870912
options zfs zfs_arc_max=1073741824
  • Above sets 512MB min, 1GB max (desktop)
  • Adjust to values appropriate for Proxmox and how much total RAM you have
  • To enforce, regenerate initramfs
    • Fedora- dracut -fv --kver $(uname -r)
    • Ubuntu/Debian- update-initramfs -u -k all

Double-check those regeneration commands; they are just from old notes of mine.
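
One way to confirm the limits took effect after the reboot (these are the standard OpenZFS module parameters; arc_summary ships with the ZFS userland tools):

# Should print the max you configured (1073741824 in the example above)
> cat /sys/module/zfs/parameters/zfs_arc_max
> arc_summary | grep -A3 "ARC size"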