Three-site Proxmox + ZFS (15×16 TB/site, special vdev, PBS, encrypted off-site) — layout sanity check & best-practice questions

Hi everyone!
Three friends are building mirrored homelabs and backing up to each other. I’d love a sanity check and your best practices. I’m a tinkerer, not an expert—please be candid (flame me if you must, but be kind :sweat_smile:).

Priorities & workloads

  • Priority: Resilience > Performance > Capacity (target ≥80 TB usable on the main site after ~20% free space is kept).
  • Workloads (Proxmox 9):
    • Virtualized UTM (been virtualizing pfSense for ~10 years; OPNsense for ~3 years).
    • Proxmox Backup Server (PBS) on the same host (for VMs/LXCs/services; media/data handled separately via ZFS send/recv).
    • Media & general file storage, family photos.
    • Windows 11 “work PC” VM; SPICE (and NoVNC) are the only remotes that keep working while the guest is on corporate VPN (I believe Proxmox proxying helps).
    • Light databases for self-hosted services; some AI tinkering.
  • Compression: leaning lz4 unless you convince me otherwise (open to zstd guidance).

Hardware (my site; the other two are similar but smaller)

  • Chassis: Rosewill 4500 (15×3.5″), strong airflow.
  • CPU/RAM: AMD EPYC 7532, 512 GB ECC (8×64 GB—unlikely to grow; larger DIMMs are too pricey).
  • Motherboard: ASRock Rack ROMED8-2T/BCM; all PCIe slots support x4/x4/x4/x4 bifurcation.
  • HBA: Broadcom/LSI 9400-16i for the SATA Exos HDDs.
  • NIC: Mellanox ConnectX-4 (MCX4121A-ACAT) 25 GbE dual SFP28.
  • Spinning disks: 15 × 16 TB Seagate Exos (mix of batches, some recerts; all are 16 TB despite “Exos 16/18” naming).
  • NVMe (consumer): 4 × 4 TB FireCuda 530 + 2 × 512 GB FireCuda 530. No PLP; endurance: 4 TB = 5100 TBW, 512 GB = 640 TBW.
  • Other SSDs in the group: one site has Samsung 983 DCT M.2 960 GB (as I understand, with PLP); another site has unknown consumer Samsung “Pros.”
  • Power: my site = hours of UPS; the others ≈5–15 min.
  • LAN/WAN: my LAN mostly 1 GbE (UniFi Switch Pro 24 PoE with 2×10 GbE); WAN 1 Gbps symmetrical. Other sites ~50–150 Mbps down / ~25 Mbps up. For “fast restore,” plan is direct 25 GbE DAC between servers when physically co-located.

Pool design I’m considering (please critique!)

  • Main HDD pool: 7 × 2-way mirrors + 1 hot spare (14 data, 1 spare).
    • Aiming to clear ≥80 TB usable with ~20% free kept.
    • Mirrors for IOPS and fast rebuilds. I might later convert some to 3-way mirrors as larger disks arrive—but not now.
  • Special vdev (metadata/small files + DDT for “fast dedupe”): 4 × 4 TB FireCuda 530 arranged as two mirrored special vdevs (i.e., special class = stripe of two mirrors).
    • Tentative special_small_blocks = 32K or 64K to accelerate filesystems/VMs/service configs and small media metadata.
    • I know that losing the special vdev means losing the pool; hence mirrored special vdevs, good cooling, and frequent replication. (A zpool command sketch of this layout follows this list.)
  • Boot: 2 × 512 GB FireCuda 530 mirrored for Proxmox boot (may also hold ISOs/container images in a small dataset).
  • SLOG / L2ARC: No SLOG (I don’t expect sync-heavy exports) and no L2ARC (512 GB RAM + special vdev should suffice).
  • Datasets: Prefer many datasets to tune per use case (recordsize/volblocksize, atime, special_small_blocks, dedup).
  • Dedup: Not pool-wide. Considering it only where it truly pays (shared game libraries, VM/template libraries, maybe photo duplicates across project datasets). I’d love rules-of-thumb that fit 512 GB RAM and consumer NVMe special vdevs.
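
To make the layout concrete, here is roughly the pool creation I have in mind (a sketch only, nothing created yet; device names are placeholders and I'd use /dev/disk/by-id paths for real):

# 7 × 2-way mirrors of the 16 TB Exos drives, plus one on-pool hot spare
zpool create -o ashift=12 tank \
  mirror sda sdb  mirror sdc sdd  mirror sde sdf  mirror sdg sdh \
  mirror sdi sdj  mirror sdk sdl  mirror sdm sdn \
  spare sdo

# Special class = stripe of two NVMe mirrors (4 × 4 TB FireCuda 530);
# as I understand the syntax, the "special" keyword applies to both mirrors that follow.
zpool add tank special \
  mirror nvme0n1 nvme1n1 \
  mirror nvme2n1 nvme3n1

# Then opt individual datasets into small-block routing, e.g.:
zfs set special_small_blocks=64K tank/services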

Replication, encryption & PBS

  • Backups: local PBS first (for VMs/LXCs/services), then off-site. Media/data handled via ZFS send/recv.
  • Mesh: full mesh (A↔B, A↔C, B↔C).
  • Transport: first full transfers over LAN; ongoing over WireGuard/Tailscale/SSH.
  • Plain-English targets: keep data loss small; I can live with rebuild < ~1 week in a disaster.
  • Concern: my site will have a special vdev; the others likely won’t. Any caveats sending from special→non-special pools (e.g., where special-class blocks land on the target; performance side-effects)?
  • Encryption: I want encrypted off-site backups so each person’s data remains private. Views on dataset-level encryption + raw (zfs send -w) sends vs alternatives?

What I’m hoping you can help me decide (and why)

  1. Vdev layout for resilience-first (still need ≥80 TB usable):
    Would you stick with 7×2-way mirrors + spare for IOPS and rebuild behavior, or steer me to 3×(5-wide RAIDZ2) for capacity while remaining robust? Any best practices for spreading drive batches across vdevs and spare strategy (on-pool hot vs shelf)?
  2. Special vdev details (metadata + DDT on special):
    Is two 2-way mirrored special vdevs (striped) a sane failure domain, or is a 3-way special mirror worth the capacity hit for home-lab “prosumer” reliability?
    What real-world starting point for special_small_blocks would you use for mixed VMs/media/photos/services? (I’m torn between 32K and 64K; I want enough metadata/small-IO acceleration without dragging too many medium files onto the special class.)
    Any consumer NVMe pitfalls you’ve seen when used as special vdevs (endurance/thermal), given my FireCuda 530s (no PLP, but good TBW)?
  3. Dedup scope with DDT-on-special:
    If I limit dedup to a few datasets (shared game libraries, VM/template libraries, maybe photo duplicates), is this sane with 512 GB RAM and special vdev DDT placement? I’d love rules-of-thumb (e.g., DDT size per TB of logical data, RAM headroom) and how they would fit my profile.
  4. VM disks on ZFS (current best practice):
    In 2025 on Proxmox 9 + ZFS, would you put VM disks as sparse files on a dataset (qcow2 or raw) or on zvols?
  • If files: preferred dataset props (recordsize, atime, TRIM/discard) and relevant Proxmox settings.
  • If zvols: your go-to volblocksize, snapshot/replication considerations, and any current pitfalls.
  • Most importantly: what do I truly gain/lose in practice (performance, fragmentation, management) with sparse files vs zvols?
  5. Windows “Previous Versions” from inside the VM (no SMB if possible):
    I’d love for users to restore files via Explorer’s “Previous Versions” inside the VM without exposing host SMB shares. Is there any practical route (e.g., iSCSI/vdisk + host snapshots surfacing in-guest, or managing VSS inside the guest) that actually works well? Or is Samba vfs_shadow_copy2 effectively the only sane option if I want that experience? (If Samba is the answer, a sketch of the config I have in mind is at the end of this list.)
  6. PBS on ZFS (for VMs/LXCs/services only):
  • Dataset props you recommend (e.g., recordsize, compression choice, dedup=off?).
  • For cross-site: use today’s PBS push/sync features, or stick with zfs send/recv of PBS datastores?
  • A resilient but sane retention plan for homelab (I’m thinking: hourlies ~24–48 h, dailies ~14–30 d, weeklies ~8–12 w, monthlies ~6–12 m—feel free to edit).
  7. Tunables & sensible defaults (fit for my mix):
  • Compression: stay lz4, or move to zstd (which level) for VM/media mix?
  • atime mostly off? Any exceptions you’d keep on?
  • Defaults you like for xattr=sa, acltype=nfs4 (Samba), dnodesize=auto, redundant_metadata=most, logbias=throughput on media datasets, etc.
  8. Maintenance, health & monitoring (multi-site):
  • Even if “pre-burned,” what burn-in would you still run (SMART long, badblocks, fio patterns)?
  • Scrubs: monthly? More frequent at first?
  • SMART replacement rules: your thresholds (reallocs/pending/etc.) vs “run to failure.”
  • Observability: Would you run per-site Prometheus exporters with federation and a central read-only Grafana at my site, or something simpler? (I’m open—just want reliable alerts across all three sites.)
  9. Odds & ends / context worth judging me for :sweat_smile:
  • I rarely expose SMB/NFS/iSCSI (I prefer Nextcloud/Seafile/Syncthing/Resilio).
  • I know virtualized firewalls have caveats; I’ve run them carefully for a long time.
  • Consumer hardware is a compromise—I welcome reality checks specific to this design.
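
For question 5, if Samba does turn out to be the sane option, this is the vfs_shadow_copy2 pattern I have in mind (a sketch only; share name, path, and the snapshot-name format are placeholders I'd adapt to whatever snapshot naming we settle on):

# /etc/samba/smb.conf (fragment)
[winhomes]
    path = /tank/home/windows/users/%U
    read only = no
    vfs objects = shadow_copy2
    shadow:snapdir = .zfs/snapshot
    shadow:sort = desc
    shadow:localtime = yes
    # must match the snapshot names, e.g. auto-2025-01-31_12-00
    shadow:format = auto-%Y-%m-%d_%H-%M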

Hi :waving_hand:
So many great points raised—thanks for sharing! :sweat_smile:

I’d like to offer a few thoughts, especially around your backup strategy, which I find a bit unclear:

Backups: local PBS first (for VMs/LXCs/services), then off-site. Media/data handled via ZFS send/recv.

From what I gather, you’re using SSDs for root volumes and spinning disks for media/data. But splitting backup strategies like this—PBS for system volumes and ZFS send/recv for media—can lead to inconsistent backups, depending on your use case.

For example, imagine you’re running Nextcloud:

  • The database and app files might be on SSDs (backed up via PBS),
  • But the user media lives on HDDs (backed up via ZFS replication).
    You risk missing data or breaking consistency because the two snapshot sets are not taken at the same moment.

To avoid this, I’d suggest:

  • Either go full PBS
  • Or go full ZFS snapshots/replication for everything.
    Even better: do both, if feasible—PBS for fast VM restores, ZFS for full consistency (if you can afford the space).

Also, consider a third layer: application-level backups (e.g., Nextcloud’s built-in export, database dumps, etc.). These can help with granular restores and reduce vendor lock-in.


Another point that could use clarification:

Mesh: full mesh (A↔B, A↔C, B↔C)

Is your site the primary production for everyone, with the other two acting as DR/backup nodes?
Or does each person run their own stack and just use the others for replication?

This distinction matters for:

  • Resource planning
  • Security boundaries
  • Responsibility for uptime

On encryption:
If you’re using native ZFS encryption, be aware of a critical caveat:
You cannot safely use other people’s servers as disaster recovery targets unless you fully trust them with your encryption keys. ZFS native encryption protects data at rest, but once the keys are loaded on the remote system, any admin on that system can access your data.

To maintain privacy:

  • Use raw encrypted sends (zfs send -w) without transferring keys (minimal sketch after this list).
  • Keep keys offline on the DR site unless you’re physically present or have a secure key management setup.
  • Consider client-side encryption for truly sensitive data.
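
A minimal sketch of that raw-send flow (pool, dataset, and host names are placeholders); the target only ever stores ciphertext and never sees a key:

# Initial full raw send of an encrypted dataset to a friend's pool:
zfs snapshot -r tank/data@offsite-2025-01-01
zfs send -w -R tank/data@offsite-2025-01-01 \
  | ssh backup@siteB "zfs receive -u tankB/friends/siteA/data"

# Later incrementals stay raw as well:
zfs send -w -R -I @offsite-2025-01-01 tank/data@offsite-2025-02-01 \
  | ssh backup@siteB "zfs receive -u tankB/friends/siteA/data"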

And finally, a reality check on reliability:
Your setup is impressive, but keep in mind—your house isn’t a datacenter:

  • Power isn’t redundant,
  • Internet isn’t guaranteed,
  • And a 1-hour UPS is good! But it might not be enough during a long outage (e.g., while you’re away on vacation or travel).

If resilience is a top priority, consider:

  • Remote wake-on-LAN or out-of-band management (a PiKVM can be an affordable option),
  • Alerting systems that notify you of outages,
  • Automatic DR invocation,
  • And maybe even off-site cold storage for critical backups.

I hope this gives you some hints!

love your thoughtful reply—really appreciate you sharing your experience. replying inline for context (and inviting anyone else reading to chime in!). after that, I’ve included a proposed (not yet created!) dataset layout that reflects everything we discussed—purely to make it easier for you to react with quick nudges rather than heavy lifting. :folded_hands:


“Backups split between PBS and ZFS can get inconsistent.”

totally agree. i’m leaning toward consistency-first with two complementary layers:

  • VM/LXC layer with PBS for compact stacks (keep the whole app inside the VM/LXC so PBS captures it cleanly).
  • ZFS layer for spread-out or media-heavy apps (e.g., Nextcloud, Immich, Jellyfin): take coordinated ZFS snapshots across all related datasets using the same label, then replicate that label together.

timing & tooling — how we’re thinking about it (please nudge as you see fit):

  • snapshot/backup order: we’re inclined to run ZFS snapshots first, then PBS jobs right after, so timestamps line up naturally. if you’d flip that in practice, we’d love to hear why.
  • orchestration: we know Sanoid/Syncoid and zrepl are popular. if you prefer alternatives like znapzend, pyznap, zfsnap, or even a light cron + scripts approach (plus PBS’s own scheduler), we’d be grateful for your reasoning.
  • app quieting: for stacks with databases or busy writers, we plan to use pre/post hooks (e.g., put Nextcloud in maintenance mode, pause indexers, run safe DB dumps/flushes), snapshot the whole group at one label, then resume. we’re very open to your “works-every-time” pattern here. (a rough sketch of what we mean follows this list.)
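
to show what we mean by one label across the whole group, roughly this (sketch only; the container ID, the occ path, and the exact dataset names from the proposed tree below are placeholders):

#!/bin/bash
# quiesce Nextcloud, snapshot every related dataset under one label
# in a single zfs call (all snapshots land in the same transaction group), then resume
set -euo pipefail
LABEL="bk-$(date +%Y%m%d-%H%M)"

# maintenance mode inside the Nextcloud LXC (ID 101 is a placeholder)
pct exec 101 -- sudo -u www-data php /var/www/nextcloud/occ maintenance:mode --on

zfs snapshot \
  "tank/services/nextcloud/app@${LABEL}" \
  "tank/services/nextcloud/db@${LABEL}" \
  "tank/services/nextcloud/data@${LABEL}" \
  "tank/lxc/rootfs/nextcloud-app@${LABEL}"

pct exec 101 -- sudo -u www-data php /var/www/nextcloud/occ maintenance:mode --off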

“Either go full PBS or full ZFS… even better: do both.”

our current take on “both” (tell us if this matches your intent):

  • PBS for VMs/LXCs/services → then move PBS off-site (today’s PBS push/sync, or zfs send of the PBS datastore—you probably have a favorite in 2025).
  • Coordinated ZFS snapshots/replication for any stack that spans datasets and for bulk media.

if you would also keep a second, parallel ZFS replication of the VM/LXC storage datasets alongside PBS, could you share the pros/cons as you see them?

  • pros (what we imagine): independent recovery path if one method is down; easy browse of individual files; resilience if PBS metadata is unhappy; cross-tool flexibility.
  • cons: more space/bandwidth; higher operational complexity; retention has to be managed twice.
    if your pros outweigh the cons in real life (and it won’t eat our storage for breakfast), we’ll likely adopt it.

“Is your site the primary?”

mostly yes. i’m the “more services & capacity” site; each friend runs their own private stack (e.g., a Windows CAD VM + private data) and we all cross-replicate for DR.

hub-and-spoke vs full-mesh (how we picture it, so you can go straight to advice):

  • hub-and-spoke example: my site = hub with wider bandwidth/retention; the other two = spokes with lighter schedules. spokes → hub often; hub → spokes for the most critical subsets; optional spoke↔spoke for just a few datasets.
  • priorities/schedules:
    • Tier 1 (critical): configs, app databases, VM/LXC system disks → more frequent snaps/replica.
    • Tier 2 (important): photos, documents, code → moderate frequency.
    • Tier 3 (bulk): videos/music → daily/weekly, then monthly.
  • bandwidth shaping: off-hours windows at the spoke uplinks (~25 Mb/s), per-dataset schedules, and no “all-at-once” jobs. (an example sketch follows this list.)
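
to illustrate the shaping idea (sketch only; the schedule, rate, snapshot names, and paths are placeholders; pv -L caps the pipe so the ~25 Mb/s spoke uplink isn't saturated):

# off-hours cron entry on a spoke, e.g. /etc/cron.d/replicate-tier1:
#   30 2 * * *  root  /usr/local/sbin/push-tier1.sh

# push-tier1.sh (sketch): raw incremental send, capped at roughly 20 Mbit/s with pv
LAST="bk-20250101-0200"   # last snapshot already present on the hub (placeholder)
NEW="bk-20250201-0200"    # newest local snapshot (placeholder)
zfs send -w -I "@${LAST}" "tankB/services@${NEW}" \
  | pv -q -L 2500k \
  | ssh replicator@hub "zfs receive -u tank/friends/siteB/services"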

“Encryption on other people’s servers = mind the keys.”

we’re aligned. plan: dataset-level encryption at home and raw sends (zfs send -w) off-site, with no keys stored remotely (datasets readonly=on, canmount=noauto by default). keys only loaded for DR or test-restore. for very sensitive data, we can add client-side encryption before ZFS. if you prefer offline key media, Vault-style, or hardware tokens for homelab scale, we’re all ears.
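
to be concrete about “no keys resident” on the DR side (sketch only; dataset and key-file names are placeholders):

# on the DR target: received tenant trees stay read-only, unmounted, and keyless
zfs set readonly=on tank/friends/siteB        # readonly inherits to received children
zfs set canmount=noauto tank/friends/siteB    # (canmount doesn't inherit; received sets can't mount anyway without their keys)
zfs get -r keystatus,encryptionroot tank/friends/siteB   # should report "unavailable" day-to-day

# during a supervised DR/test-restore window only: load the key from removable media,
# mount, verify, then unload again
zfs load-key -L file:///mnt/usb-keys/siteB.key tank/friends/siteB/data
zfs mount tank/friends/siteB/data
# ...verify files...
zfs unmount tank/friends/siteB/data
zfs unload-key tank/friends/siteB/data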

“Home ≠ datacenter.”

yep. I have IPMI and was going to play with PiKVM; nobody else has any remote/IP-based KVM yet. UPS + NUT for graceful shutdowns, Prometheus/Alertmanager/Grafana + external uptime checks, and we’d also write up documented DR steps. honesty moment: test restores will likely be 1–2×/year (if we can rally quarterly later, great). we’d love a friendly pattern for non-disruptive tests (e.g., restore into an isolated “playground” subtree or a scratch host, validate, then tear down) so production stays untouched.
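
for the playground drill, we picture roughly this (sketch only; snapshot and dataset names are placeholders, and a scratch parent like tank/playground would exist beforehand):

# ZFS side: clone a snapshot under a throwaway subtree, verify, destroy
zfs clone tank/services/nextcloud/data@bk-20250101-0200 tank/playground/nextcloud-data
ls /tank/playground/nextcloud-data          # spot-check contents / run checksums
zfs destroy tank/playground/nextcloud-data

# PBS side: restore a backup to a fresh, unused VMID (GUI or qmrestore), keep its NIC
# on an isolated bridge, boot it, verify the app, then delete the VM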

(for shared language):

  • LTO = tape (Linear Tape-Open).
  • WORM cloud = object storage with immutability (“write once, read many”).
    if you’d pick one tiny off-site “cold copy” method for a homelab (LTO, a couple of shucked offline disks, or a small WORM-ish bucket), we’d love your pick—and any “verify day” you schedule to avoid quiet bit-rot.

quick ask up front — your feel for these layout dials

  • Vdev layout (need ≥ 80 TB usable while keeping ~20% free): leaning 7× 2-way mirrors + 1 hot spare for IOPS + quicker rebuilds… unless you’d nudge us to 3× (5-wide RAIDZ2). how would you spread mixed-batch drives across vdevs, and do you favor an on-pool hot spare or shelf spares?
  • Special vdev (metadata + small blocks + DDT; roughly 6.5 TB usable after mirroring/striping and a ~20% buffer): planning two 2-way mirrored special vdevs (striped). sane, or worth stepping up to 3-way on prosumer gear?
    special_small_blocks: we’re weighing 32 K, 64 K, or something you like better—the aim is to speed metadata/small-IO without dragging too much medium data into special.
  • Consumer NVMe as special: FireCuda 530s (no PLP; 4 TB @ 5100 TBW, 512 GB @ 640 TBW). with good cooling, would you green-light them as special, or have you seen endurance/thermal pitfalls?
  • Fast dedupe (DDT on special) scope: we plan to keep fast dedupe selective—shared game libraries, VM/templates, maybe some photo exports that appear in multiple places. with 512 GB RAM and DDT on special, does that feel safe? any rules-of-thumb (DDT size/TB, real-world RAM headroom)?
  • VM disks on ZFS (2025 best practice): would you place VM disks as sparse files on a dataset (qcow2 or raw) or use zvols?
    • if files: what recordsize (64 K or 128 K?), atime, TRIM/discard, and Proxmox toggles would you pick?
    • if zvols: your go-to volblocksize today, snapshot/replication behavior to watch, known pitfalls.
    • the real trade-offs you’re seeing now (perf, fragmentation, management) with files vs zvols would be super helpful.

proposed dataset hierarchy (idea only, not yet created)

tank/                                   # HDD pool (special vdev for metadata + small blocks)
├─ infra/
│  ├─ iso/                              # ISOs, cloud images (rare writes)
│  ├─ templates/                        # VM/LXC templates
│  └─ scripts/                          # helper scripts/Ansible/Tofu snippets (or keep in git)
│
├─ pbs/
│  ├─ datastore-main/                   # PBS datastore (namespaces per-tenant/site)
│  └─ datastore-alt/                    # optional 2nd datastore (different retention/testing)
│
├─ vm/                                   # choose ONE style below (files OR zvols)
│  ├─ files/
│  │  ├─ windows/
│  │  │  ├─ work-pc/
│  │  │  └─ cad-station/
│  │  └─ linux/
│  │     ├─ opnsense/                   # (note: OPNsense is FreeBSD-based)
│  │     ├─ home-assistant/
│  │     └─ kubernetes-node/
│  └─ zvols/
│     ├─ windows/
│     │  ├─ work-pc/
│     │  └─ cad-station/
│     └─ linux/
│        ├─ opnsense/
│        ├─ home-assistant/
│        └─ kubernetes-node/
│
├─ lxc/
│  ├─ rootfs/
│  │  ├─ nextcloud-app/
│  │  ├─ immich-app/
│  │  ├─ jellyfin-app/
│  │  └─ {all the other services as needed}/
│  └─ vol/                              # attached volumes (survive container rebuilds)
│     ├─ nextcloud-db/
│     ├─ immich-db/
│     ├─ jellyfin-db/
│     └─ {service-name}-db-or-data/
│
├─ services/                             # per-service app data mounted into VMs or LXCs
│  ├─ nextcloud/
│  │  ├─ app/                            # configs/apps (lots of small files → small_blocks=32–64K)
│  │  ├─ db/                             # DB dataset (e.g., MariaDB/Postgres)
│  │  └─ data/                           # user files (will overlap with media/photos exports)
│  ├─ immich/
│  │  ├─ app/
│  │  ├─ db/
│  │  └─ library/
│  ├─ jellyfin/
│  │  ├─ app/
│  │  └─ cache/                          # tiny metadata, posters, fanart
│  ├─ paperless-ngx/
│  │  ├─ app/
│  │  ├─ db/
│  │  └─ documents/
│  ├─ vaultwarden/ (and likewise for authentik, gitea, mattermost, technitium-dns, etc.)
│  │  ├─ app/
│  │  └─ db/
│  └─ {every other service you listed}/
│     ├─ app/
│     └─ db/
│
├─ media/
│  ├─ photos/
│  │  ├─ raw/                            # large RAWs
│  │  ├─ edits/                          # tiff/psd/jpg
│  │  └─ exports/
│  │     ├─ for-immich/                  # same exports also surfaced to…
│  │     └─ for-nextcloud/               # …Nextcloud (see dedupe/refline ideas)
│  ├─ videos/
│  │  ├─ home-video/
│  │  └─ movies-series/
│  └─ music/
│
├─ data/
│  ├─ documents/
│  ├─ projects/
│  └─ archives/
│
├─ home/                                 # user home trees (Windows & Linux)
│  ├─ windows/
│  │  ├─ users/
│  │  │  ├─ alice/
│  │  │  └─ bob/
│  │  └─ shared/
│  └─ linux/
│     ├─ users/
│     │  ├─ alice/
│     │  └─ bob/
│     └─ shared/
│
├─ shares/                               # SMB/NFS shares (if exposed)
│  ├─ smb/
│  │  ├─ homes/                          # Windows clients (Samba + Previous Versions if desired)
│  │  └─ team/
│  └─ nfs/
│     ├─ homes/                          # Linux clients
│     └─ team/
│
├─ games/                                # fast dedupe candidate: shared libraries
│  ├─ _shared/                           # parent with fast dedupe=on (DDT on special)
│  └─ clients/
│     ├─ pc01/
│     ├─ pc02/
│     └─ … up to pc10/
│
└─ friends/                              # per-tenant encrypted trees; raw sends only; no keys resident
   ├─ siteB/
   │  └─ vm/  lxc/  services/  data/
   └─ siteC/
      └─ vm/  lxc/  services/  data/

property ideas (so you can react without guessing our intent)

  • pool defaults: compression=lz4, atime=off, xattr=sa, acltype=nfs4, dnodesize=auto. (a property-setting sketch follows this list.)
  • special vdev usage: let metadata benefit automatically; consider special_small_blocks=32K or 64K on services/*/app, services/*/db, lxc/vol/*-db, caches/thumbs, and maybe vm/* (if file-backed) to speed small IO. avoid routing big sequential sets (e.g., media/videos); their metadata still benefits. if you’d pick a different threshold/datasets, we’d love your take.
  • VM disks: if files, we’re thinking dataset recordsize=64K or 128K, TRIM/discard on; if zvols, perhaps volblocksize=16K. what would you pick today?
  • PBS datastore: recordsize=1M, dedup=off, compression zstd-2/3 (or lz4 if you prefer). we plan separate PBS namespaces (or datastores) per tenant/site.
  • fast dedupe candidates (DDT on special):
    • games/_shared with fast dedupe=on; children under games/clients/* mostly reuse blocks so 10 machines don’t store 10× the same 10 TB.
    • VM/templates if we keep many golden images. (VMs and LXCs themselves?)
    • photo exports that appear in both media/photos/exports/for-immich and …/for-nextcloud.
      for large RAWs/videos, fast dedupe likely won’t help—leave it off there.
      alt to fast dedupe (curious for your view): if OpenZFS block-cloning/reflinks are suitable in our setup, we could use lightweight clones for those shared exports instead of dedupe. if you’ve tried this, we’d love tips.
  • LXC note: lxc/rootfs/* is the container OS; lxc/vol/* are attached datasets for DB/data that survive rebuilds.
  • “services-vm” clarification: we generally prefer one LXC per service; a “services VM” would only be for Kubernetes/Docker host purposes.
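
to make the property ideas above concrete (sketch only; the values are the starting points from this list and the dataset names come from the proposed tree; please correct anything that looks off):

# pool-wide defaults (inherited by children)
zfs set compression=lz4 tank
zfs set atime=off tank
zfs set xattr=sa tank
zfs set acltype=nfsv4 tank        # current OpenZFS spells the value "nfsv4"
zfs set dnodesize=auto tank

# small-block routing to the special class for app/db datasets
zfs set special_small_blocks=64K tank/services
zfs set special_small_blocks=64K tank/lxc/vol
# big sequential media: leave data on HDDs, metadata still benefits from special
zfs set special_small_blocks=0 tank/media

# selective dedupe candidates only
zfs set dedup=on tank/games/_shared

# VM disks: file-backed dataset vs zvol (Proxmox normally creates zvols itself;
# the 16K would go into the ZFS storage's "Block Size" setting)
zfs set recordsize=64K tank/vm/files
zfs create -V 100G -o volblocksize=16K tank/vm/zvols/windows/work-pc   # size is a placeholder

# PBS datastore dataset
zfs create -o recordsize=1M -o compression=zstd-3 -o dedup=off tank/pbs/datastore-main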

snapshot & retention sketch (so you can go straight to tuning)

  • group labels: e.g., nextcloud@bk-YYYYmmdd-HHMM applied to services/nextcloud/app|db|data and the lxc/rootfs/nextcloud-app (or VM) at the same minute. likewise immich@bk-…, jellyfin@bk-…, etc.
  • cadence (starting idea; a sanoid-style config sketch follows this list):
    • VM/LXC & app datasets: hourlies (24–48), dailies (14–30), weeklies (8–12), monthlies (6–12).
    • photos: similar to VM/LXC (more frequent).
    • videos/music (non-home): weekly + monthly.
  • test restores: likely 1–2×/year now; restore into an isolated “playground” subtree/host, validate, then destroy—no touchy production. if you have a smoother drill, we’d be delighted to copy it.
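
if we land on sanoid for this cadence, we picture something like the config below (sketch only; template names are ours, the counts come straight from the list above, and we’re assuming current sanoid supports the weekly interval the way we think it does; sanoid names its snapshots autosnap_*, so the coordinated bk-* group labels would still come from our own hook script run alongside it):

# /etc/sanoid/sanoid.conf (sketch)
[template_apps]
        hourly = 48
        daily = 30
        weekly = 12
        monthly = 12
        autosnap = yes
        autoprune = yes

[template_bulk]
        hourly = 0
        daily = 0
        weekly = 8
        monthly = 12
        autosnap = yes
        autoprune = yes

[tank/services]
        use_template = apps
        recursive = yes

[tank/media/videos]
        use_template = bulk
        recursive = yes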

replication details (so you don’t have to pull it out of us)

  • topology: leaning hub-and-spoke (my site hub; others spokes). spokes push Tier-1/2 to hub off-hours; hub pushes back a curated subset. optional spoke↔spoke for a few datasets only.
  • special→non-special: our remotes won’t have a special vdev (for years). we expect receives to land those blocks as “normal” and just work—if later restored back to a special-equipped source, they should route per current dataset properties again. does that match your experience?

thank you again for sharing your time and experience—this stuff is genuinely fun for us, and your pointers help us skip the potholes. if any choice above makes you wince, please say so; we’re happily steerable. and to anyone else passing by: your “here’s what i’d do” tips and war stories are very welcome. :folded_hands: