Three-site Proxmox + ZFS (15×16 TB/site, special vdev, PBS, encrypted off-site) — layout sanity check & best-practice questions

Hi everyone!
Three friends are building mirrored homelabs and backing up to each other. I’d love a sanity check and your best practices. I’m a tinkerer, not an expert—please be candid (flame me if you must, but be kind :sweat_smile:).

Priorities & workloads

  • Priority: Resilience > Performance > Capacity (target ≥80 TB usable on the main site after ~20% free space is kept).
  • Workloads (Proxmox 9):
    • Virtualized UTM (been virtualizing pfSense for ~10 years; OPNsense for ~3 years).
    • Proxmox Backup Server (PBS) on the same host (for VMs/LXCs/services; media/data handled separately via ZFS send/recv).
    • Media & general file storage, family photos.
    • Windows 11 “work PC” VM; SPICE (and noVNC) are the only remote consoles that keep working while the guest is on the corporate VPN (I believe the Proxmox console proxying helps).
    • Light databases for self-hosted services; some AI tinkering.
  • Compression: leaning lz4 unless you convince me otherwise (open to zstd guidance).

Hardware (my site; the other two are similar but smaller)

  • Chassis: Rosewill 4500 (15×3.5″), strong airflow.
  • CPU/RAM: AMD EPYC 7532, 512 GB ECC (8×64 GB—unlikely to grow; larger DIMMs are too pricey).
  • Motherboard: ASRock Rack ROMED8-2T/BCM; all PCIe slots support x4/x4/x4/x4 bifurcation.
  • HBA: Broadcom/LSI 9400-16i for the SATA Exos HDDs.
  • NIC: Mellanox ConnectX-4 (MCX4121A-ACAT) 25 GbE dual SFP28.
  • Spinning disks: 15 × 16 TB Seagate Exos (mix of batches, some recerts; all are 16 TB despite “Exos 16/18” naming).
  • NVMe (consumer): 4 × 4 TB FireCuda 530 + 2 × 512 GB FireCuda 530. No PLP; endurance: 4 TB = 5100 TBW, 512 GB = 640 TBW.
  • Other SSDs in the group: one site has Samsung 983 DCT M.2 960 GB (as I understand, with PLP); another site has unknown consumer Samsung “Pros.”
  • Power: my site = hours of UPS; the others ≈5–15 min.
  • LAN/WAN: my LAN is mostly 1 GbE (UniFi Switch Pro 24 PoE with 2×10 GbE); WAN is 1 Gbps symmetrical. The other sites are ~50–150 Mbps down / ~25 Mbps up. For “fast restore,” the plan is a direct 25 GbE DAC between servers when they’re physically co-located.

Pool design I’m considering (please critique!)

  • Main HDD pool: 7 × 2-way mirrors + 1 hot spare (14 drives in mirrors, 1 spare; rough command sketch after this list).
    • Aiming to clear ≥80 TB usable with ~20% free kept.
    • Mirrors for IOPS and fast rebuilds. I might later convert some to 3-way mirrors as larger disks arrive—but not now.
  • Special vdev (metadata/small files + DDT for “fast dedupe”): 4 × 4 TB FireCuda 530 arranged as two mirrored special vdevs (i.e., special class = stripe of two mirrors).
    • Tentative special_small_blocks = 32K or 64K to pull small data blocks (VM/service configs, small files) onto the NVMe; metadata lands on the special class regardless.
    • I know that losing the special vdev means losing the pool; hence mirrored special vdevs, good cooling, and frequent replication.
  • Boot: 2 × 512 GB FireCuda 530 mirrored for Proxmox boot (may also hold ISOs/container images in a small dataset).
  • SLOG / L2ARC: No SLOG (I don’t expect sync-heavy exports) and no L2ARC (512 GB RAM + special vdev should suffice).
  • Datasets: Prefer many datasets to tune per use case (recordsize/volblocksize, atime, special_small_blocks, dedup).
  • Dedup: Not pool-wide. Considering it only where it truly pays (shared game libraries, VM/template libraries, maybe photo duplicates across project datasets). I’d love rules-of-thumb that fit 512 GB RAM and consumer NVMe special vdevs.
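
To make the layout concrete, here is roughly what I plan to run. This is only a sketch: the pool name (tank), the by-id device names, and the dataset split are placeholders, and the property values are just my current leanings.

```
# 7 x 2-way mirrors + 1 hot spare; ashift=12 for the 4K-sector Exos drives.
zpool create -o ashift=12 -O compression=lz4 -O atime=off -O xattr=sa tank \
  mirror /dev/disk/by-id/ata-EXOS_01 /dev/disk/by-id/ata-EXOS_02 \
  mirror /dev/disk/by-id/ata-EXOS_03 /dev/disk/by-id/ata-EXOS_04 \
  mirror /dev/disk/by-id/ata-EXOS_05 /dev/disk/by-id/ata-EXOS_06 \
  mirror /dev/disk/by-id/ata-EXOS_07 /dev/disk/by-id/ata-EXOS_08 \
  mirror /dev/disk/by-id/ata-EXOS_09 /dev/disk/by-id/ata-EXOS_10 \
  mirror /dev/disk/by-id/ata-EXOS_11 /dev/disk/by-id/ata-EXOS_12 \
  mirror /dev/disk/by-id/ata-EXOS_13 /dev/disk/by-id/ata-EXOS_14 \
  spare  /dev/disk/by-id/ata-EXOS_15

# Special class as a stripe of two NVMe mirrors (metadata, opted-in small blocks, DDT).
zpool add tank \
  special mirror /dev/disk/by-id/nvme-FC530_1 /dev/disk/by-id/nvme-FC530_2 \
  special mirror /dev/disk/by-id/nvme-FC530_3 /dev/disk/by-id/nvme-FC530_4

# special_small_blocks stays 0 at the pool level (metadata only) and is enabled
# per dataset where small-file acceleration is actually wanted.
zfs create -o recordsize=128K -o special_small_blocks=32K tank/services
zfs create -o recordsize=1M   -o special_small_blocks=0   tank/media
```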

Replication, encryption & PBS

  • Backups: local PBS first (for VMs/LXCs/services), then off-site. Media/data handled via ZFS send/recv.
  • Mesh: full mesh (A↔B, A↔C, B↔C).
  • Transport: first full transfers over LAN; ongoing over WireGuard/Tailscale/SSH.
  • Plain-English targets: keep data loss small; in a disaster I can live with a rebuild taking up to about a week.
  • Concern: my site will have a special vdev; the others likely won’t. Any caveats sending from special→non-special pools (e.g., where special-class blocks land on the target; performance side-effects)?
  • Encryption: I want encrypted off-site backups so each person’s data remains private. Views on dataset-level encryption + raw (zfs send -w) sends vs alternatives? (Sketch of what I have in mind after this list.)
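
Concretely, the pattern I have in mind is dataset-level encryption on the source plus raw sends, so the other sites only ever hold ciphertext and never need my keys. Sketch below; hostnames, dataset names, key paths, and snapshot names are made up, and I’m assuming reasonably current OpenZFS on both ends (I’ve read older releases had bugs around raw incremental sends of encrypted datasets).

```
# Encrypted dataset on the source; the key never leaves my site.
zfs create -o encryption=aes-256-gcm -o keyformat=passphrase \
  -o keylocation=file:///root/keys/photos.key tank/photos

# Initial full replication to site B as raw (ciphertext) blocks; no key needed there.
zfs snapshot -r tank/photos@base
zfs send -w -R tank/photos@base | ssh siteb zfs receive -u backup/siteA/photos

# Ongoing incrementals over WireGuard/SSH.
zfs snapshot -r tank/photos@2025-06-01
zfs send -w -R -I @base tank/photos@2025-06-01 | ssh siteb zfs receive -u backup/siteA/photos
```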

What I’m hoping you can help me decide (and why)

  1. Vdev layout for resilience-first (still need ≥80 TB usable):
    Would you stick with 7×2-way mirrors + spare for IOPS and rebuild behavior, or steer me to 3×(5-wide RAIDZ2) for capacity while remaining robust? Any best practices for spreading drive batches across vdevs, and for spare strategy (on-pool hot vs shelf)? (Rough capacity math in the appendix at the end.)
  2. Special vdev details (metadata + DDT on special):
    Are two 2-way mirrored special vdevs (striped) a sane failure domain, or is a 3-way special mirror worth the capacity hit for home-lab “prosumer” reliability?
    What real-world starting point for special_small_blocks would you use for mixed VMs/media/photos/services? I’m torn between 32K and 64K; I want enough metadata/small-IO acceleration without dragging too many medium-sized files onto the special class. (Block-size-histogram sketch in the appendix.)
    Any consumer NVMe pitfalls you’ve seen when used as special vdevs (endurance/thermal), given my FireCuda 530s (no PLP, but good TBW)?
  3. Dedup scope with DDT-on-special:
    If I limit dedup to a few datasets (shared game libraries, VM/template libraries, maybe photo duplicates), is this sane with 512 GB RAM and the DDT living on the special vdevs? I’d love rules of thumb (e.g., DDT size per TB of logical data, RAM headroom) and how they fit my profile. (Dedup-simulation sketch in the appendix.)
  4. VM disks on ZFS (current best practice):
    In 2025, on Proxmox 9 + ZFS, would you put VM disks as sparse files on a dataset (qcow2 or raw) or on zvols? (Both variants sketched in the appendix.)
  • If files: preferred dataset props (recordsize, atime, TRIM/discard) and relevant Proxmox settings.
  • If zvols: your go-to volblocksize, snapshot/replication considerations, and any current pitfalls.
  • Most importantly: what do I truly gain/lose in practice (performance, fragmentation, management) with sparse files vs zvols?
  5. Windows “Previous Versions” from inside the VM (no SMB if possible):
    I’d love for users to restore files via Explorer’s “Previous Versions” inside the VM without exposing host SMB shares. Is there any practical route (e.g., iSCSI/vdisk + host snapshots surfaced in-guest, or managing VSS inside the guest) that actually works well? Or is Samba with vfs_shadow_copy2 effectively the only sane option for that experience? (My tentative smb.conf snippet is in the appendix.)
  6. PBS on ZFS (for VMs/LXCs/services only):
  • Dataset props you recommend (e.g., recordsize, compression choice, dedup=off?).
  • For cross-site: use today’s PBS push/sync features, or stick with zfs send/recv of PBS datastores?
  • A resilient but sane retention plan for a homelab (I’m thinking hourlies ~24–48 h, dailies ~14–30 d, weeklies ~8–12 w, monthlies ~6–12 m; feel free to edit). (Straw-man prune command in the appendix.)
  7. Tunables & sensible defaults (fit for my mix):
  • Compression: stay lz4, or move to zstd (which level) for VM/media mix?
  • atime mostly off? Any exceptions you’d keep on?
  • Defaults you like for xattr=sa, acltype for Samba (nfsv4 vs posixacl), dnodesize=auto, redundant_metadata=most, logbias=throughput on media datasets, etc. (My starting values are in the appendix.)
  8. Maintenance, health & monitoring (multi-site):
  • Even if drives arrive “pre-burned,” what burn-in would you still run (SMART long test, badblocks, fio patterns)? (Planned routine in the appendix.)
  • Scrubs: monthly? More frequent at first?
  • SMART replacement rules: your thresholds (reallocs/pending/etc.) vs “run to failure.”
  • Observability: Would you run per-site Prometheus exporters with federation and a central read-only Grafana at my site, or something simpler? (I’m open—just want reliable alerts across all three sites.)
  9. Odds & ends / context worth judging me for :sweat_smile:
  • I rarely expose SMB/NFS/iSCSI (I prefer Nextcloud/Seafile/Syncthing/Resilio).
  • I know virtualized firewalls have caveats; I’ve run them carefully for a long time.
  • Consumer hardware is a compromise—I welcome reality checks specific to this design.
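
Appendix: sketches & rough numbers for the questions above

Everything in this appendix is a sketch with made-up names (pool = tank, example datasets, hosts, storage IDs, and device paths); corrections very welcome.

For question 1, the rough capacity comparison I keep coming back to (decimal TB and binary TiB shown side by side; RAIDZ2 numbers are before allocation/padding overhead):

```
# 7 x (2-way mirror, 16 TB):   7 x 16 TB = 112 TB raw data space ≈ 101.9 TiB
#   with ~20% kept free:                 ≈  89.6 TB              ≈  81.5 TiB usable
# 3 x (5-wide RAIDZ2, 16 TB):  9 x 16 TB = 144 TB raw data space ≈ 131.0 TiB
#   with ~20% kept free:                 ≈ 115.2 TB              ≈ 104.8 TiB usable
```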
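
For question 2, before settling on 32K vs 64K I plan to look at the block-size distribution of a representative copy of my data. As I understand it, zdb can print a block-size histogram (slow on large pools), and zpool list -v shows how full the special class gets once it’s live:

```
# Block statistics including a block-size histogram (-L skips leak detection to speed it up).
zdb -Lbbbs tank

# Once the pool is running: per-vdev usage, including the special mirrors.
zpool list -v tank
```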
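
For question 3, my plan is to simulate first and only then enable dedup on specific datasets (names are examples). The rule of thumb I’ve seen quoted is on the order of a few hundred bytes of DDT per unique block, which is exactly the kind of number I’d like sanity-checked:

```
# Simulate dedup across the pool and print the estimated ratio/table (slow but non-destructive).
zdb -S tank

# If the numbers justify it, enable dedup only where it should pay off.
zfs set dedup=on tank/games
zfs set dedup=on tank/vm-templates

# Keep an eye on the dedup table afterwards.
zpool status -D tank
```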
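
For question 4, the two variants I’m weighing look roughly like this (storage IDs, dataset paths, and the 16k/64K block sizes are assumptions, not settled choices):

```
# Variant A: zvols via a Proxmox "zfspool" storage; volblocksize is set at the storage level.
pvesm add zfspool tank-vm --pool tank/vm --content images,rootdir --sparse 1 --blocksize 16k

# Variant B: raw/qcow2 files on a dataset exposed as a "dir" storage.
zfs create -o recordsize=64K -o atime=off tank/vmfiles
pvesm add dir tank-vmfiles --path /tank/vmfiles --content images

# Either way, enable discard + SSD emulation on the virtual disks so in-guest TRIM
# actually frees space on the thin-provisioned backend.
```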
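
For question 5, if Samba really is the sane route, this is roughly the vfs_shadow_copy2 setup I’d expect to need on the host. Share name, dataset, and snapshot-naming scheme are assumptions; the shadow: format string has to match however the snapshots are actually named:

```
# Snapshots created by a timer/cron job, named so shadow_copy2 can parse them.
zfs snapshot tank/userdata@shadow-$(date +%Y.%m.%d-%H.%M.%S)

# Share stanza for /etc/samba/smb.conf (written as a heredoc just to keep it in one block).
cat >> /etc/samba/smb.conf <<'EOF'
[userdata]
    path = /tank/userdata
    read only = no
    vfs objects = shadow_copy2
    shadow: snapdir = .zfs/snapshot
    shadow: sort = desc
    shadow: format = shadow-%Y.%m.%d-%H.%M.%S
    shadow: localtime = yes
EOF
```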
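
For question 6, my straw-man retention expressed as prune options (the same keep-* knobs exist on PBS prune jobs and on the client; repository, datastore, and group names are examples), plus the dataset I’d put the datastore on:

```
# 48 hourlies, 14 dailies, 8 weeklies, 6 monthlies; --dry-run first to see what would be removed.
proxmox-backup-client prune vm/100 \
  --repository backup@pbs@pbs.example.lan:datastore1 \
  --keep-hourly 48 --keep-daily 14 --keep-weekly 8 --keep-monthly 6 --dry-run

# ZFS dataset under the PBS datastore. PBS already compresses (and can encrypt) its chunks,
# so I'm unsure how much extra ZFS compression buys here (part of my question).
zfs create -o recordsize=1M -o atime=off -o compression=lz4 tank/pbs
```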
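
For question 7, the per-dataset starting points I’d apply unless told otherwise (values are my current assumptions; setting several properties in one zfs set call needs a reasonably recent OpenZFS):

```
# General defaults on the pool root, inherited downwards.
zfs set atime=off xattr=sa dnodesize=auto tank

# Media: large records, cheap compression, throughput-biased logging.
zfs set recordsize=1M compression=lz4 logbias=throughput tank/media

# Service/database datasets: smaller records, stronger compression.
zfs set recordsize=16K compression=zstd-3 tank/services/db

# Samba-shared data (this is exactly where I'm unsure about acltype on Linux).
zfs set acltype=posixacl aclinherit=passthrough tank/userdata
```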
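
For question 8, the burn-in and scrub routine I had pencilled in (device names are placeholders; badblocks -w is destructive, so only before the pool exists):

```
# Per-drive burn-in before pool creation.
smartctl -t long /dev/sdX        # extended self-test; read the result later with smartctl -a
badblocks -b 4096 -wsv /dev/sdX  # destructive write/verify passes; takes days on a 16 TB drive
smartctl -A /dev/sdX             # afterwards: check reallocated/pending/CRC counters

# Ongoing: monthly scrub (Debian's zfsutils ships a cron job for this) and a quick health check.
zpool scrub tank
zpool status -x
```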