Advice on KVM with Networked ZFS Storage

Hi,

I’m rebuilding my lab/production environment and looking for a sounding board on my approach.

For context, I run all my own services and store all my family’s data, so integrity and recoverability of that data are critical. The services I operate mostly impact only me, but my priority there is a speedy recovery.

I currently have 3 Epyc servers but could scale down to 1. I use these servers to run KVM with a range of workloads, with data backed up daily using restic to my local TrueNAS server. For the virtual machines I use bootc, so I can recover the VMs quickly and restore data from the last backup. My critical data, like family photos, lives on my TrueNAS server exposed over SMB, with hourly snapshots plus daily snapshots retained for 3 years of rollback. These are replicated to 3 other TrueNAS servers around the country.

My current thinking is that I can scale down to a single Epyc server with my local TrueNAS Mini X+ over a 10Gb link for network-based storage. I am thinking about:

  • having the operating system virtual machine qcow2 on the TrueNAS server (these are mostly read-only, using bootc with the files stored in git)

  • the /var/x mount is my only persistent volume which will also be on the TrueNAS server

    • would you recommend attaching a second qcow2 for /var/x or just mounting an NFS volume?
  • I currently back up all my data daily with scripts that do things like stopping containers (redis, mariadb, nextcloud, the quay registry, minecraft servers) and backing up the persistent data with restic to the TrueNAS server (a rough sketch of one of these scripts follows the hardware list below)

    • To reduce my window for data loss, I am hoping that by being on the TrueNAS server I can snapshot the datasets rather than relying only on the daily restic backups. Other than testing, what is the recommended approach to reliably understand whether my services will tolerate snapshots without the risk of corrupting the data?
  • My last question is about performance. I really enjoy my lab, so money isn’t the biggest concern. I have read the Klara posts about ZFS vdev layout and configuration for different workloads, and will align with those recommendations.

    • Anyone able to provide experience on performance? I intend to house the data on either mirrored SSDs or NVMe. Again, I can use my existing TrueNAS Mini X+ for these options, or re-purpose an Epyc server, at which point I have way more PCIe lanes and fast disk options to play with. My priority here is being able to enjoy using my lab, not being frustrated waiting for things to build.
  • The TrueNAS Mini X+ (Supermicro A2SDi-8C+-HLN4F) currently has 4 spinning HDDs

  • A secondary TrueNAS server (Supermicro A2SDi-16C-HLN4F) has 6 Crucial MX500 SSDs

  • The Epyc servers use the ROMED6U-2L2T board with an Epyc 7302 and a single NVMe drive for local storage.
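For reference, the backup scripts mentioned above are roughly this shape. A minimal sketch; the container names, runtime (podman shown here), repo path and password file are simplified placeholders rather than my exact setup:

#!/bin/bash
set -euo pipefail
# stop the services so their on-disk state is quiescent before backup
podman stop nextcloud mariadb redis
# push the persistent data to the restic repo on the TrueNAS box
# (repo path and password file are placeholders)
export RESTIC_PASSWORD_FILE=/root/.restic-pass
restic -r sftp:backup@truenas:/mnt/tank/restic backup /var/srv
# bring everything back up
podman start redis mariadb nextcloud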

Also open to feedback - I want to be as professionally responsible as possible here whilst not going over the top.

Appreciate the advice,

Adam

First question is: why separate compute and storage? The storage transport network is a SEVERE bottleneck in your intended application. Running hyperconverged eliminates that bottleneck entirely, and simplifies setup as well.

Second note: apologies if this is romper-room advice, but you need to pay SERIOUS attention to both PCIe lanes and CPU firepower if you want to get the most out of 10G networking… and even more so when you need to get the most you can out of single-threaded networking. I’ve had difficulty breaking 4-5Gbps on a single network thread (e.g. an SSH pipe or a single SMB file transfer) even on pretty recent and powerful server CPUs.

Anyone able to provide experience on performance? I intend to house the data on either mirrored SSDs or NVMe.

When your storage is on the other side of a LAN, the bare metal storage, whether SATA SSD or NVMe SSD, is extremely unlikely to be a significant bottleneck. Mirrors are still the correct answer, because RAIDz can introduce noticeable latency bottlenecks… but even those get added on top of the inescapable storage transport bottleneck.

It’s a LOT easier to get 2GiB/sec out of your local storage than it is to get 1GiB/sec out of network storage, even on a 10G LAN.

Essentially, I’d recommend simplifying down to a single server running the KVM hypervisor and an OpenZFS storage pool for “production” (where your VMs actually live and run in normal circumstances). This eliminates the storage transport bottleneck AND frees up a lot of that 10G network (which you haven’t said is actually separate from your data transport network) for serving the data the VMs produce to the end users, rather than consuming double the throughput for every block read or written: once from client to VM, and again from VM host to storage host.
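If you go that route, the pool itself is quick to stand up. A minimal sketch, assuming two NVMe drives and the pool name pool used in the examples below (device names are placeholders):

root@prod:~# zpool create -o ashift=12 pool mirror /dev/nvme0n1 /dev/nvme1n1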

If you put all your VMs in child datasets beneath a single parent dataset (we’ll call that parent pool/images), you can then create pool/images/qemu, rsync -a /etc/libvirt/qemu/ /pool/images/qemu/, mv /etc/libvirt/qemu /etc/libvirt/qemu-dist, and finally zfs set mountpoint=/etc/libvirt/qemu pool/images/qemu.
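Laid out step by step on the prod host (same commands as above; I’m assuming the pool mounts at /pool, and you’ll want libvirtd stopped while you swap the directory):

root@prod:~# zfs create pool/images/qemu
root@prod:~# rsync -a /etc/libvirt/qemu/ /pool/images/qemu/
root@prod:~# mv /etc/libvirt/qemu /etc/libvirt/qemu-dist
root@prod:~# zfs set mountpoint=/etc/libvirt/qemu pool/images/qemu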

Now, your VM definitions are stored beneath the same parent dataset as their storage. That means a single command run from a backup server, root@backup:~# syncoid -r root@prod:pool/images pool/images, backs up EVERYTHING that matters to your backup server. Best part? If you installed the KVM packages on that backup server, you can promote all those “backups” in place to running production VMs very easily:

root@backup:~# for vm in /pool/images/qemu/*.xml ; do virsh define "$vm" ; done

Now, comment out the crontab entry you presumably have running that backup command (reminder: syncoid -r root@prod:pool/images pool/images) to make sure you don’t clobber your newly-promoted-to-prod VMs, and you can virsh start each of them. Not only will they immediately fire up and run, your users won’t even know the host has changed: as long as both prod and backup are on the same local subnet, the VMs will automatically pull the same IPs they always do via the backup host’s network bridge, and that’s it, no further configuration!
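For reference, that crontab entry is just the pull replication on a schedule. A minimal sketch as a cron.d file on the backup host, assuming syncoid is installed at /usr/sbin/syncoid and you want hourly pulls:

# /etc/cron.d/syncoid-pull on the backup host: hourly pull from prod
0 * * * * root /usr/sbin/syncoid -r root@prod:pool/images pool/images

And the start step is just a loop like the define loop above; if your XML filenames match the domain names (libvirt’s default), something like:

root@backup:~# for vm in /pool/images/qemu/*.xml ; do virsh start "$(basename "$vm" .xml)" ; done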

The final thing I’ll mention is that hourly replication to a local hotspare is amazing, but you know what’s even better? Hourly replication to a local hotspare plus daily replication to an offsite disaster recovery box, which you can otherwise manage precisely the way you did the hotspare, with all of the same options. The one catch: if you lose the site that prod and the hotspare are on, you’ll first need to drag the disaster recovery box to your users and make sure it’s on the same IP subnet that prod and the hotspare used to be on, so that the VMs’ IP addresses are still reachable by the users.
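The DR box’s cron entry is the same shape, just daily instead of hourly; for example (the 3 AM start time is arbitrary):

# /etc/cron.d/syncoid-pull on the DR box: daily pull from prod
0 3 * * * root /usr/sbin/syncoid -r root@prod:pool/images pool/images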

Now, you might not need or want offsite backup for all of your stuff. But keep in mind, you can also choose to do offsite backup for some of it, just the things you are most concerned with making sure you still have even if the house burns down. That doesn’t even necessarily have to be ZFS–maybe you just install the iDrive or Carbonite or Backblaze client inside one of your VMs, and do a very very traditional cloud-based backup for just those important files, on a per-file level. That’s up to you.

If your VM filesystems are journalling (which all modern filesystems intended for primary storage are: ext4, XFS, NTFS, UFS2, HFS+, APFS, and so on; basically only the cheap-portable-drive filesystems like FAT32/VFAT/exFAT are non-journalling these days), then you’re solid in terms of filesystem corruption: they’re crash-safe, which also means they’re snapshot-safe.

If your database engines are journaling, e.g. MSSQL, MySQL/MariaDB on InnoDB (but not MyISAM!), or PostgreSQL, then they are also crash-consistent, and therefore snapshot-consistent as well.
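Since you mentioned mariadb earlier: a quick way to confirm none of your own tables are still sitting on MyISAM is to ask information_schema (run inside the VM; adjust the client invocation and credentials to your setup):

root@vm:~# mysql -e "SELECT table_schema, table_name, engine FROM information_schema.tables WHERE engine = 'MyISAM' AND table_schema NOT IN ('mysql', 'information_schema', 'performance_schema');"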

FINALLY, if the service applications using those journalling DB engines are built to be ACID-compliant, that means they are transactionally safe, and therefore both crash-safe and snapshot-safe as well.

Please note, these are all the same considerations you’d have for a system that loses power or experiences any other kind of crash. In 2025, anything not crash-safe should be considered a serious bug, and if the devs are unwilling to fix it, you should really be looking at alternatives!

One final note: while it’s of course quite possible for a DB application which isn’t ACID-compliant (ACID compliance meaning that dependent queries are wrapped in transactions, so that either all queries in the transaction complete or NONE do) to become inconsistent at the application level after a crash, despite the filesystem and DB engine beneath it having maintained their own consistency, this is usually pretty unlikely on a relatively lightly-loaded server, which is what one would expect a homelab machine to be.

In practice, I’ve experienced application-level database inconsistency on top of a crash-consistent filesystem and DB engine maybe two or three times in a sysadmin career spanning decades. One of those times I was using OpenZFS, and I just rolled back to the snapshot prior to the one with the application-level inconsistency, which did cost me an extra hour on my RPO… but it resolved the issue immediately, adding no more than seconds to my RTO.


Thanks for taking the time to write such an extensive response, Jim; I really appreciate it. I had become fixated on the “over the network” storage, and I agree that a hyper-converged approach is better. On a positive note, it will also give me the opportunity to try syncoid and sanoid, which may wean me off TrueNAS and give me a common deployment pattern across all my systems.

For professional development reasons my preference is to remain in the RHEL/CentOS/Fedora family. I understand ZoL is well supported on RHEL, but I haven’t tried it myself.

Regarding your comments re journaling filesystems and ACID compliant databases - that is super interesting and really opens up a whole new world for me for both RPO and RTO. Excited to design this out and get building.

Thanks again Jim,

Adam
