Possible ZFS setup for new storage servers?

Hello,

I have the following hardware:
2 servers with 8x 8TB Samsung PM893 enterprise(?) SATA SSDs, connected via onboard SATA;
2 servers with 4x 12TB WD SAS HDDs + 4x 960GB Samsung PM1643a SAS SSDs, connected via an HBA controller.
All servers are Supermicro 8-bay units with 1x E5-2690 v4, 48GB RAM (can be increased), and 2x 10Gb links per unit.

One of the “hybrid” SSD+HDD servers is currently acting as the storage node for ~100 VMs running on 6 half-full compute nodes; the disks are qcow2 files on an NFS share, and the virtualization environment is Proxmox.
The storage scheme for this node is LVM on top of an HDD mdadm RAID10, plus lvmcache on top of an SSD mdadm RAID10. It was built as a temporary test solution and, as usual, somehow became “prod”. In terms of performance it… well, at least it works for now.

The other 3 storage nodes are empty for now, and I'm in the middle of searching for the best-fitting configuration for the setup described. The VM count will increase to ~600-700; part of them will eventually move to an on-premises private cloud solution, most likely CloudStack (on top of the existing Proxmox environment, or maybe bare metal). Most VMs have a “no cache” async disk profile. All hardware is UPS-backed with graceful shutdown.
VM I/O load is purely random: there are dev, test, preprod, CI/CD, databases, Java apps, YouTrack, TeamCity, GitLab, etc. Purely random.

At first I did not want to mess with ZFS. I know it only at a very basic level (my homelab has happily lived on ZFS from 0.6 through 2.1.12 so far), without any deep dives. The planned solution was to use LVM VDO for both the hybrid and all-flash storage nodes, but in short, it failed completely in testing. The only things I needed from it were compression on the all-flash nodes, and VDO+lvmcache on the hybrid nodes for both data and cache compression, to save IOPS when writing back from cache to the slow HDDs.

I can always do classic mdadm+LVM for all-flash and mdadm+lvmcache for hybrid. But maybe ZFS can help somehow? Faster, more intelligent rebuilds and data integrity checking would of course be a benefit.

Could you suggest the ZFS setup best fitting the 2 types of storage nodes described above, for maximum possible storage performance? I know words like ARC, ZIL, and SLOG, but have no real experience with them. Or should I leave it all as-is with classic solutions?

Most VMs have a “no cache” async disk profile

Does this mean you’re ignoring sync() requests on those VMs, and just doing all writes as async regardless of what the VM asked for?

2 servers with 8x 8TB Samsung PM893 enterprise(?) SATA SSDs, connected via onboard SATA

You're likely leaving a lot of performance on the table by sticking with mobo SATA here. You did say Supermicro, so there's at least a chance your “mobo SATA” is actually a proper LSI SAS chipset. But IME, the majority of even Supermicro boxes (especially the ones with only eight bays) use a more generic SATA/SAS chipset that doesn't perform noticeably better than consumer “mobo SATA” does.
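If you're not sure which chipset you've actually got, a quick look at the PCI bus will tell you (assuming a Linux host; the grep pattern is just a convenience):

```
# List storage controllers; a real LSI/Broadcom SAS HBA shows up clearly here
lspci | grep -iE 'sata|sas|raid'
```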

If you can't move 1.5GiB/sec or more off those arrays on the least challenging workload possible, that confirms that you're leaving a lot of high-end performance on the table by not adding a proper LSI HBA, which opens up that top end considerably.
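For reference, “least challenging workload possible” means big sequential reads; a rough fio invocation along these lines would do it (path and sizes are placeholders, adjust to your array):

```
# Large sequential reads against a test file on the array under test
fio --name=seqread --filename=/mnt/array/fio-testfile --rw=read \
    --bs=1M --size=16G --ioengine=libaio --iodepth=8 --direct=1
```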

2 servers with 4x 12TB WD SAS HDDs + 4x 960GB Samsung PM1643a SAS SSDs, connected via an HBA controller

You're probably going to want separate pools on these machines, rather than trying to use those nice large SSDs just to mitigate the performance issues of a very small number of HDDs.

The potential exception here is a LOG vdev (aka “SLOG”) to accelerate sync writes. But if the VMs with the “no cache” async disk profile live here, and I understood correctly that this means sync requests aren't honored for those VMs, you don't need the LOG vdev; you can simply zfs set sync=disabled on the datasets where those VMs live.
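As a minimal sketch (pool/dataset names hypothetical):

```
# Only do this if you genuinely don't care about honoring sync requests on
# these datasets; a crash can lose the last few seconds of writes, though
# pool consistency itself is never at risk.
zfs set sync=disabled rust/vmstore
```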

Could you suggest the ZFS setup best fitting the 2 types of storage nodes described above, for maximum possible storage performance?

If you want “max possible storage performance”, the most important thing is that you go with mirrors, not RAIDz. On both your SSD and rust pools, you should be running pools of two-way mirror vdevs. The difference in performance on challenging workloads is enormous.
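A sketch of what that might look like on each node type (pool and device names are illustrative only; in production you'd want /dev/disk/by-id paths):

```
# All-flash node: eight PM893s as four two-way mirrors
zpool create flash mirror sda sdb mirror sdc sdd mirror sde sdf mirror sdg sdh

# Hybrid node: separate pools for rust and flash, two mirrors each
zpool create rust mirror sda sdb mirror sdc sdd
zpool create fast mirror sde sdf mirror sdg sdh
```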

You'll also generally want small recordsizes if you're running a ton of random I/O on these VMs. Ideally, you match the ZFS recordsize to the workload: a MySQL DB gets 16K to match its page size (or possibly 32K to increase compression ratio), a PostgreSQL DB gets 8K to match its page size (or possibly 16K, see previous note), and a bulk storage workload gets 1M (since it has no random access within individual files).
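Expressed as properties (dataset names are examples only):

```
zfs set recordsize=16K tank/mysql      # InnoDB's 16K page size
zfs set recordsize=8K tank/postgres    # PostgreSQL's 8K page size
zfs set recordsize=1M tank/bulk        # large files, no random access within them
```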

Of course, you did say Proxmox, so I'm not sure how you're setting this up exactly. Proxmox usually wants to do zvols, but since you're not doing ZFS with it at all right now, I'm not 100% sure what your options are moving forward. If you're doing iSCSI storage transport, you can use either zvols or flat files as the targets, and you should set volblocksize=8K, 16K, or 32K as above. I wouldn't recommend going larger than that if you're using zvols.
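If zvols end up being the route, note that volblocksize can only be set at creation time; a hypothetical example:

```
# -s = sparse; the name, size, and block size here are placeholders
zfs create -s -V 200G -o volblocksize=16K tank/vm-101-disk-0
```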

Set compression=lz4 (zstd gets better ratios, but I keep seeing iffy performance by comparison) and atime=off. Set up a cronjob to run zpool trim on the SSD pools daily, at whatever the closest thing to an idle period you have is.
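Putting those together (pool names are examples; pick whatever your actual idle window is for the trim):

```
zfs set compression=lz4 flash
zfs set atime=off flash

# crontab entry: trim the SSD pool daily at 04:00
0 4 * * * /usr/sbin/zpool trim flash
```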