Anyone keep a cheat sheet on resource 'costs' for different ZFS 'layouts'? (brain game?)

So I was thinking about how to design the ‘next’ ZFS system. It’s likely going to be aarch64-based and built to be extended for decades to come.

It’s likely a couple of years before the build starts, but intel gathering can never start too early…

Trying to keep this as generic as possible so it might be of use to someone else…

Anyway, planning is required, since we’re stuck with whatever layout we pick once we start…
Growing a raidz or draid vdev isn’t possible without extra hardware. You can of course add an extra vdev, but you can’t go from a 3-disk raidz1 to a 6-disk raidz2 or a 9-disk raidz3. That’s simple with small stuff, but when sizes get big it gets expensive to grow the system, and not just in money. I would not like a 3-disk raidz1 with, say, 22 TB drives on a SATA2 interface, and then adding a second vdev so you end up with 2x 3-disk raidz1 in a stripe instead of one 6-disk raidz2. Done that, still got the scars, and I remember the nightmares… It wasn’t a problem when disks were a couple of TB, but they’re getting pretty big pretty fast these days…
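
For anyone who hasn’t hit this wall: a minimal sketch of that growth path, with made-up pool and device names. Adding a second vdev is the only online option; nothing converts the existing raidz1 into a raidz2:

```
# initial 3-disk raidz1 (device names are hypothetical)
zpool create tank raidz1 /dev/sda /dev/sdb /dev/sdc

# the only growth without a rebuild: bolt on a second vdev,
# which stripes with the first (2x 3-disk raidz1, not a 6-disk raidz2)
zpool add tank raidz1 /dev/sdd /dev/sde /dev/sdf

# there is no zpool command that converts the raidz1 vdev to raidz2;
# changing parity level means destroying and recreating the pool
```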

Also, when it comes to layouts I’m thinking of potential hardware bottlenecks like SATA2/SATA3/PCIe lanes/HBA cards…

It would be nice to have some kind of score system for a ZFS setup, so it’s possible to see when incremental growth gets more expensive than just doing a total rebuild (i.e. the point where adding stuff no longer scales).

Things like ZFS block devices, i.e. zvols (like some people use for swap, ugh…), plus iSCSI, for example. Bad idea?
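
For concreteness, this is roughly what I mean by blockdev + iSCSI on a storage node; the pool name, zvol names, and IQN are placeholders, and targetcli/LIO is just one possible target stack:

```
# carve 20 zvols out of the local pool (names/sizes are examples)
for i in $(seq 0 19); do
  zfs create -V 100G tank/blk$i
done

# export one of them over iSCSI via targetcli (LIO)
targetcli /backstores/block create name=blk0 dev=/dev/zvol/tank/blk0
targetcli /iscsi create iqn.2025-01.local.node2:blk0
targetcli /iscsi/iqn.2025-01.local.node2:blk0/tpg1/luns create /backstores/block/blk0
```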

The idea I got, maybe stupid or not, who knows?
It is to use a dedicated network for ZFS communication between the nodes in a draid setup.
That way a node can start small and be extended in a layered approach…

Each “network” node has its own local ZFS system that is exposed only as block devices, nothing else.

If we only have the starting system (think proof of concept),
we start with 3 Raspberry Pis: 1 without a disk, and 2 with one disk each…

Node 1 is the interface node; we start it as a draid built from 40 iSCSI ‘disks’ (distributed over the other two nodes).
Nodes 2 and 3 carry 20 ZFS block devices each on their pools (say 100 GB apiece for simplicity).
Node 1 now has a 40-disk draid: 4 TB raw, call it roughly 2 TB usable depending on parity and spares.
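
On node 1 the assembly would look something like this, assuming open-iscsi as the initiator; hostnames, device paths, and the draid layout (draid2, 8 data disks per group, 2 distributed spares) are just example choices:

```
# discover and log in to the targets on nodes 2 and 3
iscsiadm -m discovery -t sendtargets -p node2
iscsiadm -m discovery -t sendtargets -p node3
iscsiadm -m node --login

# build one draid vdev out of the 40 iSCSI 'disks'
zpool create tank draid2:8d:40c:2s /dev/disk/by-path/ip-*-iscsi-*-lun-0
```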

If we now grow the disks and bump all those block devices to 200 GB, cycling them so the interface pool only ever degrades briefly, the interface should be able to pick up a few of the resized devices each iteration and expand to 4 TB, invisible to the outside network and its ‘files’.
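
The grow cycle could be as boring as this, one blockdev at a time so the draid never loses more than one ‘disk’ at once (the device path is a made-up example):

```
# on the storage node: grow one zvol
zfs set volsize=200G tank/blk0

# on the interface node: let the pool grow into resized devices automatically
zpool set autoexpand=on tank
# rescan the iSCSI session so the kernel sees the new LUN size
iscsiadm -m session --rescan
# expand the one device explicitly if autoexpand hasn't picked it up
zpool online -e tank /dev/disk/by-path/ip-10.0.0.2:3260-iscsi-iqn.2025-01.local.node2:blk0-lun-0
```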

The idea with blockdev + iSCSI is that when one node runs out of disk, we just introduce another RPi node, node 4, with a disk double the size of those in nodes 2 and 3, and move half of the block devices and their snapshot history over to that node from both of the other nodes. That gives us the ability to grow the blocks to double the size once more and expand the interface to 8 TB…
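
Moving a blockdev with its history is plain zfs send/receive, something like the below (node names are made up; the iSCSI target then has to be recreated on node 4 and the initiator re-pointed):

```
# on node 2: snapshot and ship one zvol with all its snapshots to node 4
zfs snapshot tank/blk0@migrate
zfs send -R tank/blk0@migrate | ssh node4 zfs receive tank/blk0

# once node 4 exports it and node 1 has logged in again,
# drop the original copy on node 2
zfs destroy -r tank/blk0
```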

Since draid is distributed, it would, at least in theory, be possible to run on cheap hardware like an RPi (aarch64) with a USB 3.2 disk, or even better something like an RPi (aarch64) with a SATA3 port or NVMe, since losing a few nodes does not matter that much in a sharded setup like this.

And in any case, 20 RPis with ‘small’ SSDs are a lot cheaper than building a single node with 4 x ‘small × 5’-sized SSDs (and it scales better as you grow, so each node ends up somewhere around 20 TB+ of SSDs). The same can be done with hardware other than RPis, like other SBCs, or by putting another node in as an interface and hanging 4 RPis below one RPi…

The network will eventually saturate, but even that can be partitioned in other ways…
It’s a more component-based approach…

Now, the big question I can’t find an answer to…
How costly is rebuilding & scrubbing on a draid compared to a larger raidz(1,2,3) or a single disk?

The interface draid node maybe needs to be something a bit more massive than an RPi…

At least in theory, this approach seems like a possible solution for growing ZFS backends without the normal headache: buying for the needs you expect 2-4 years down the road, only to find that when that day comes the need has increased exponentially, so it was a stupid move from the beginning, resulting in a total rebuild or performance degradation.

Guess this is a topic without any interest?
I actually expected at least someone to chime in on it within a few days…

That score system, that is: it would really be the start of a way to benchmark designs,
so different layout variants could be objectively measured against each other.