ZFS pool setup and disk layout

I’m trying to make sure that I am using the best setup with what I have at the moment. I currently have the following:

  • Ubuntu base OS on 500GB SSD
  • A 3-wide stripe of double (three-way) mirrors built from 4TB HDDs, which looks like the following:
	NAME          STATE     READ WRITE CKSUM
	my-pool       ONLINE       0     0     0
	  mirror-0    ONLINE       0     0     0
	    02        ONLINE       0     0     0
	    03        ONLINE       0     0     0
	    04        ONLINE       0     0     0
	  mirror-1    ONLINE       0     0     0
	    05        ONLINE       0     0     0
	    06        ONLINE       0     0     0
	    07        ONLINE       0     0     0
	  mirror-2    ONLINE       0     0     0
	    08        ONLINE       0     0     0
	    09        ONLINE       0     0     0
	    10        ONLINE       0     0     0
	cache
	  nvme0n1     ONLINE       0     0     0

The above is a single pool and is used for running VMs via qemu/kvm for my small business. As I am using it for that purpose, I followed the recommendations of this article from Klara Systems. I have also since read this response to a somewhat, although not completely, related question, and I’m thinking I should implement the small change referenced there.

I’m not worried about any crazy optimizations. This is a small business with two people accessing the VMs and data, and there is no extreme I/O. There are production VMs for Nextcloud, project management software, invoicing/time-tracking software, an Omada controller, Pi-hole, and a bunch of test VMs.

My main concern with the setup above is that if I lose one disk in mirror-0, then I lose the pool. Or is that incorrect?

Any suggestions relating to my setup are appreciated.

In the future, I was thinking about a mirror of raidz1 using bigger disks, but want to fully understand what I’m doing before buying them.

My main concern with the setup above is that if I lose one disk in mirror-0, then I lose the pool. Or is that incorrect?

Redundancy in ZFS is at the vdev level, not the pool level.

With a three-way mirror, you can lose up to two disks per vdev with no data loss, but you’ll see a read performance hit.

I don’t usually recommend L2ARC, unless you physically can’t add more RAM to that system.


Thanks @bladewdr. Good to know about the L2ARC. The system currently has 256GB of RAM, with room for another 64GB if I wanted, but for what I’m doing that’s sufficient.

How do I go about removing the L2ARC?

Extremely incorrect. What you’ve shown us is a pool of nine drives broken into 3x three-wide mirror vdevs. This means you can lose any two drives without losing any data, and your odds are pretty good for being able to lose more than that.

You don’t lose a pool comprised of mirror vdevs unless you lose all of the disks in any one of those vdevs. So you could lose, e.g., disks 02, 03, 05, 07, 09, and 10 without losing any data; but if you lost 02, 03, and 04, you’d lose the pool.
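
If you want to see that behavior for yourself without risking real data, you can sketch it out with a throwaway pool built from file-backed vdevs (the pool and file names below are just placeholders):

# throwaway demo pool built from sparse files instead of real disks
for d in a b c d e f g h i ; do truncate -s 1G /tmp/m-$d.raw ; done
zpool create mirrordemo \
  mirror /tmp/m-a.raw /tmp/m-b.raw /tmp/m-c.raw \
  mirror /tmp/m-d.raw /tmp/m-e.raw /tmp/m-f.raw \
  mirror /tmp/m-g.raw /tmp/m-h.raw /tmp/m-i.raw

# knock out two of the three disks in the first mirror:
zpool offline mirrordemo /tmp/m-a.raw
zpool offline mirrordemo /tmp/m-b.raw
zpool status mirrordemo    # pool shows DEGRADED, but all of the data is still there

# ZFS will refuse to offline the last disk in that vdev ("no valid replicas");
# losing that disk for real is what would take the whole pool down with it.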

Any suggestions relating to my setup are appreciated.

It’s pretty wasteful in terms of capacity, honestly. Do you have an actual backup in addition to this server? If not, you really ought to worry less about redundancy in this machine and more about your RTO and RPO (Recovery Time Objective and Recovery Point Objective), which usually means a second server that this one replicates to regularly.
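
For what it’s worth, here is a minimal sketch of that replication, assuming a second box reachable over ssh (the hostnames, pool names, and snapshot names are all hypothetical; tools like sanoid/syncoid automate exactly this pattern):

# take a recursive snapshot of the production pool
zfs snapshot -r my-pool@backup-1

# the first replication is a full send of everything
zfs send -R my-pool@backup-1 | ssh backup-box zfs receive -F backup-pool/my-pool

# later runs send only what changed between the two snapshots
zfs snapshot -r my-pool@backup-2
zfs send -R -i my-pool@backup-1 my-pool@backup-2 | ssh backup-box zfs receive -F backup-pool/my-pool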

a mirror of raidz1

That’s not actually a thing. Here’s how it works: the pool is a JBOD comprised of vdevs. The vdevs are whatever the vdevs are, comprised of disks.

So right now, you’ve got a JBOD of mirror vdevs. You could, for example, have instead configured the same drives as a JBOD of 3-wide RAIDz1 vdevs, which would look like this:

root@elden:/tmp# for d in {0..9} ; do truncate -s 1G $d.raw ; done
root@elden:/tmp# zpool create demopool raidz1 /tmp/0.raw /tmp/1.raw /tmp/2.raw raidz1 /tmp/3.raw /tmp/4.raw /tmp/5.raw raidz1 /tmp/6.raw /tmp/7.raw /tmp/8.raw
root@elden:/tmp# zpool status demopool
  pool: demopool
 state: ONLINE
config:

	NAME            STATE     READ WRITE CKSUM
	demopool        ONLINE       0     0     0
	  raidz1-0      ONLINE       0     0     0
	    /tmp/0.raw  ONLINE       0     0     0
	    /tmp/1.raw  ONLINE       0     0     0
	    /tmp/2.raw  ONLINE       0     0     0
	  raidz1-1      ONLINE       0     0     0
	    /tmp/3.raw  ONLINE       0     0     0
	    /tmp/4.raw  ONLINE       0     0     0
	    /tmp/5.raw  ONLINE       0     0     0
	  raidz1-2      ONLINE       0     0     0
	    /tmp/6.raw  ONLINE       0     0     0
	    /tmp/7.raw  ONLINE       0     0     0
	    /tmp/8.raw  ONLINE       0     0     0

This pool also uses nine drives total split into three vdevs, but the individual vdevs are three-wide RAIDz1. You can afford to lose one drive from any vdev in this case, but if you lose a second drive from the same vdev, you lose the pool with it.

This pool would offer significantly higher write performance than your current pool, and significantly more available storage: assuming 4TB drives, you get roughly 12TB now and would get roughly 24TB out of this pool of RAIDz1 vdevs.
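
If you want to sanity-check those numbers, the throwaway pools above make the comparison easy, since AVAIL already accounts for the redundancy overhead in both layouts:

# mirrordemo (three 3-way mirrors) vs demopool (three 3-wide RAIDz1),
# built from the same number and size of backing files:
zfs list mirrordemo demopool    # the RAIDz1 layout shows roughly double the usable space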

What you should absolutely not do is put all nine drives in one big RAIDz1 vdev. It’ll perform terribly, won’t get you anywhere near as much available storage as you might naively expect, and it won’t be anywhere near fault tolerant enough.

Essentially, RAIDz1 is only acceptable at the three-wide level. Any wider than that, and you need to be looking at RAIDz2 and dual parity instead.
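
For illustration, dual parity on a wider vdev looks like this (again with throwaway file vdevs, and the pool name is just a placeholder):

# six-wide RAIDz2: any two drives in the vdev can fail without losing data
for d in {0..5} ; do truncate -s 1G /tmp/z2-$d.raw ; done
zpool create z2demo raidz2 /tmp/z2-0.raw /tmp/z2-1.raw /tmp/z2-2.raw /tmp/z2-3.raw /tmp/z2-4.raw /tmp/z2-5.raw
zpool status z2demo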


Great. Thank you. That actually makes sense. My misunderstanding was that the pool contained the vdevs, somewhat akin to a directory within a directory. Your explanation makes way more sense than how I initially understood it.

In the works. You actually just responded to me about it here, or at least about what I’m trying to work on. My backup box only has room for one HDD, which is currently a 14TB HDD.

That was my guess, but I didn’t quite know why. Not to mention, I wasn’t sure how to set it up to best use the space AND make use of the aforementioned backup.

Got it. That makes sense now. I incorrectly understood how vdevs worked.

Great in theory, but as backups grow in size, my backup box will be too small. I have another box that holds 4 drives, but don’t have drives for it at the moment. How do I best balance that?

Adding to all of this, I bought a 12-bay Supermicro server (with an LSI HBA) that I want to transfer the 9 disks into, which also provides room for expansion. Because of this, I was trying to figure out the best setup.

There are essentially two philosophies when it comes to building backup boxes: you either build the backup to exactly the same spec as the production box, so that it can be pressed into service directly if prod falls over, or you build it for lower performance but higher capacity than prod.

As an example of the latter, let’s say you’ve got two twelve-bay machines, each with 12x 12TB drives available. You might build prod0 with six two-wide mirror vdevs (rough usable capacity of 72TB) and dr0 with two six-wide RAIDz2 vdevs (rough usable capacity of 96TB).

Or, perhaps you only buy ten drives for dr0, and it gets a single 10-wide Z2 vdev with the same roughly 96TB usable capacity, but significantly lower performance and slightly lower failure tolerance (since losing any three drives loses the vdev, whereas with two vdevs you had to lose three drives from the same vdev).
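
Purely as an illustration, those two hypothetical layouts would be created roughly like this (device names are placeholders; in practice you’d use /dev/disk/by-id paths):

# prod0: six 2-wide mirrors out of twelve 12TB drives (roughly 72TB usable, best IOPS)
zpool create prod0 \
  mirror sda sdb  mirror sdc sdd  mirror sde sdf \
  mirror sdg sdh  mirror sdi sdj  mirror sdk sdl

# dr0: two 6-wide RAIDz2 vdevs out of the same twelve drives (roughly 96TB usable)
zpool create dr0 \
  raidz2 sda sdb sdc sdd sde sdf \
  raidz2 sdg sdh sdi sdj sdk sdl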

Personally, I most commonly build identical systems, and usually in topologies of three: onsite production, onsite hotspare, and offsite disaster recovery. My breakfix flow looks like this:

  • Production box fails (and can’t be resolved with a simple rollback): demote production and promote the hotspare; clients can immediately get to work on VMs running on what used to be the hotspare box, with no reconfiguration
  • Site-wide failure: drag the offsite DR box into a construction trailer with a bunch of laptops from Wal-Mart; people can immediately work in production on VMs running directly on what was formerly offsite DR

On the other hand, if you build under-spec hotspare/disaster recovery boxes, your RTO (Recovery Time Objective) gets significantly longer, because you have to actually restore data from the backup box to a production-capable box before your users can get back to work.

Side note: I am available for direct, professional consultation and implementation, if you’d like more direct and/or private guidance and/or hand-holding. :slight_smile:


:scream: :stuck_out_tongue_winking_eye:

Thanks, yet again, for an additional explanation. Very helpful.

A majority of this is self-taught over the past 10-15 years and was just stuff I played around with at home. Over the past year, I’ve brought things into my small business. At the moment, there are 2.5 of us relying on the system, so my RTO requirements aren’t demanding. Until I get this dialed in, our most critical services still go through SaaS vendors, but I really want to move away from that toward open source software we host on-site.

I think I’ll work up to the triplicate setup.


So how do I go about removing the L2ARC from a pool?

I’d previously set it up that way, but after moving the pool, it’s now showing in a degraded state because I didn’t move the L2ARC. Which makes sense. But with the amount of RAM I have, I’d guess it wasn’t a problem.

Edit: After reading the man pages, is it as simple as zpool remove cache cachedisk?

Yep. It’s just that simple!


Just to put this into perspective: there’s nothing WRONG with having an L2ARC, it’s just that it’s usually better to add more RAM, as that will be more performant.

It’s a second-chance cache for things that fall out of the ARC, so your hit ratio on it will likely be very low.

If you’ve already maxed out your memory and the system doesn’t have that much overall, it can sometimes be beneficial to have the L2ARC, but if you have 64GB of RAM or more in a home use scenario, it’s probably unnecessary and won’t net you any read performance benefits.
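
If you want to check whether the ARC is already covering your working set before deciding, the raw counters live in /proc/spl/kstat/zfs/arcstats on Linux. This is just one rough way to eyeball it (arc_summary, if installed, gives a much friendlier report):

# approximate ARC hit ratio since boot, from the raw kstat counters
awk '/^hits / {h=$3} /^misses / {m=$3} END {printf "ARC hit ratio: %.1f%%\n", 100*h/(h+m)}' \
    /proc/spl/kstat/zfs/arcstats

# the L2ARC counters live in the same file; near-zero l2_hits means it isn't earning its keep
grep -E '^l2_(hits|misses) ' /proc/spl/kstat/zfs/arcstats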


Hmm, since I’ve moved my pool to another server that doesn’t have the cache disk in it and tried removing the L2ARC disk using the aforementioned command, I’m getting the following message:

cannot open 'cache': no such pool

Which does make sense due to the above. But the question remains, how do I remove it and/or fix it so it’s no longer in a degraded state?

First arg to zpool remove is the pool name, not the vdev type. Then if it’s not recognizing the name of the cache disk because it isn’t present, you can try using the GUID listed by zpool status -g.
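
Something like the following, with a hypothetical pool name and a made-up GUID standing in for whatever zpool status -g actually prints for the missing cache device:

zpool status -g my-pool                   # lists vdevs by GUID instead of device name
zpool remove my-pool 1234567890123456789  # remove the cache vdev by its GUID
zpool status my-pool                      # cache section gone, pool back to ONLINE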


Thanks! I had just learned about zpool status -g while trying to figure it out, right before seeing your response. I was getting ready to try removing it by GUID but didn’t want to screw anything up.

I had forgotten to include my pool name when originally running the command, so it might have worked initially had I done it correctly. In any case, removing it by GUID was successful, and the pool is no longer in a degraded state.

Still learning how to fully use OpenZFS, so it’s a process.

Thanks again for the help!
