Mirror pool expansion -- opinions sought

TL;DR: I have a mirror vdev; what would you do to expand it?

:floppy_disk:

I have 2× 4 TB hard drives in a mirror, which is 54% full. I am planning for the future, so I have time to consider options and take advice. Once my pool gets to 85% (which could take 2-3 years), what would be best? This server is turned on once a week to run a backup.
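For anyone following along, capacity like this is easy to check (pool name "tank" is a placeholder - substitute your own):

```shell
# Show total, allocated, and free space plus percent used for the pool
zpool list -o name,size,allocated,free,capacity tank
```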

Option 1 – add two more drives and create another vdev

  • mirror0
    • disk 1
    • disk 2
  • mirror1
    • disk 4 (new)
    • disk 5 (new)

Option 2 – expand (replace) existing pool.

Purchase two 5 TB drives or higher.

Option 3 – create a separate pool and balance datasets across

  • pool

    • mirror0
      • disk 1
      • disk 2
  • pool2

    • mirror0
      • disk 3 (new)
      • disk 4 (new)

Option 4 – buy one 10 TB WD Easy Store external hard drive and format to exFAT.

Pros / Cons

Option 1
I get additional capacity; however, if the “wrong” two drives die I will lose the entire pool - a higher risk of loss. The advantage is that I can keep using my existing hard drives and only pay for two more.

Option 2
Costs more money to purchase larger drives, and each vdev is then stuck at the higher capacity, so I need to commit to buying larger drives upon failure. I could replace failed drives with larger-capacity ones when/if they die, as a slow upgrade path. Keeps things as they are, which is easier to manage, with fewer hard drives to monitor.

Option 3
Need to manually rebalance datasets - could be less space efficient. But there is less risk of the entire pool being lost if two of the wrong drives die. Also, with datasets spread out, if one pool has a problem there is less data (and time) to restore from backup. On the other hand, managing data across two separate pools adds complexity, which could itself raise the risk of data loss.

Option 4
Everything in one place on one drive, stored on a rock-solid file system - no disadvantages here.

I am not asking with an answer already in my head, hoping for an echo chamber; I am genuinely puzzled. I lean most strongly towards options 1 and 3. Given that I am using 4 TB drives, there are plenty of higher-capacity drives I can move to, but larger drives mean more time to resilver and scrub as data grows on one pool.

I have recently rebuilt my backup server to allow 6 drives (space in the case, and a server motherboard with 6 SATA ports, 4 of them 6.0 Gbps, plus 32 GB of ECC RAM - my first time using ECC, so I’m excited about that). It took me a while to find the right case.

All data on this pool is backed up to another drive (×2) and in the cloud (×2); this is in addition to the “primary data” stored on a network drive.

Thank you.

I’m a fan of option #1. Besides the added resiliency, there are performance gains to be had from multiple mirrors.

If you’re concerned about the data being balanced across the mirrors, check out this write-up from @mercenary_sysadmin a few years ago.

1 Like

For me? Option 2 - No question.

Those existing drives won’t last forever, and even if your pool never grows so large that you need more capacity, swapping them out in a few years would be wise.

Adding more disks only adds more opportunities for failure, more power use, and more complexity.

Drives continue to cost less over time per GB, so it’s likely that when the time comes you’ll be able to get new disks for the same money you paid for the drives you have now.

I have spent $200/disk for … decades. First they were 80GB, then 250GB, moved to 1TB, 4TB, 8TB, and I’m on 14TB and 16TB disks now.

1 Like

Agreed. I’ve done this repeatedly. My suggestion:

When the pool hits 65%, buy something larger and replace an existing drive. (At this point the drive that was replaced can serve as a cold spare.)

When the pool hits 85%, buy something larger and replace the other drive. Depending on how fast the pool contents are growing you can match the larger drive or go bigger yet to prepare for the next increase.

I do this over a period of years with a 5 drive RAIDZ2 pool.
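The replace-as-you-go approach above might look something like this (pool and device names are placeholders; on Linux, stable /dev/disk/by-id paths are generally preferable to /dev/sdX):

```shell
# Allow the pool to grow automatically once all vdev members are larger
zpool set autoexpand=on tank

# Swap one mirror member for the new, larger drive and wait for the resilver
zpool replace tank /dev/disk/by-id/old-drive-1 /dev/disk/by-id/new-drive-1
zpool status tank   # wait until the resilver completes

# Later, repeat for the second member; capacity expands once both are replaced
zpool replace tank /dev/disk/by-id/old-drive-2 /dev/disk/by-id/new-drive-2
```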

1 Like

There’s no one size fits all answer. I typically add a new mirror vdev if I’ve got open bays and none of my existing drives are so old they need to be taken out behind the barn and given the final mercy.

With that said, you do need to think about how old your drives are, and also about how solid your backup routine is. If you’ve got solid, tested backups and you’re confident in your ability to restore data from them, you can run the drives in prod as long as you like, understanding that you’re risking downtime, not data loss.

But if you’re not super confident in your backups, your restores, and the FREQUENCY of your backups (e.g. I back up hourly, not just daily, let alone every few weeks or “when I think about it”), then you have to get a lot more tyrannical about pre-fail replacements. If you view restoring from backup as anything scarier than “another day at the office,” you really shouldn’t run a mechanical HDD much past a decade, regardless of how healthy it appears.

So, this brings us back to the original question: add a new mirror vdev, or replace the drives in the first one? Well, if you have full confidence in your backups, there isn’t much question: add a vdev, if you’ve got empty bays, and don’t start replacing older vdevs with higher capacity drives until you either run out of bays or start seeing actionable problems with the older drives.

Similarly, if you’re keeping track of the age and health of your drives: add if you’ve got open bays, replace if not.

But, finally, if you’re not particularly confident in your ability to manage the system effectively, then sticking with a single mirror vdev might in fact be your best bet. Less to manage, less to go wrong. Also less performance, but hey… slightly less power, also, so it’s not all downsides!

1 Like

I’d go with Option 2 as well - replace the existing 2 x 4TB with whatever you can afford at that time. Let’s say it is 2 x 10TB (just as an example).

Try to scrape together another system where you can plonk the 2 x 4TBs as your backup pool. Replicate a subset of the 2 x 10TB into the 2 x 4TB pool (mostly the critical stuff, I’m sure you can triage through the datasets to pick this out).

This will help you in multiple ways:

  1. Yourself - i.e. self-induced stupidity and errors. Data errors are trivial to fix thanks to ZFS snapshots. But you’ll be surprised at pool-level mistakes that can’t be undone, and poof - everything is now either gone or becomes too complex to fix/roll back.
  2. Having a separate backup system that is not operated upon regularly (other than a zfs send) is a blessing in many ways - it insulates you from external issues at the system level and can be quite comforting to know that worst case you can recover from something catastrophic to your primary system to at least a reasonable level of functioning.
  3. You can also use the 2 x 4TBs as an experimental pool to test out things like compression, reorganizing things (buffer/scratch space!), and as a general scratch area (including if you’d like as a stripe for that extra performance and scratchiness without data safety, for which you always have the primary pool anyway).
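The periodic replication in point 2 might be sketched like this (pool, dataset, and snapshot names are all placeholders):

```shell
# One-time full send to seed the backup pool with the critical datasets
zfs snapshot -r bigpool/critical@backup-1
zfs send -R bigpool/critical@backup-1 | zfs receive -F backuppool/critical

# Later: incremental send of everything since the last common snapshot
zfs snapshot -r bigpool/critical@backup-2
zfs send -R -I bigpool/critical@backup-1 bigpool/critical@backup-2 \
    | zfs receive -F backuppool/critical
```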

I hope my verbosity convinces you :smile:

2 Likes

Thanks for all your input; it was nice to hear others’ ideas. I have bookmarked this for later reference.

I am still not sure what to do. In a situation like this, I think I will do nothing until I get to 65% - that feels like a good capacity at which to start reviewing things. Then I need to make a decision and just go with it.

Backups are good and tested, so I am confident the data can be restored; buying more lower-capacity drives to add another vdev therefore seems more appealing, but I will reassess later.

Thank you.

Yup, that’s a good time to put together a plan and be ready to execute. :+1:

1 Like

:warning: Trigger warning :warning:

Right, I have made a decision and just gone with it. I decided to go with raidz1. I like the cost saving per GB of capacity from only needing three drives, and if I lose two drives it would be annoying, but I do have backups. I had previously opted for the mirror vdev when the pool was in a previous PC with limited space and SATA ports. Now I have a large case and a server motherboard with 6 SATA ports (4 of them 6.0 Gbps). Processing power is not a problem, so I’m happy to absorb the extra parity calculations.

By the way, I am likely to be banned from the forum for something I do with the pool. Forgive me. :policeman:

Started with:

* mirror-0
  * Hard drive 1, 4 TB
  * Hard drive 2, 4 TB

Firstly, I removed one of the hard drives from the mirror vdev (let’s say hard drive 1). I now have a spare 4 TB hard drive. I formatted it into two partitions of roughly 2 TB each – you can bet I checked the drive serials five times! Before I added this drive to the mirror pool months ago, I used badblocks to check all sectors – and it passed – so I was confident in its health. I had a little look through SMART too, just in case.
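The detach-and-partition step might look roughly like this (pool name and device paths are placeholders; double-check serials before touching anything). Note that using a GPT label from the start sidesteps the 2 TiB limit of a dos/MBR partition table:

```shell
# Remove one member from the mirror, leaving the pool running on one disk
zpool detach tank /dev/disk/by-id/hard-drive-1

# Repartition the freed 4 TB disk into two ~2 TB halves (GPT)
parted --script /dev/disk/by-id/hard-drive-1 \
    mklabel gpt \
    mkpart zfs1 0% 50% \
    mkpart zfs2 50% 100%
```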

I have a spare WD Red Pro 2 TB hard drive that I had been using for cold backup, but I no longer need it for that, so I used it to expand the new pool – let’s call this hard drive 3. I bought it brand new from the WD Store three years ago; it has two years’ warranty left and low usage. Why was a Red Pro being used as cold storage, you ask? Well, it was in a pool but was superseded by a 4 TB hard drive.

I created a new pool:

* raidz1-0
  * Hard drive 1, 4 TB - part 1
  * Hard drive 1, 4 TB - part 2
  * Hard drive 3, 2 TB

I offered a sacrifice to the hard drive gods, then commenced the ZFS send process; the sacrifice consisted of burning an SSD whilst dancing around the burning plastic chanting “spinning rust, spinning rust” over and over until the SSD had melted into an unrecognisable ball of plastic. Satisfied the gods had been adequately appeased, I continued.

I wanted to retain settings and snapshots so I used -R and --raw (these datasets are encrypted).
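A sketch of that migration (pool and snapshot names are placeholders; --raw sends the encrypted datasets as-is, without needing to load keys on either side):

```shell
# Snapshot the whole pool recursively, then replicate it - -R preserves
# datasets, properties, and snapshots; --raw keeps encryption intact
zfs snapshot -r oldpool@migrate
zfs send -R --raw oldpool@migrate | zfs receive -F newpool
```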

Once completed, I ran a scrub to be sure. It seemed slow and looked like it would take many days. Unfortunately, the pool had become 96% full, so I deleted a dataset I can easily restore and whose snapshots I did not care about. The scrub process seemed to become happier.

At this point, I still have all data on one hard drive (in the original pool), and now on the new pool as well (plus backups elsewhere). As you can see, I can only tolerate the loss of the 2 TB hard drive, because if I lose the 4 TB hard drive (split into two partitions) I will lose the entire pool – two of the raidz1’s three members live on it.

After the scrub comes the tricky part. I took the original pool offline and cleared the disk, after first verifying the sizes of the copied datasets. I used this 4 TB hard drive to replace part 2 – no reason for choosing this part first, I just went with it. Four hours later the resilver had finished.
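That swap might look something like this (names are placeholders; labelclear assumes the old pool has already been destroyed or exported):

```shell
# Retire the old single-disk pool and wipe its ZFS labels
zpool destroy oldpool
zpool labelclear -f /dev/disk/by-id/hard-drive-2

# Substitute the freed whole disk for the second 2 TB partition
zpool replace newpool /dev/disk/by-id/hard-drive-1-part2 \
    /dev/disk/by-id/hard-drive-2
zpool status newpool   # watch the resilver
```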

* raidz1-0
  * Hard drive 1, 4 TB - part 1
  * Hard drive 2, 4 TB # new drive added
  * Hard drive 3, 2 TB

I am safe now :ring_buoy: :white_check_mark: My redundancy is now spread across three separate hard drives, so the pool can survive any single drive failure.

However, the pool is still at 96% capacity. This is because part 1 still only covers half of the 4 TB hard drive, leaving empty space unused.

Here comes the clever bit. I deleted partition 2 and went to resize partition 1, but I had problems: using fdisk, I could not resize the partition from 2 TB to 4 TB. After many attempts I gave up and went to bed to sleep on it. This morning I worked it out. The partition table was dos (MBR), which limits the partition size, so I converted it to gpt. After that, I went through the process of deleting the partition and creating a new one at full size, then imported the pool and, voilà, the pool capacity increased. For this to work, I had to make sure autoexpand was enabled, which is off by default.

zpool set autoexpand=on [pool]
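The whole expansion sequence can be sketched like this (device and pool names are placeholders; `zpool online -e` is an alternative to exporting and re-importing the pool for triggering the expansion):

```shell
# Grow the partition to cover the full disk (GPT label assumed)
parted --script /dev/disk/by-id/hard-drive-1 resizepart 1 100%

# Let ZFS pick up the new space on that vdev member
zpool set autoexpand=on newpool
zpool online -e newpool /dev/disk/by-id/hard-drive-1-part1
```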

* raidz1-0
  * Hard drive 1, 4 TB - part 1
  * Hard drive 2, 4 TB
  * Hard drive 3, 2 TB

Total pool size: 5.45 TiB
Percent used: 52%

I wanted to do this now, before these datasets grow too much over the coming years. I researched the pros and cons of raidz1 vs. mirrors and feel happier with this configuration, as adding more mirror vdevs felt clunky, especially considering the data would not be balanced. Some say the “part 1” label is ugly, but I think it’s beautiful - it’s a reminder of the success of converting from mirror to raidz1.

I love ZFS :two_hearts:

2 Likes

For future reference: it probably makes more sense to create your initial raidz1 degraded with a missing drive, then add the drive later, rather than doing the temporary “two partitions on one drive both in the same pool” thing.

It certainly would have scrubbed a lot quicker! :cowboy_hat_face:

1 Like


Of course - ZFS continues to amaze me. I thought perhaps ZFS would refuse to build a raidz1 without three disks. I should have tested that first.

And yes, you’re right about scrub times: the mirror took 4 hours to scrub; the raidz1 takes 1h44m!!

ZFS scrub time depends on the pool’s used space: it takes seconds when empty, up to endless hours when full and in use. E.g. a replace took 32h for a 16TB HDD in a 4-vdev, 6-HDD-per-vdev raidz2 pool (24 HDDs) that was only 32% full but evenly used.

1 Like

There are several ways to skin that particular cat. There’s an argument to create a vdev as explicitly degraded, but I can never remember it off the top of my head–so I usually use a sparse file as a placeholder, and immediately remove it.

root@box:/# truncate -s 4T /tmp/tmpdisk.raw
root@box:/# zpool create -o ashift=12 poolname raidz1 disk0 disk1 /tmp/tmpdisk.raw
root@box:/# zpool export poolname
root@box:/# rm /tmp/tmpdisk.raw
root@box:/# zpool import poolname

Like I said, there are more “elegant” ways to do this–but that one works perfectly well, and I can remember it without needing to look any obscure arguments up, so… :cowboy_hat_face:
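For completeness: when the real third disk arrives, the missing placeholder member would be swapped out like so (names are placeholders):

```shell
# The pool imports DEGRADED with the placeholder listed as UNAVAIL;
# point zpool replace at the old file path and the real disk
zpool replace poolname /tmp/tmpdisk.raw /dev/disk/by-id/new-disk
zpool status poolname   # the resilver brings the pool back to ONLINE
```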

2 Likes

Thanks, I have noted this for future reference. I am pretty static with my ZFS pools: I had the same pool (the mirror one) up and running for 6 years (both disks changed a few times). I started with BTRFS (sorry for mentioning that on here) but was not keen, then found ZFS – and the rest is history.

I’m pretty “static” with mine too; thing is I manage a lot of them, so weird shit does come up occasionally despite my best efforts to avoid “weird shit.” :slight_smile:

No need to feel bad about starting with btrfs. I was a btrfs cheerleader once, myself. Then I learned one too many lessons in one of the many common “hard ways” a btrfs user might learn them…