Greetings and questions about Ubuntu zroot raid reconfig - single drive to four

Hello Everyone,

This is my first post & since I didn’t see an official thread to post an introduction on, I’ll do it here if that’s ok, then follow with my questions below.

Based in the UK, I’ve been working in IT for more years than I care to say, pretty much exclusively in the Windows ecosystem. I currently find myself unable to pursue my other interests while waiting for an injury to heal, and thought it would be fun to set up a Linux PC to experiment with and take me out of my Windows comfort zone.

To that end, I bought a reconditioned HP Z840 with dual E5-2697A v4 Xeons & 256GB of RAM. Storage is currently an HP Z Turbo Quad Pro card with 4 x M.2 NVMe slots. It’s populated with two 4TB Kingston Fury Renegade SSDs, with another two arriving next week. I also have an HP H240 SAS3 card in HBA mode connecting to four LFF SATA drives for some raidz1 fun and experimentation. I may replace them with SAS3 drives once funds permit; it all depends on how much I get out of this project.

So, on to my question…

I managed to blindly follow step-by-step instructions to install ZFS Boot Menu with Ubuntu on a single ZFS drive (one of the NVMes). This seems to work fine, although it does take a while before the ZFS Boot Menu appears and I also had to disable secure boot to get the PC to boot at all. I got the instructions from the ZFS Boot Menu site: Noble (24.04) UEFI — ZFSBootMenu 2.3.0 documentation

When the other two NVMe drives arrive, I’d like to set up all four of these SSDs in a raidz configuration. From what I’ve gleaned about this vast subject, a raidz2 or raidz10 config looks to be preferable, although if anyone feels differently, please feel free to say so. I’m here to learn, after all!

With a single drive already in use, is it actually possible to reconfigure the zroot pool to operate in either raid mode, or do I need a reinstallation?

If I do need to reinstall, I’d be looking at OtherJohnGray’s install scripts in conjunction with the instructions on the ZFS Boot Menu site, and deriving a set of steps from both. Thank you OtherJohnGray for making such a post, by the way. Although I don’t have need of it right away, it’s always nice when someone makes the effort to help noobs like me.

I think that’s about it for now, thanks for taking the time to read my ramblings and double thanks if you take the time to post!

Best regards to you all,

Shad

Hi Shad! Welcome to the exciting world of ZFS :slight_smile:

and congratulations on jumping in the deep end off the 10m platform on day one!

This might just be your BIOS taking a while to start up? I have an HP Z440 and it takes longer to start booting than other machines I have had. It’s not as bad as my HP ML110 though; that server’s BIOS takes forever.

The binary images from the ZBM website aren’t signed AFAIK. There are ways of signing things yourself if you really want to, and there are reasons for doing so, but I wouldn’t personally bother at this stage. There are many other Linux things more interesting to play with, and drive-by malware isn’t quite such an urgent risk on Linux (not that there aren’t exploits out there).
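If you ever do want Secure Boot back on, here’s a rough sketch of one common approach, assuming you’ve already enrolled your own signing certificate in the firmware; db.key and db.crt are placeholders for your own key material, and the path assumes the layout from the ZBM instructions:

# sign the ZFSBootMenu EFI image with your own enrolled key (sbsign, from the sbsigntools project)
sbsign --key db.key --cert db.crt \
  --output /boot/efi/EFI/ZBM/VMLINUZ.EFI.signed \
  /boot/efi/EFI/ZBM/VMLINUZ.EFI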

RAID nomenclature is a little different with ZFS. RAIDZ1/2/3 is mostly analogous to RAID5/6, albeit with striping organised at the file block level rather than strictly striped across disks as in traditional RAID. What would be called “RAID0”, i.e. striping, doesn’t really have a name in ZFS, as it’s simply a pool that has multiple VDEVs in it - data is always striped across VDEVs, weighted according to factors like the size and remaining capacity of the different VDEVs (so again a little more complex). “RAID1” is called a “mirror” and again exists at the VDEV level, so it’s therefore typical to have more than one mirror VDEV in a pool, i.e. the pool is striped across mirrors, which would be called “RAID10” in standard RAID terms.

Notably you can also stripe across RAIDZ VDEVs too (what would that be, RAID50?) and also mix and match VDEV types in the same pool - e.g. striping across a 2xMirror, a 3xMirror, a RAIDZ1 and a RAIDZ3 - although given the wildly different performance characteristics of the VDEV types, it’s unlikely anyone would do that other than for convoluted historical reasons.
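To make that concrete, here are a couple of hypothetical pool layouts; the pool name “tank” and the disk names are just placeholders, not a suggestion for your actual devices:

# "RAID10" equivalent: a pool striped across two mirror VDEVs
zpool create tank mirror disk0 disk1 mirror disk2 disk3

# "RAID50"-ish equivalent: a pool striped across two RAIDZ1 VDEVs
zpool create tank raidz1 disk0 disk1 disk2 raidz1 disk3 disk4 disk5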

There is also dRAID, which is similar to RAIDZ with the added benefit that spare capacity is distributed across all member disks for faster “resilvering” (recovery) after a drive failure - but this is an advanced setup that really only makes sense for large enterprise disk shelves with multiple hot-spare devices on standby.

@mercenary_sysadmin , founder of this site and the OpenZFS project’s long-time go-to guy for education and community outreach, has some very strong opinions about the superiority of striped mirrors (i.e. RAID10 equivalent) over RAIDZ (i.e. RAID5/6 equivalent) due to ZFS-specific details of the way each is implemented.

My takeaway from watching how people use these for a couple of years now is that RAIDZ is generally best reserved for bulk storage using a large number of disks (8+?)… think backup servers, video archives, and so on. Multiple two-way mirrors deliver much higher performance, recoverability, and flexibility, and don’t really have much of a space penalty compared to RAIDZ when using only a handful of mirrors. More knowledgeable ZFS users might disagree though?

With mirrors, it’s trivial:

zpool attach my-pool disk1p2 disk2p2

will turn your single disk1 VDEV into a mirror. Check the results with

zpool status

Then when you get the next two drives, you can partition them and then run:

zpool add my-pool mirror disk3p2 disk4p2

That will add a second mirror made of your ZFS partitions on disk3 and disk4. Any NEW data will be striped across both mirror VDEVs; existing data remains on the first mirror. This is not a problem from a space point of view, but it means you lose a little bit of theoretical read performance for that existing data, since it’s only being read from 2 drives instead of all 4.

There is a hack you can do to rebalance this: use zfs send to make a second copy of the datasets, which will be balanced across the VDEVs, then delete the old unbalanced datasets and rename the new balanced datasets back to the old names. For a root-on-ZFS setup like yours, you would need to do this from the live USB image again, as you would have a broken root filesystem during the migration.
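A minimal sketch of that rebalance, assuming a dataset named zroot/data - substitute your real dataset names, and do this from the live environment for anything under your root filesystem:

# snapshot and copy the dataset; the new copy gets striped across all current VDEVs
zfs snapshot -r zroot/data@rebalance
zfs send -R zroot/data@rebalance | zfs receive -u zroot/data-new

# once you're happy with the copy, drop the old dataset and rename the new one back
zfs destroy -r zroot/data
zfs rename zroot/data-new zroot/data
zfs destroy -r zroot/data@rebalance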

Now, with RAIDZ, I don’t know of any way to convert a single drive into a RAIDZ vdev, although again more knowledgeable users might. This doesn’t mean you would need to lose your installation though - you could just add a small extra disk (even a USB thumb drive would do), create a single-drive zpool on it, and “zfs send” your painstakingly created datasets to it for safekeeping. Then wipe the ZFS information from your disk with

zpool labelclear -f disk1p2

Then partition your other 3 SSDs, and create a new pool with a single RAIDZ vdev made out of the 4 partitions. You can then “zfs send” your datasets back, copy zfsbootmenu into any EFI partitions you made on the other 3 disks for redundancy, and use efibootmgr to register all of them with the UEFI firmware - roughly as sketched below.
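A rough sketch of that, run from the live environment, assuming a temporary pool called “backup”, the same disk1p2/disk2p2-style placeholders as above, an ESP as partition 1 on each disk, and the EFI/ZBM/VMLINUZ.EFI layout from the ZBM instructions (adjust names, and any pool or dataset properties, to match your install):

# park the datasets on the temporary single-disk pool
zfs snapshot -r zroot@migrate
zfs send -R zroot@migrate | zfs receive -u backup/zroot

# after the labelclear and repartitioning, recreate the pool as a 4-wide RAIDZ1 and restore
zpool create -o ashift=12 zroot raidz1 disk1p2 disk2p2 disk3p2 disk4p2
zfs send -R backup/zroot@migrate | zfs receive -u -F zroot

# register a boot entry for the ZFSBootMenu copy on each disk's ESP (repeat per disk)
efibootmgr -c -d /dev/disk2 -p 1 -L "ZFSBootMenu (disk 2)" -l '\EFI\ZBM\VMLINUZ.EFI'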

Those “scripts” (actually there are manual steps in there too, watch out!) are just for building a custom zfsbootmenu with SSH access and/or a bleeding-edge version of ZFS. Unless you really need remote SSH access during boot right now, I wouldn’t bother making your own custom zfsbootmenu image, and would stick with the EFI image you already downloaded from the zfsbootmenu website.

P.S. - hope you get well soon!

There isn’t, but there is a way to get everything moved from a single disk into a RAIDz vdev that will include the original disk eventually.

What you do is put in your new disks, then create a new pool and vdev which uses the new disks–but you create it degraded, so that it’s missing a disk.

# truncate -s 12T /oldpool/fakedisk.raw
# zpool create -o ashift=12 newpool raidz2 new0 new1 new2 new3 new4 /oldpool/fakedisk.raw
# zpool offline -f newpool /oldpool/fakedisk.raw
# rm /oldpool/fakedisk.raw

What we did there is create a sparse 12T file to act as a fake disk. (We created that on the old pool, because most tmpfs won’t accept a single file that large, but ZFS definitely will.) No matter how large, at creation a sparse file only occupies a few bytes of metadata space–it only grows on-disk as data is actually written to it. So you can create a 12T sparse file on a 90% full 12T disk without issues!

Now, we create a new pool with a single RAIDz2 vdev, six disks wide–the five new disks we just bought and installed physically, and our fake disk that we just created. This will cause a few kilobytes of data to get written to our sparse file, but no more than that.

Next, we offline our fake disk, putting the new RAIDz2 vdev into DEGRADED status but still functional. Then we destroy the fake disk entirely.

Now you can move all your data from the old pool to the new pool, scrub the new pool, then destroy the old pool and use zpool replace to swap that disk in for the fake disk we faulted out earlier. And Bob’s your uncle!
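A minimal sketch of those final steps, with placeholder dataset and device names ("old0" standing in for the original physical disk); if zpool replace won't accept the file path of the missing fake disk, use the GUID that zpool status shows for it instead:

# copy everything over, then verify the new pool
zfs snapshot -r oldpool@migrate
zfs send -R oldpool@migrate | zfs receive -u newpool/old
zpool scrub newpool

# free up the original disk and resilver it in where the fake disk used to be
zpool destroy oldpool
zpool replace newpool /oldpool/fakedisk.raw old0
zpool status newpool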


Nice!

I’m guessing that the degraded pool in this example still writes 4 data blocks but only 1 parity block per record, and adds the 2nd parity block when the original disk is resilvered in, thus preserving the same space efficiency that a non-degraded 6-wide RAIDz2 would have?

Correct. It behaves precisely as a normal degraded RAIDz vdev would, which is just the way you describe.

If you want to get seriously pedantic about it, I think it actually just doesn’t write whichever sectors would have landed on the missing drive, whether those sectors happen to be parity or data. Which in turn means that you’ll actually be reconstructing reads from parity for one of every (number of disks in non-degraded vdev) reads.

But the net effect is what you’re describing: a vdev which has been degraded in the past, then had the degraded drive replaced and successfully resilvered, doesn’t look or behave any differently from a vdev of the same topology and width which was never degraded at all.


Hello OtherJohnGray and mercenary_sysadmin!

I must apologise for being so tardy in replying to such a nice welcome from you both. It’s just been one of those weeks where I haven’t had much of a chance to do anything with computers at home.

I have been able to add the new nvme drives though, and they seem to be functioning OK.

I think the weekend will be less hectic, so I’ll put together a reply that’s worthy of the time and effort you’ve put into yours!

Heartening to read that either of the approaches I asked about should be doable.

Thanks again for such a warm welcome :blush:,

Shad.


Hello Gents,

I’ve been perusing your posts in more detail and I think I’d like to go with the RAID10-style setup. @OtherJohnGray I’m basically plagiarizing the info you provided!

To summarize: I have four 4TB NVMe drives, and the first is in a zpool called zroot.

  pool: zroot
 state: ONLINE
  scan: scrub repaired 0B in 00:00:09 with 0 errors on Sun Oct 20 17:38:58 2024
config:

        NAME         STATE     READ WRITE CKSUM
        zroot        ONLINE       0     0     0
          nvme2n1p2  ONLINE       0     0     0

errors: No known data errors

To allow for an easier setup, I’ve set up aliases for all four NVMe drives.

> # My vdev_id.conf file for saving into /etc/zfs/
> # adds aliases for the nvme drives i have installed
> # run "udevadm trigger" to update the /dev/disk/by-vdev/ list each time this file is saved
> # Last modified 2024-10-20 21:00
> 
> # Slot 0 of HP Z Turbo Drive Quad Pro (what a name!)
> # This is the drive with the OS on it.
> alias nv01    /dev/disk/by-id/nvme-KINGSTON_SFYRD4000G_50026B7686CFDABA
> 
> # Slot 1 of the card
> alias nv02    /dev/disk/by-id/nvme-KINGSTON_SFYRD4000G_50026B7686D30F19
> 
> # Slot 2 of the card
> alias nv03    /dev/disk/by-id/nvme-KINGSTON_SFYRD4000G_50026B7686E80D6E
> 
> # Slot 3 of the card
> alias nv04    /dev/disk/by-id/nvme-KINGSTON_SFYRD4000G_50026B7686E80DB5

This is what /dev/disk/by-vdev/ shows.

lrwxrwxrwx 1 root root 13 Oct 20 21:20 nv01 -> ../../nvme2n1
lrwxrwxrwx 1 root root 15 Oct 20 21:20 nv01-part1 -> ../../nvme2n1p1
lrwxrwxrwx 1 root root 15 Oct 20 21:20 nv01-part2 -> ../../nvme2n1p2
lrwxrwxrwx 1 root root 13 Oct 20 21:20 nv02 -> ../../nvme3n1
lrwxrwxrwx 1 root root 13 Oct 20 21:20 nv03 -> ../../nvme1n1
lrwxrwxrwx 1 root root 13 Oct 20 21:20 nv04 -> ../../nvme0n1

If (big if) I understand correctly, I can use ‘attach’ to mirror the existing drive.

zpool attach -w -o ashift=12 zroot nv01 nv02

This should resilver nv02, and then I’ll scrub the zpool.
Next, I need to create a second mirrored set out of nv03 and nv04 and stripe it with the first pair. I think this command will do it.

zpool add -o ashift=12 zroot mirror nv03 nv04

Questions…
If I have this right, this new mirrored pair will then stripe with nv01 and nv02. Data written will then stripe between nv01 and nv03 and this will mirror to nv02 and nv04?

What do you think gentlemen?

Best regards and thank you for your help,

shad

A quick update, very happy to report it all went well!

  pool: zroot
 state: ONLINE
  scan: scrub repaired 0B in 00:00:11 with 0 errors on Tue Oct 22 00:12:33 2024
config:

        NAME           STATE     READ WRITE CKSUM
        zroot          ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            nvme1n1p2  ONLINE       0     0     0
            nv02       ONLINE       0     0     0
          mirror-1     ONLINE       0     0     0
            nv03       ONLINE       0     0     0
            nv04       ONLINE       0     0     0

errors: No known data errors

Thanks for all your help!

shad
