ZFS Pool of Mirrors vs Pool of Raidz?

Greetings,

I’m an avid listener of 2.5 Admins, and I’m at least one tenth of an admin myself.

I realize this is a niche within a niche, but I want to bring a small “NAS” with me on the road in my RV.

I plan on using some sort of mini chassis with NVMe. I realize this isn’t going to saturate a 10Gb connection and I don’t care. It doesn’t need to be fast. (I’m fully expecting that even with a decent consumer mobo I’ll be lucky to get a combination of six single-lane NVMe sockets between the mainboard and some sort of expander card. But hey! It slightly beats hanging storage off of USB! :wink: I think, anyway. Ok, maybe 1/20th of an admin.)

It’s mostly for some local Plex / Jellyfin movies to not strain the Starlink as much, and to have network storage for the laptop and a mini desktop.

I read Jim’s infamous article (below) and now I’m conflicted. I was all set to build one 3x4TB array and eventually expand to another.

But this article is suggesting (ok, pretty definitively declaring) that I’m better off with three 2x4TB mirrors. I guess I’m only “losing” like 2-3TB with the three-mirror config, and I’m gaining a lot, like less scrubbing and resilvering wear if (when) I lose a drive.

I was just curious about thoughts on the whole thing. :slight_smile: I don’t mind having three separate 4TB pools to avoid taking down the whole pool if I were to join them together instead.

I’m also reading through the ZFS Mastery books so I’ll probably get a lot better understanding over the week or so as well.

Just trying to learn here and help future me not loathe past me for configuring a terrible storage pool.

Given that article is from 2015, I don’t know how much of the RAIDZ concerns apply to NVMe drives. It might be worth looking for some benchmarks on RAIDZ resilvering. I’m sure it’s still slower than mirrors… but perhaps the increased speed compared to HDDs makes up for it?

I’ll say in my HDD based pool I’m very happy to have done mirrors.

One note from the article text:

“Exactly how the writes are distributed isn’t guaranteed by the specification, only that they will be distributed.”

This probably isn’t defined by the spec, but in practice data is written based on the free space within each vdev. So if you have two equal sized 4TB mirrors, they’ll fill at the same rate. If you start with one mirror, and later add a second 4TB mirror, then more writes will go to the second mirror since it has more free space.
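
You can watch this allocation behavior yourself with throwaway file-backed vdevs; a quick sketch (pool and file names made up):

truncate -s 1G /tmp/m0-a.raw /tmp/m0-b.raw /tmp/m1-a.raw /tmp/m1-b.raw
zpool create demo mirror /tmp/m0-a.raw /tmp/m0-b.raw mirror /tmp/m1-a.raw /tmp/m1-b.raw
dd if=/dev/urandom of=/demo/blob bs=1M count=256
zpool list -v demo    # the per-vdev ALLOC column shows how the writes were spread
zpool destroy demo ; rm /tmp/m0-*.raw /tmp/m1-*.raw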

I may be misreading how many drives you’re looking at, but with only 3 drives I’d be tempted to avoid RAID and mirrors at first. How common are failures like this on NVMe drives anyway, especially if they don’t have a ton of lifetime writes? If you treat them as separate drives or pools, then in the case of a major failure you would likely still have one or two functioning drives and still have some media for your journeys!

That’s an excellent point about the article. I HAVE been trying to learn, but a lot of what I’m finding is pretty old. I’ll definitely look for some evidence and info about resilvering NVMe drives. And you have a really good point about endurance and failure rates for NVMe at a core level.

I was wanting to learn zfs and this felt like a good project for it. But thinking through it again this may not even be a good use case for zfs. Maybe just treating them as separate drives is good enough and if I lose a drive I replace it and toss data back on it (I can always re-download the ISOs).

I’ll have offsite and cloud backups of actually important or irreplaceable data regardless.

So, my original thought was to maybe bite the bullet, buy 6x4TB up front and fill it all up to begin with, and toss it in a six-wide RAIDZ1 (I believe ~83% capacity with the ability to lose one drive). But, the “additional drive failures during resilvering” notes have me nervous. :smiley:

So, I was then looking at perhaps breaking that into two vdevs: two three-wide RAIDZ1 (~66% capacity, each able to lose one drive). The resilvering concerns are still there, but it does give me the ability to break the drive purchase into two phases and add the second three-wide vdev later.

I’m also a bit concerned that scrubbing a RAIDZ might wear the NVMe drives badly too; and I’m planning on using “decent” consumer-grade gear here, like Crucial NVMe drives, so their write endurance is unlikely to be stellar to begin with. (Again, this may not be a good use case.)

So this then got me thinking: well, maybe 3 two-drive mirrors is my best setup. I’d only be at 50% capacity, but that is “good enough”. And it would also allow me to break up the purchases into three phases (I don’t care if it ends up being three separate pools; it gives me more admin practice with zfs.) :wink:
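
For a rough side-by-side at 6x4TB (24TB raw), before ZFS overhead: one six-wide RAIDZ1 gives 5/6 × 24 ≈ 20TB usable, two three-wide RAIDZ1 give 2/3 × 24 = 16TB, and three two-way mirrors give 1/2 × 24 = 12TB.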

And this leads me to wonder: maybe this isn’t even a good use case for zfs? I can just as functionally have six separate 4TB NVMe drives and treat them the way I always have. :sweat_smile:

It’s a fun learning experience already, to be 100% honest. And maybe zfs is better saved for when I build a “real” NAS in the homelab after we build our home next year.

If I’m understanding your last comment, it partially addresses a concern I didn’t even mention. I had considered just using them as “stripes” (I think that’s the term): individual drives, single-disk vdevs with no redundancy, all in one pool. But I read that a single failure would wipe out the whole pool.

BUT, it sounds like the solution there is just not pool them! (Which is simple, and I didn’t even think of it.) :smiley:

So, essentially I’d have six pools of one drive each, with 100% capacity and no fault-tolerance. I don’t think this is terrible. It’s more to manage, but that experience is exactly what I want anyway. And it eliminates all my concerns.

AND, I suppose if I wanted to, I could take two of the drives and mirror them, so that I at least get to experience a multi-disk pool in true zfs glory, and have five total pools, most with single disks.
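
In zpool terms that would be something like this (a sketch; the device names are whatever the board enumerates):

for i in 0 1 2 3 ; do zpool create media$i /dev/nvme${i}n1 ; done   # four single-disk pools
zpool create mpool mirror /dev/nvme4n1 /dev/nvme5n1                 # plus one mirrored pool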

If your backup situation is robust enough, then worries about subsequent drive failures become a hassle rather than a disaster. Lose a pool? Meh, restore (easily) from backups.

My media server is 2 pools of 5x 12TB drives each in raidz1. I have multiple backup systems - two wake up every night and zfs send/recv everything over, the others I do manually once a week or so.

One of those is a frankenstein box with 3 individual target pools. I use custom dataset properties on the media datasets to determine which target pool to back up to. The backup boxes use older/smaller drives that would otherwise be sitting on a shelf. One box is in the basement cold room, another in the garage. Not offsite, but not right next to the media box either …
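
The mechanism is just ZFS user properties (any property name containing a colon is user-defined). A rough sketch with made-up names; this shows a first-time full send, incrementals would use zfs send -i:

zfs set backup:target=tank2 media/movies      # tag the dataset with its target pool
for ds in $(zfs list -H -o name -r media); do
  tgt=$(zfs get -H -o value backup:target "$ds")
  [ "$tgt" = "-" ] && continue                # skip untagged datasets
  snap=$(zfs list -H -t snapshot -o name -s creation -d 1 "$ds" | tail -1)
  zfs send "$snap" | zfs recv -F "$tgt/${ds#media/}"
done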

Power usage is minimal - they’re only ever on for a short period at 3am.

Literally any spare box you have that you can cram drives into will work. That frankenstein box is an ancient Athlon with 4GB of RAM and an HBA feeding a 24-drive expander. Fast isn’t in the vocabulary, but it does the trick, which is having the data in a second place. Not offsite, but separate HW.

One nice thing about 2 main pools, rather than both 5x 12TB raidz1 vdevs in a single pool, is that the splash damage is smaller - one pool failing doesn’t kill everything. I can export and remove one 5x pool, pop in a stack of new drives, and locally zfs send/recv to create another backup. Much faster than over the network. That’s how I pre-populated the various pools in the backup boxes.

Note - this all fits MY use case, which is primarily mass-storage. I don’t need blazing IOPS. Things may (will) be different for others …

BTW, zfs snapshots and backups have saved my a** a few times. I have snapshots going back 12 months, and once had to go back about 6 months to recover stuff that was automagically removed by Sonarr losing its mind. 6 months to notice because we had already watched the show …

My laptops each have 2 nvme slots and I use root-on-zfs. One nvme is the main system, the other is the scheduled snapshot target every 15 minutes. Plus they zfs send/recv to backup boxes when they can see them on the local network.

Being able to roll back the whole system after an apt-get upgrade fails, or roll back the home dir for whatever reason, is a godsend. Even just using httm to pull a specific snapshot version of a file is cool.
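
The rollback itself is a one-liner; a sketch with a hypothetical root dataset name:

zfs snapshot rpool/ROOT/ubuntu@pre-upgrade     # taken before the risky change
apt-get update && apt-get upgrade
zfs rollback -r rpool/ROOT/ubuntu@pre-upgrade  # if it goes sideways; -r also discards newer snapshots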

Storage is cheap. Losing data sucks. And you probably have old motherboards, power supplies, drives, etc. right there ready to be used …

GitHub - Halfwalker/ZFS-root (set up root-on-zfs using whole disk, with dracut and zfsbootmenu) is my own opinionated root-on-zfs setup. Used on literally dozens of systems.

I ended up sticking with RAIDZ2 to survive 2 disks failing (which actually happened once). It fits my use case: primarily media storage and streaming with Plex, plus some general file serving (where critical files get backed up to the cloud, with a separate copy on my OneDrive). I have too much media for backing up, so I prioritized indestructibility instead. The server’s been running 24/7 for like 6 years, so this strategy seems to be OK :slight_smile: As I understand it, two vdevs are great if your priority is IOPS, but if you’re just doing it for a private media server, you don’t really need that kind of speed (even with multiple people streaming at the same time).

RAID will not protect you from data loss — full stop (snapshots included).
What RAID actually does is reduce downtime. The higher the RAID level (and the more you invest), the more time you have to replace failing hardware before users start complaining.

In the end, the right solution depends on your priorities and budget:
• Backups matter most: The more independent copies of your data you keep elsewhere, the higher your chances of recovery.
• RAID helps with uptime: Higher RAID levels won’t magically protect your data — they only buy you extra time before a major outage.

Personally, I believe that unless you have a very specific IOPS requirement (which doesn’t seem to be the case here), worrying about performance with NVMe drives is just solving a problem that doesn’t really exist. You are not Netflix, right?

I don’t care at all about performance. The NAS this is going to be in won’t be able to pull from its tiny pool of PCIe lanes fast enough to saturate Ethernet. (Probably.)

My main concern is additional wear on the NVMe drives, particularly from scrubs and potential (likely inevitable) resilvering if/when a drive fails.

I think I’m going to order 3x4TB in RAIDZ1 as a learning experience, and probably eventually add another separate 3x4TB (or whatever is reasonable storage cost) in the future.

If you grow by pairs, you can add additional mirrored vdevs to a pool. You won’t get the highest level of performance (because new data will be concentrated onto the new drives), but that really shouldn’t matter for your situation. However, it is nice to have a big pool so you don’t have to worry about manually moving data around.
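
When the time comes, growing by a pair is a one-liner (hypothetical pool and device names):

zpool add tank mirror /dev/nvme2n1 /dev/nvme3n1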

The checksums on data certainly are a benefit!

If you do that, a single drive failure will take out the whole pool. That sort of setup is only worth it where something higher in the stack handles redundancy, or the drives are used as scratch disks where failure doesn’t matter at all.

The only thing you lose is that you have to think about where to put files as you get close to full storage.

100% agreed - snapshots are probably the killer feature even in single disk pools. Being able to cd into the hidden .zfs directory and copy out accidentally deleted files is helpful, as is being able to run zfs diff if a shell command or script goes wrong. httm is great too! I’m pretty happy with zrepl for managing snapshots.
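
For example (dataset and snapshot names made up):

cp /tank/media/.zfs/snapshot/auto-daily/movie.mkv /tank/media/   # recover a deleted file
zfs diff tank/media@auto-daily tank/media                        # list changes since that snapshot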

Generally it’s writes that cause flash wear, not reads. So scrubs shouldn’t meaningfully wear the drives, and a resilver only writes to the new drive being brought in. On many drives you can check the wearout indicators with smartctl.
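
On NVMe, something like this shows the endurance estimate and total writes (device name will vary):

smartctl -a /dev/nvme0n1 | grep -iE 'Percentage Used|Data Units Written'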

For my root pool that is on consumer level SSDs, I purchased a third SSD a year in and swapped one of the mirror drives. That way, I have two drives with different lifetimes (and different manufacturers), and a spare in the house I can swap in if one fails. That ended up being cheaper than purchasing enterprise-grade SSDs.
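
The swap itself is just a zpool replace; a sketch with hypothetical device names:

zpool replace rpool /dev/nvme0n1p3 /dev/nvme2n1p3   # old device, then new device
zpool status rpool                                  # watch the resilver complete before pulling the old drive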

One more optimization for reducing flash wear: you can increase zfs_txg_timeout from the default 5 seconds. This wouldn’t matter for media pools, but certainly would for root filesystems or anything with constant writes. I set options zfs zfs_txg_timeout=60 in /etc/modprobe.d/zfs.conf. That means transaction groups are only committed every 60 seconds, unless enough dirty data accumulates to force an earlier commit. It does mean that in the case of a hard crash, you could lose up to 60 seconds of data. Yet this vastly cut the total writes to my root pool, since so much of that data gets overwritten with different data within the 60-second window.
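
Concretely, assuming ZFS as a Linux kernel module (the second line applies it without a reboot):

echo 'options zfs zfs_txg_timeout=60' >> /etc/modprobe.d/zfs.conf   # persists across reboots
echo 60 > /sys/module/zfs/parameters/zfs_txg_timeout                # takes effect immediately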

I back up media to another ZFS pool and an offsite drive. ZFS is not perfect and your pool can still be damaged; you never know. We cannot control hardware failure, power surges, etc.

1 copy = I do not care about this data.

I get the whole deal with backups etc., but cost comes into it too, so you make tradeoffs. If I save up enough to build another pool in another machine, I absolutely would set that up, and rsync or whatever the data over. My truly critical data is indeed backed up to two separate locations (possibly more), but there’s not nearly as much of it as there are rips of things like Monty Python movies. I don’t WANT them to get lost, and I do CARE about this, as re-ripping is a pain, but if the worst comes to the worst, I can always get them back.

As long as you understand your risk profile, it’s all good. We just get loud about this because we see people thinking their setup is immortal, CONSTANTLY, and then looking for somebody else to blame when it dies. :slight_smile:

Hold up. You realize that a 3x4T Z2 vdev only gives you 4T usable, right?

The only thing that doesn’t die is rock and roll, according to AC/DC.

I’ve actually had a RAID-Z2 pool failure. One disk failed and the resilver stress cascaded to taking out two other drives. I suspect the relatively full pool, fragmentation, and not having sequential resilver available at the time added to the drive stress. Fortunately I had backed up all my important data, but re-ripping all that media was a giant pain in the rear.

(My kid was born around the same time; I had paternity leave and more or less held her in one arm while the other juggled discs. It took a while!)

Yeah, I’ve seen Z2 failures as well. The extra parity tends to matter less than the extra feeling of security admins get from having it, which in turn tends to encourage any latent tendency not to bother regularly checking status, and next thing ya know…

I’ve also seen a handful of controller failures that caused every connected drive to get garbage spewed all over it for hours on end, power events that destroy every component in the machine simultaneously… I could go on.

Dual parity is not disaster recovery; it’s a form of high availability (HA). And even at that, it’s really just a slightly increased window of time in which you can replace a failed drive without impacting uptime.

Which online RAIDZ calculator is the wrong one? I heard it mentioned on the 2.5 Admins podcast.
I am using: ZFS Capacity Calculator - WintelGuy.com

Also factor in your time as a cost to re-rip these. Surely you could get a 2 or 4TB hard drive. Yes, it might not fit the entire media pool, but it’s a start. I have an old PC with a few older disks as a last chance. Nice to have a secondary ZFS pool to fall back on.

I have a 10T drive which I only power on every once in a long while that has most of the stuff on it, as a secondary copy. Not exactly up to date, but better than a kick in the teeth.

IIRC the most egregiously wrong one is called “raidz calculator” but I don’t recommend any of them. Want to know how much space you’d get out of a six wide Z2 of 10T drives?

First, convert 10 terabytes to tebibytes: 10 * 1000^4 / 2^40 ≈ 9.095TiB. We’re going to use this size as an argument for truncate, which doesn’t like fractions, so make that 9313GiB per drive (9.095TiB * 1024).

root@box:/tmp# for i in {0..5} ; do truncate -s 9313G /tmp/testpool-$i.raw ; done
root@box:/tmp# zpool create testpool raidz2 /tmp/testpool-*.raw
root@box:/tmp# zpool status testpool
root@box:/tmp# zfs list testpool
root@box:/tmp# zpool destroy testpool
root@box:/tmp# rm /tmp/testpool-*.raw

This is fast and easy enough that I cannot possibly recommend any online calculator, even the ones that aren’t utterly batshit crazy wrong.

Thanks I’ve bookmarked for future reference.

That’s still better than nothing; it saves ripping the entire collection. I have a last-chance off-site hard drive too, should both pools get destroyed in a house fire. It has the media and saves me downloading pictures from AWS; that Glacier Deep Archive tier can get expensive to restore.