How zfs helped me survive 4 hard disk crashes in a row

This is a short success story about how I survived an apparent bad batch of hard disks without any data loss, because zfs is awesome.

I blogged recently about a bad batch of hard disks. Long story short: since January, the RAIDz1 on my home NAS has survived 4 hard disk crashes. Thanks to a regular scrub, each faulty disk was detected and replaced in time, so that I lost one disk every other week, but never had any data loss.

System configuration

I have a custom-built openSUSE machine running on a rather old Intel Celeron N3150. It’s not the greatest, but for the data usage of my family it is still up to the job. I have a SATA SSD for the root partition plus 2 VM disk images (yes I’m crazy) and 3x Seagate Ironwolf 12 TB disks in raidz1 where all the juicy stuff is. The NAS itself is much older: just 2.5 years ago I replaced a batch of much older WD Blue disks (student’s budget) with proper NAS drives. The WD Blues served me well, except for one time when a bad SATA cable caused some weird checksum issues from time to time. One cable replacement later, those issues were gone for good.

A batch of bad hard disks

TLDR - Within 6 months, 4 disks of my 3-disk array broke down. Thanks to a rather strict scrub policy of once per week, none of those disk crashes went undetected for long. Resilvering worked nicely, and despite losing 133% of my original disks I have no data loss whatsoever.

So, starting in January I got a SMART error for one of the HDDs. I was like uh-oh, but the zpool status still looked fine and no errors were reported. Until the first scrub, which immediately puked the disk out of the array (DEGRADED, disk OFFLINE). I pulled the disk out of the system and put it into my workstation. As soon as I powered the disk on, I heard scratching noises and the disk refused to work. Based on the noises, I assume a head crash. So I returned the disk to the vendor and got a replacement, but due to logistical issues only after a month of waiting, during which I kept the NAS turned off to avoid further issues. Running a degraded system without replacement disks makes me nervous.

After getting the replacement disk I immediately ordered a second disk, so that I would have a cold spare. This turned out to be a good decision in the following months.

So, after building the replacement disk into the system and resilvering it, the NAS purred happily again like a well-fed kitty next to a warm fireplace. Until about a month later, when I got the same issue: SMART reported an “unrecoverable sector count increase” and in the following scrub, zfs spat out the disk due to IO errors. So I jumped on the bike and went back to the vendor to ask for a replacement disk. In the meantime the cold spare was resilvered into the pool. I got the replacement disk a week later via home delivery, but could keep the system online the whole time. Having a cold spare was a good decision.

A week later, the replacement disk got spat out too, same procedure as always: SMART started to complain while the zpool still looked healthy, and at the next scrub the disk was ejected. This was a new disk, so back to the vendor it went. I had received the other replacement disk in the meantime, so no downtime here either. Resilvering worked well every time. So far it had been a bit of work, but nothing bad had happened :slight_smile:

Some time later, the last disk from the original array also got borked. This time zpool reported some IO errors before SMART complained, but both happened within the same day. This time there were no scratching noises, and the disk could still spin up. It was just ejected at some point during a scrub. Likely part of the disk was damaged, but no head crash this time. Still, I returned it, and they took it back without any complaints.

I’m still puzzled why I had such a high failure rate on NAS-grade hard disks. Given that the same NAS ran fine for years with the old WD Blue hard disks, I assume I got a bad batch. Time will tell.

My lessons learned

For me, this story had several lessons:

  • When SMART complains, the disk is likely already gone, but it will still take some time for you to notice (until you read from or write to the damaged sectors)
  • A zpool scrub will detect a hard disk failure more reliably than monitoring SMART. Keeping the scrub frequency high (i.e. once per week for me) is a good way of ensuring that your hard disks do not have undetected damage
  • If uptime is important, have a cold/hot spare disk at hand. Supply difficulties are a reality
  • zfs is amazing, because a) it can detect faulty hardware (scrubs are amazing!) and b) resilvering is easy, effective and fast.
  • For me, SMART is nothing but complementary from now on. In the end I trust the results of a successful scrub more than having no complaints from SMART. Because when (and if) SMART complains, it’s likely already too late.
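
To make the scrub lesson a bit more concrete, here is a minimal Python sketch of a health check that could run after each scrub. It relies only on the `pool:` and `state:` lines that `zpool status` prints; the pool name `tank`, the helper names and the hand-written sample text are assumptions for illustration, not output from the NAS in this story.

```python
import subprocess


def pool_states(status_text):
    """Map pool name -> state, based on the 'pool:' and 'state:' lines."""
    states = {}
    current = None
    for line in status_text.splitlines():
        key, _, value = line.strip().partition(":")
        if key == "pool":
            current = value.strip()
        elif key == "state" and current is not None:
            states[current] = value.strip()
    return states


def unhealthy_pools(status_text):
    """Return the names of all pools whose state is not ONLINE."""
    return [name for name, state in pool_states(status_text).items()
            if state != "ONLINE"]


def read_zpool_status():
    """Run `zpool status` and return its text (needs ZFS installed)."""
    result = subprocess.run(["zpool", "status"], capture_output=True,
                            text=True, check=True)
    return result.stdout


# Hand-written sample with a hypothetical pool name, only for illustration.
SAMPLE = "  pool: tank\n state: DEGRADED\n"
print(unhealthy_pools(SAMPLE))
```

On a real system one would pass `read_zpool_status()` instead of `SAMPLE` and send a mail or notification whenever the returned list is not empty.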

You wouldn’t be the first person to buy several hard drives from a bad batch, believe me. It’s not “common” in the sense that anyone EXPECTS it to happen to them RIGHT THEN, but it’s definitely “common” in the sense that if you talk to any greybeard sysadmin (hi!) they’ll most likely be able to tell you they’ve seen it happen for themselves.

With that said… I might be a bit concerned about other factors in the hardware environment. Is it the “same” disk failing out every time? If so, you want to replace that SATA cable. If it keeps happening, you also want to try swapping disks between ports, to see if the problem stays with the same port.

If the failures are happening all over the place, not on any particular port or in any particular bay, I’d be looking at the power environment next. Is this system on a UPS? How old is the power supply? Etc.


Thanks for the input! I couldn’t determine a common denominator between the disks: they were on different SATA ports, and the timing of the failures also appears random.

The NAS is behind a UPS, but the power supply is rather old. If there is another failure, replacing it might become a consideration.

My only remaining hypothesis is that a construction site 20m away might cause some vibrations, which hard disks don’t like. After the second failure I put one of those rubber mats for washing machines under the server to mitigate that risk, but since more failures happened afterwards, I’m more inclined to believe it was just a bad batch.


Time will tell, and thanks for the heads-up about the power supply. Gonna keep an eye on that one! :slightly_smiling_face:


I would strongly recommend you rebuild that pool as a RAIDZ2, you got lucky this time. Also recommend burn-in testing every disk (new or old) before you put it into use in the pool so it weeds out shipping damage. And 1) make sure you have regular backups, 2) TEST YOUR RESTORES.

The next pool, in about 3-5 years, will likely be a 5-disk raidz2; so far I’m happy with a raidz1. In both cases I do and will operate with an off-site backup. raidz1 is only about keeping the current zpool running, even if a hard disk crashes.

I’m not using dd for burn-in testing of my hard disks but wrote a small tool myself: disk-o-san. The main advantage is that I can interrupt the process at any time and resume it where it left off. I need this because the burn-in testing is performed on my workstation, which I tend to switch off at night. A single dd run takes too long on a 12 TB disk, that’s why I wrote it.
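
The resume idea can be sketched roughly like this in Python. This is not the disk-o-san code, just a generic illustration under assumptions: the chunk size, the JSON state file and the function names are all made up, and the read-back right after the write may be served from the page cache, so a real tool would need direct I/O or a separate verify pass. It overwrites the target, so only try it on a scratch file.

```python
import hashlib
import json
import os

CHUNK = 1024 * 1024  # bytes per step (assumed value, tune for real disks)


def pattern(index, size):
    """Deterministic test bytes for chunk `index`, so they can be re-derived."""
    seed = hashlib.sha256(index.to_bytes(8, "little")).digest()
    return (seed * (size // len(seed) + 1))[:size]


def burn_in(target, total, state_file, max_chunks=None):
    """Write and read back chunks of `target`, resuming from `state_file`.

    Returns the index of the next chunk that still has to be tested.
    """
    start = 0
    if os.path.exists(state_file):
        with open(state_file) as fh:
            start = json.load(fh)["next_chunk"]
    n_chunks = (total + CHUNK - 1) // CHUNK
    end = n_chunks if max_chunks is None else min(n_chunks, start + max_chunks)
    with open(target, "r+b") as dev:
        for i in range(start, end):
            size = min(CHUNK, total - i * CHUNK)
            data = pattern(i, size)
            dev.seek(i * CHUNK)
            dev.write(data)
            dev.flush()
            os.fsync(dev.fileno())
            dev.seek(i * CHUNK)
            if dev.read(size) != data:
                raise IOError(f"verify failed at chunk {i}")
            # Persist progress after every chunk so an interrupt loses little.
            with open(state_file, "w") as fh:
                json.dump({"next_chunk": i + 1}, fh)
    return end
```

Calling `burn_in(...)` again with the same state file simply continues at the stored chunk index, which is the whole point compared to restarting a single long dd run.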

Disclaimer: v1 has several issues, e.g. it’s slower than dd. I’m working on a v2 of the tool, but it’s summer with nice weather so the development has kinda stalled for a bit :sweat_smile:


Try to go six disks on your RAIDz2, if you can. Not the end of the world if you can’t, but you do get better efficiency when (n-p) is a power of two.


Oh thanks, will keep that in mind. I guess the reason behind this consideration is that when (n-p) = 2^m, then with a suitable ashift= configuration the IOPS are aligned. In other words: If the block size matches the RAIDz2 configuration, then zfs needs only to perform one IOP per disk, instead of two.

Right?

Not exactly. If n-p is a power of two, then any block (all block sizes are powers of 2) will divide evenly across the data disks. For example, a 128KiB block divides evenly into two, four, or eight pieces, so a three wide Z1, a six wide Z2, and a ten wide Z2 or eleven wide Z3 can all divvy up that 128KiB of data evenly.

But in an offsize RAIDz, you need padding. For a five wide Z2, every stripe has data on three disks and parity on two. 128KiB/3 comes out to 42.7KiB.

You need 11 4KiB sectors to store 42.7KiB, which means 1.3KiB of padding per disk. That applies to the parity as well as the data, so instead of each 128KiB block being stored in 192KiB on-disk (32KiB times six disks, in a six wide Z2), you’re storing each 128KiB block in 44KiB per disk * 5 disks == 220KiB on disk.
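
The arithmetic above can be replayed in a few lines of Python. This is only a sketch of the simplified per-disk model used in this post (split the block across the data disks, round each share up to whole 4KiB sectors, store the same amount on every disk including parity); it is not a model of the actual ZFS allocator, and the function name is made up.

```python
def on_disk_kib(block_kib, width, parity, sector_kib=4):
    """On-disk size of one block in the simplified per-disk model above."""
    data_disks = width - parity
    block_sectors = block_kib // sector_kib
    # Ceiling division: sectors each data disk needs for its share.
    sectors_per_disk = -(-block_sectors // data_disks)
    # Parity disks store as much as data disks, so multiply by full width.
    return sectors_per_disk * sector_kib * width


for width, parity in [(6, 2), (5, 2)]:
    total = on_disk_kib(128, width, parity)
    print(f"{width} wide Z{parity}: {total}KiB on disk, "
          f"{128 / total:.1%} of it is data")
```

For the six wide Z2 this gives 192KiB and for the five wide Z2 it gives 220KiB, matching the numbers above.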

This has an impact on both storage efficiency and on performance. It’s not an entirely catastrophic one, and as Matt Ahrens points out, it’s largely irrelevant for compressible data. But not all data is compressible… Especially not most of the many kinds of “Linux ISO” that people are so often building pools to store. :upside_down_face:

Put it all together, and nobody should feel bad about running an offsize RAIDz vdev… But it’s still not the worst idea to try to work with optimal widths if you can.


TIL - Thanks for the helpful summary!


Thanks for that info - I didn’t realise there was any such downside to running specific numbers of disks. My main Z2 pool is on 5 SSDs - but only because the laptop (+ ‘advanced’ dock) it is running on stupidly lacks any kind of access to the 6th Intel RST port; something I hadn’t realised when I first came up with the idea. I’d already bought 6 drives + spare, so instead I have 2 spares and less space (but enough). :slight_smile:

Incidentally my first attempt at this configuration was with ESXi 6.7, which I’d been running in other configs for several years. Don’t go there!
