How zfs helped me to survive 4 hard disk crashes in a row

phoenix · July 25, 2023, 7:44pm

This is a short success story about how I survived an apparent bad batch of hard disks without any data loss, because zfs is awesome.

I blogged recently about a bad batch of hard disks - long story short, since January the RAIDz1 on my home NAS has survived 4 hard disk crashes. Thanks to a regular scrub, each faulty disk was detected and replaced in time, so although I lost one disk every other week, I never had any data loss.

System configuration

I have a custom-built openSUSE machine running on a rather old Intel Celeron N3150. It’s not the greatest, but for the data usage of my family it is still up to the job. I have a SATA SSD for the root partition plus 2 VM disk images (yes I’m crazy) and 3x Seagate Ironwolf 12 TB disks in raidz1 where all the juicy stuff is. The NAS itself is much older; I upgraded from a batch of much older WD Blue disks (student’s budget) to proper NAS drives just 2.5 years ago. The WD Blues served me well, except for one time when a bad SATA cable caused some weird checksum issues from time to time. One cable replacement later, those issues were gone for good.

A batch of bad hard disks

TLDR - Within 6 months, 4 disks in my 3-disk array broke down. Due to a rather strict scrub policy of once per week, none of those disk crashes went undetected for long. Resilvering worked nicely, and despite losing 133% of my original disks I have no data loss whatsoever.

So, starting in January I got a SMART error for one of the HDDs. I was like uh-oh, but so far the zpool status was looking fine, with no errors being reported. Until the first scrub, which immediately puked the disk out of the array (DEGRADED, disk OFFLINE). I pulled the disk out of the system and put it into my workstation. As soon as I powered the disk on, I heard scratching noises and the disk refused to work - I assume a head crash, based on the noises. So I returned the disk to the vendor and got a replacement disk - due to logistical issues only after a month of waiting, during which I turned the NAS off to avoid further issues. Running a degraded system without replacement disks makes me nervous.

After getting the replacement disk I immediately ordered a second disk, so I have a cold spare. This will turn out to be a good decision in the following months.

So, after building the replacement disk into the system and resilvering it, it purrs happily again like a well-fed kitty next to a warm fireplace. Until about a month later I get the same issue - SMART reports an “unrecoverable sector count increase” and in the following scrub, zfs spits out the disk due to IO errors. Me jumping on the bike, back to the vendor, asking for a replacement disk. In the meantime the cold spare disk is being resilvered. I will get the replacement disk a week later via home delivery but can keep the system online. Having a cold spare was a good decision.

A week later, the replacement disk got spit out, same procedure as always: SMART starts to complain, but the zpool looks healthy, and after the next scrub the disk gets ejected. This was a new disk, so back to the vendor. I had received the other replacement disk in the meantime, so no downtime here either. Resilvering worked well every time, and so far it’s been a bit of work, but nothing bad has happened.

Some time later, the last disk from the original array also got borked. This time zpool reported some IO errors before SMART complained, but both happened within the same day. This time there were no scratching noises, and the disk could still spin up. It was just ejected at some point during a scrub. Likely part of the disk was damaged, but no head crash this time. Still, I returned it, and they took it back without any complaints.

I’m still puzzled why I had such a high failure rate on NAS-grade hard disks. Given that the same NAS was running fine for years with the old WD Blue hard disks, I assume I got a bad batch. Time will tell.

My lessons learned

For me, this story had several lessons:

When SMART complains, the disk is likely already gone, but it will still take some time for you to notice (until you read or write in the damaged sectors).
A zpool scrub will detect a hard disk failure more reliably than monitoring SMART. Keeping the scrub frequency high (i.e. once per week for me) is a good way of ensuring that your hard disks do not have undetected damage.
If uptime is important, have a cold/hot spare disk at hand. Supply difficulties are a reality.
zfs is amazing, because a) it can detect faulty hardware (scrubs are amazing!) and b) resilvering is easy, effective and fast.
For me, SMART is nothing more than complementary from now on. In the end I trust the results of a successful scrub more than having no complaints from SMART. Because when (and if) SMART complains, it’s likely already too late.
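
The weekly scrub routine from these lessons could be automated roughly as follows. This is a minimal sketch, not the setup actually used on the NAS: the function names are made up for illustration, and it assumes that `zpool status -x` prints a line containing "is healthy" or "are healthy" when nothing is wrong.

```shell
#!/bin/bash
# Sketch of a scrub-and-check job. Function names are illustrative
# assumptions, not taken from the original setup.

# Succeeds if the `zpool status -x` output on stdin reports a healthy state.
status_is_healthy() {
    grep -Eq "(is|are) healthy"
}

# Start a scrub on the given pool (the scrub itself runs in the background).
start_scrub() {
    zpool scrub "$1"
}

# Print a one-line verdict for the given pool; return non-zero on problems.
check_pool() {
    if zpool status -x "$1" | status_is_healthy; then
        echo "OK: $1 is healthy"
    else
        echo "WARNING: $1 needs attention, run 'zpool status $1'"
        return 1
    fi
}
```

One possible arrangement is to call `start_scrub <pool>` from a weekly cron job or systemd timer and `check_pool <pool>` from a daily one, with the output mailed to the admin.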

mercenary_sysadmin · July 25, 2023, 10:48pm

You wouldn’t be the first person to buy several hard drives from a bad batch, believe me. It’s not “common” in the sense that anyone EXPECTS it to happen to them RIGHT THEN, but it’s definitely “common” in the sense that if you talk to any greybeard sysadmin (hi!) they’ll most likely be able to tell you they’ve seen it happen for themselves.

With that said… I might be a bit concerned about other factors in the hardware environment. Is it the “same” disk failing out every time? If so, you want to replace that SATA cable. If it keeps happening, you also want to try swapping disks between ports, to see if the problem stays with the same port.

If the failures are happening all over the place, not on any particular port or in any particular bay, I’d be looking at the power environment next. Is this system on a UPS? How old is the power supply? Etc.

phoenix · July 26, 2023, 6:40am

Thanks for the input! I couldn’t determine a common denominator between the disks, they were on different SATA ports, and also the timing of the failures appears random.

The NAS is behind a UPS, but the power supply is rather old. If there is a new failure this might become a consideration for replacement.

My only remaining hypothesis is that a construction site 20m away might cause some vibrations, which hard disks don’t like. After the second failure I put one of those rubber mats for washing machines under the server to mitigate that risk, but since more failures happened afterwards, I’m more inclined to believe it was just a bad batch.

Time will tell, and thanks for the heads-up about the power supply. Gonna keep an eye on that one!

kneutron · July 28, 2023, 2:45pm

I would strongly recommend you rebuild that pool as a RAIDZ2; you got lucky this time. I’d also recommend burn-in testing every disk (new or old) before you put it into use in the pool, to weed out shipping damage. And 1) make sure you have regular backups, 2) TEST YOUR RESTORES.

github.com/kneutron/ansitest/blob/master/SMART/scandisk-bigdrive-2tb+.sh

#!/bin/bash

# Burn-in test spinning disk before putting it into use
# W,R scan of big HD -- DESTRUCTIVE
# REQUIRES: hdparm, smartmontools, tee

# NOTE pass only sdX as arg, no /dev needed
# MAKE SURE you use the right /dev/sdX, I take *no responsibility* for data loss!
# Requires key-input / Enter to continue

# Recommended to run this from GNU ' screen ' as root

argg=/dev/$1
logfile=~/scandisk-bigdrive.log

# This is for old IDE drives
hdparm -c1 -d1 -u1 $argg

hdparm -S 120 $argg # fastsleep after test, save power

This file has been truncated.

phoenix · July 29, 2023, 1:20pm

The next pool, in about 3-5 years, will likely be a 5-disk raidz2; so far I’m happy with raidz1. In both cases I do and will operate with an off-site backup. raidz1 is only about keeping the current zpool running, even if a hard disk crashes.

I’m not using dd for burn-in testing of my hard disks but wrote a small tool myself: disk-o-san. The main advantage is that I can interrupt the process at any time and resume it where it left off. I need this because the burn-in testing is performed on my workstation, which I tend to switch off at night. A single dd run takes too long on a 12 TB disk, which is why I wrote it.

Disclaimer: v1 has several issues, e.g. it’s slower than dd. I’m working on a v2 of the tool, but it’s summer with nice weather, so development has kinda stalled for a bit.
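
To illustrate the interrupt-and-resume idea in general terms, here is a small sketch. It is not disk-o-san's code and makes no claim about how that tool works: the function name, the state-file format and the optional chunk limit are all assumptions, and unlike a full burn-in it only reads the target in chunks rather than writing test patterns.

```shell
#!/bin/bash
# resumable_scan TARGET CHUNK_BYTES STATE_FILE [MAX_CHUNKS]
# Hypothetical sketch: read TARGET chunk by chunk and remember the next
# chunk index in STATE_FILE, so an interrupted run continues where it stopped.
resumable_scan() {
    local target="$1" chunk="$2" state="$3" max="${4:-0}"
    local size total next=0 done_now=0

    # Size of a regular file; for a real disk, `blockdev --getsize64`
    # would be the appropriate way to get the size instead.
    size=$(wc -c < "$target")
    total=$(( (size + chunk - 1) / chunk ))

    # Resume from the saved position, if there is one.
    if [ -f "$state" ]; then
        next=$(cat "$state")
    fi

    while [ "$next" -lt "$total" ]; do
        # Read exactly one chunk; a read error aborts the scan.
        dd if="$target" of=/dev/null bs="$chunk" skip="$next" count=1 status=none || return 1
        next=$(( next + 1 ))
        echo "$next" > "$state"
        done_now=$(( done_now + 1 ))
        # Optional limit on chunks per run (0 means no limit).
        if [ "$max" -gt 0 ] && [ "$done_now" -ge "$max" ]; then
            break
        fi
    done
    echo "$next/$total"
}
```

Because the position is saved after every chunk, switching the machine off mid-run costs at most one chunk of repeated work.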

mercenary_sysadmin · July 29, 2023, 2:54pm

Try to go six disks on your RAIDz2, if you can. Not the end of the world if you can’t, but you do get better efficiency when (n-p) is a power of two.

phoenix · July 30, 2023, 7:41am

Oh thanks, will keep that in mind. I guess the reason behind this consideration is that when (n-p) = 2^m, then with a suitable ashift= configuration the I/O operations are aligned. In other words: if the block size matches the RAIDz2 configuration, then zfs only needs to perform one I/O operation per disk, instead of two.

Right?

mercenary_sysadmin · July 30, 2023, 2:43pm

Not exactly. If n-p is a power of two, then any block–all blocks are powers of 2–will divide evenly into it. For example, a 128KiB block divides evenly into two, four, or eight pieces–so a three wide Z1, a six wide Z2, and a ten wide Z2 or eleven wide Z3 can all divvy up that 128KiB of data evenly.

But in an offsize RAIDz, you need padding. For a five wide Z2, every stripe has data on three disks and parity on two. 128KiB/3 comes out to 42.7KiB.

You need 11 4KiB sectors to store 42.7KiB, which means 1.3KiB of padding per disk–which applies to the parity as well as the data, so instead of each 128KiB block being stored in 192KiB on-disk (32KiB times six disks, in a six wide Z2) you’re storing each 128KiB block on 44KiB per disk * 5 disks == 220KiB on disk.

This has an impact on both storage efficiency and on performance. It’s not an entirely catastrophic one, and as Matt Ahrens points out, it’s largely irrelevant for compressible data. But not all data is compressible… Especially not most of the many kinds of “Linux ISO” that people are so often building pools to store.

Put it all together, and nobody should feel bad about running an offsize RAIDz vdev… But it’s still not the worst idea to try to work with optimal widths if you can.
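
The padding arithmetic above can be reproduced with a few lines of shell. This is a sketch of the simplified per-disk model used in this post only (split the block across the data disks, round each share up to whole sectors, store that much on every disk); the function name is made up, and the real ZFS allocator may round slightly differently.

```shell
#!/bin/bash
# raidz_on_disk_kib WIDTH PARITY BLOCK_KIB SECTOR_KIB
# Simplified model from the discussion above: split the block evenly across
# the data disks, round each share up to whole sectors, and store a share of
# that size on every disk (data and parity alike).
raidz_on_disk_kib() {
    local width="$1" parity="$2" block_kib="$3" sector_kib="$4"
    local data_disks=$(( width - parity ))
    local block_sectors=$(( block_kib / sector_kib ))
    # Ceiling division: sectors needed per disk.
    local per_disk=$(( (block_sectors + data_disks - 1) / data_disks ))
    echo $(( per_disk * sector_kib * width ))
}

raidz_on_disk_kib 6 2 128 4   # six wide Z2  -> 192
raidz_on_disk_kib 5 2 128 4   # five wide Z2 -> 220
```

The two calls match the 192KiB and 220KiB figures worked out by hand above.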

phoenix · August 3, 2023, 4:29pm

TIL - Thanks for the helpful summary!

Roopee · August 14, 2023, 8:21pm

Thanks for that info - I didn’t realise there was any such downside to running specific numbers of disks. My main Z2 pool is on 5 SSDs - but only because the laptop (+ ‘advanced’ dock) it is running on stupidly lacks any kind of access to the 6th Intel RST port; something I hadn’t realised when I first came up with the idea. I’d already bought 6 drives + spare, so instead I have 2 spares and less space (but enough).

Incidentally my first attempt at this configuration was with ESXi 6.7, which I’d been running in other configs for several years. Don’t go there!

phoenix · October 2, 2024, 6:57am

For any future reader: See also Why are my ZFS disks so noisy? | We Love Open Source - All Things Open
Suggestion: Start with the “Putting it all together” section, and if you understand that, advance from there. Otherwise read the full article.

mercenary_sysadmin · October 2, 2024, 1:19pm

I was inspired to write that piece based on answering another question here!

karl · October 10, 2024, 12:28pm

Any further disk failures since OP?

phoenix · December 17, 2024, 7:50am

All good so far, no further failures. The disks are purring happily ever since.

And with purring, I really mean purring, those disks are loud AF (sorry for the language, but emphasis is required).

amacieli · March 24, 2025, 7:25pm

I went through a bad patch myself last year. I have a RaidZ2 array, though, so even with 2 disks out (which I had, at one point, after one of the kids kicked out a SATA cable by accident), the array was still up. Anyway, I wrote a script that you all may find useful - it creates a nice table of drive letters, serial numbers, pool status, and SMART status, so you can easily find the right disk if you need to. I even have my server email me the output every Friday.

#!/bin/bash

# Check if required commands exist
if ! command -v zpool &> /dev/null; then
    echo "Error: zpool command not found. Please install ZFS utilities."
    exit 1
fi
if ! command -v smartctl &> /dev/null; then
    echo "Error: smartctl command not found. Please install smartmontools."
    exit 1
fi

# Temporary files
TEMP_ZPOOL=$(mktemp)
TEMP_LSBLK=$(mktemp)

# Run zpool status and save output
zpool status > "$TEMP_ZPOOL"

# Extract pool name
POOL_NAME=$(grep -m 1 "pool:" "$TEMP_ZPOOL" | awk '{print $2}')
if [ -z "$POOL_NAME" ]; then
    echo "Error: Could not determine pool name from zpool status."
    rm "$TEMP_ZPOOL" "$TEMP_LSBLK"
    exit 1
fi

# Fetch lsblk output for sd[a-z] drives
lsblk -dno NAME,SERIAL /dev/sd[a-z] 2>/dev/null | awk '{print $1 " " $2}' > "$TEMP_LSBLK"

# Print header with fourth column
printf "%-5s %-20s %-10s %-15s\n" "Drive" "Serial Number" "Status" "SMART Status"
printf "%-5s %-20s %-10s %-15s\n" "-----" "--------------" "------" "------------"

# Process drives from zpool status (sd[a-z] or ata-)
grep -E "^[[:space:]]+(sd[a-z]([^0-9]|$)|ata-)" "$TEMP_ZPOOL" | while read -r line; do
    # Extract name and status
    NAME=$(echo "$line" | awk '{print $1}')
    STATUS=$(echo "$line" | awk '{print $2}')

    # Determine drive letter and serial
    if [[ "$NAME" =~ ^sd[a-z]$ ]]; then
        DRIVE="$NAME"
        SERIAL=$(grep "^$DRIVE " "$TEMP_LSBLK" | awk '{print $2}')
    else
        SERIAL_SUFFIX=$(echo "$NAME" | awk -F'_' '{print $NF}')
        DRIVE=$(grep " $SERIAL_SUFFIX$" "$TEMP_LSBLK" | awk '{print $1}')
        SERIAL="$SERIAL_SUFFIX"
    fi

    # Skip if no drive match
    if [ -z "$DRIVE" ]; then
        echo "Warning: Could not map $NAME to an sd[a-z] device"
        continue
    fi

    # Extract portion after rightmost underscore
    if [[ "$SERIAL" =~ _ ]]; then
        SERIAL=$(echo "$SERIAL" | awk -F'_' '{print $NF}')
    fi
    [ -z "$SERIAL" ] && SERIAL="Unknown"

    # Get SMART status
    SMART_RESULT=$(smartctl -H "/dev/$DRIVE" | grep -i "SMART overall-health" | awk '{print $NF}')
    if [[ "$SMART_RESULT" == "PASSED" ]]; then
        SMART_STATUS="SMART PASSED"
    elif [[ "$SMART_RESULT" == "FAILED" ]]; then
        SMART_STATUS="SMART FAILED"
    else
        SMART_STATUS="SMART UNKNOWN"
    fi

    # Print formatted line with SMART status
    printf "%-5s %-20s %-10s %-15s\n" "$DRIVE" "$SERIAL" "$STATUS" "$SMART_STATUS"
done

# Pool status
POOL_STATUS=$(grep "state:" "$TEMP_ZPOOL" | head -n 1 | awk '{print $2}')
[ -z "$POOL_STATUS" ] && POOL_STATUS="Unknown"

# Print pool status (no SMART column for pool)
printf "\n%-5s %-20s %-10s\n" "Pool" "$POOL_NAME" "$POOL_STATUS"

# Clean up
rm "$TEMP_ZPOOL" "$TEMP_LSBLK"

Example output:

Drive Serial Number        Status     SMART Status
----- --------------       ------     ------------
sda   PK2334PCG37N3B       ONLINE     SMART PASSED
sdb   PK1334PBHJSSVP       ONLINE     SMART PASSED
sdc   PAGP0X4W             ONLINE     SMART PASSED
sdd   PK1334PBHHM5HX       ONLINE     SMART PASSED
sde   PK2381PBJJAJ9T       ONLINE     SMART PASSED
sdf   PK1334PEH0K2AS       ONLINE     SMART PASSED
sdg   PK2334PCG458MB       ONLINE     SMART PASSED
sdh   PN2334PCGPX73B       ONLINE     SMART PASSED

Pool  evilpool             ONLINE

Topslakr · March 25, 2025, 12:04am

This is neat! I do something similar, but with a boatload more info and not as pretty an output. I’ve been migrating that over to a HealthChecks.io ping instead, but I’m tempted by this!

The only issue for me is that it only shows me the status of one zpool. It shows all the disks in all the pools, but only one pool’s name and state. Perhaps you only have one pool on your system?

Thanks for posting!
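
One way the single-pool limitation could be lifted is to collect every pool name and state from the `zpool status` output instead of only the first match. The following is a hedged sketch rather than a patch to the script above: the function name is made up, and the pool name `tank` in the usage example is hypothetical.

```shell
#!/bin/bash
# Print "<pool> <state>" for every pool found in `zpool status` output
# read from stdin. Each "pool:" line is paired with the next "state:" line.
list_pool_states() {
    awk '$1 == "pool:"  { name = $2 }
         $1 == "state:" && name != "" { print name, $2; name = "" }'
}

# Usage on a live system:
#   zpool status | list_pool_states
```

For a system with the pools `evilpool` and `tank`, this would print one line per pool, which could then be fed into the same `printf` format the script uses for its final "Pool" row.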