This is a short success story about how I survived an apparently bad batch of hard disks without any data loss, because zfs is awesome.
I blogged recently about a bad batch of hard disks - long story short, since January the RAIDz1 on my home NAS has survived 4 hard disk crashes. Thanks to a regular scrub, each faulty disk was detected and replaced in time, so that I lost one disk every other week, but never had any data loss.
I have a custom-built openSUSE machine running on a rather old Intel Celeron N3150. It’s not the greatest, but for the data usage of my family it is still up to the job. I have a SATA SSD for the root partition plus 2 VM disk images (yes I’m crazy) and 3x Seagate Ironwolf 12 TB disks in raidz1 where all the juicy stuff is. The NAS itself is much older; just 2.5 years ago I upgraded from a batch of much older WD Blue disks (student’s budget) to proper NAS drives. The WD Blues served me well, except for one time when a bad SATA cable caused some weird checksum issues from time to time. One cable replacement later, those issues were gone for good.
TLDR - Within 6 months, 4 disks out of my 3-disk array broke down. Due to a rather strict scrub policy of once per week, none of those disk crashes went undetected for long. Resilvering worked nicely, and despite losing 133% of my original disks I have no data loss whatsoever.
So, starting in January I got a SMART error for one of the HDDs. I was like uh-oh, but so far zpool status was looking fine and no errors were being reported. Until the first scrub, which immediately puked the disk out of the array (DEGRADED, disk OFFLINE). I pulled the disk out of the system and put it into my workstation. Already when powering the disk on, I heard scratching noises and the disk refused to work - I assume a head crash, based on the noises. So I returned the disk to the vendor and got a replacement disk - due to logistical issues only after a month of waiting, during which I turned the NAS off to avoid further issues. Running a degraded system without replacement disks makes me nervous.
After getting the replacement disk I immediately ordered a second disk, so that I would have a cold spare. This would turn out to be a good decision in the following months.
So, after building the replacement disk into the system and resilvering it, it purred happily again like a well-fed kitty next to a warm fireplace. Until about a month later, when I got the same issue - SMART reported an “unrecoverable sector count increase” and in the following scrub, zfs spat out the disk due to IO errors. So I jumped on the bike and went back to the vendor, asking for a replacement disk. In the meantime the cold spare disk was resilvered into the array. I got the replacement disk a week later via home delivery, but could keep the system online the whole time. Having a cold spare was a good decision.
A week later, the replacement disk got spat out, same procedure as always: SMART started to complain while the zpool still looked healthy, and in the next scrub the disk got ejected. This was a new disk, so back to the vendor it went. I had received the other replacement disk in the meantime, so no downtime here either. Resilvering worked well every time, and so far it was a bit of work, but nothing bad happened.
Some time later, the last disk from the original array also got borked. This time zpool reported some IO errors before SMART complained, but both happened within the same day. This time there were no scratching noises, and the disk could still spin up. It just got ejected at some point during a scrub. Likely part of the disk was damaged, but no head crash this time. Still, I returned it, and they took it back without any complaints.
I’m still puzzled why I had such a high failure rate on NAS-grade hard disks. Given that the same NAS was running fine for years with the old WD Blue hard disks, I assume I got a bad batch. Time will tell.
For me, this story had several lessons:
- When SMART complains, the disk is likely already gone, but it will still take some time for you to notice (until you read or write in the damaged sectors)
- A zpool scrub will detect a hard disk failure more reliably than monitoring SMART. Keeping the scrub frequency high (i.e. once per week for me) is a good way of ensuring that your hard disks do not have undetected damage
- If uptime is important, have a cold/hot spare disk at hand. Supply difficulties are a reality
- zfs is amazing, because a) it can detect faulty hardware (scrubs are amazing!) and b) resilvering is easy, effective and fast.
- For me, SMART is merely complementary from now on. In the end I trust the results of a successful scrub more than having no complaints from SMART. Because when (and if) SMART complains, it’s likely already too late.
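
Since the scrub result was the signal that mattered, the check after each scrub is easy to automate. Below is a minimal sketch of one way to do it, not what actually runs on my NAS: it scans the device table of `zpool status` output and reports every row that is not `ONLINE` or has a non-zero READ/WRITE/CKSUM counter. The pool name `tank`, the device names and the sample output are made up for illustration, the function name `find_problems` is hypothetical, and the table layout is assumed to match the usual `NAME STATE READ WRITE CKSUM` columns.

```python
# Made-up sample of `zpool status` output; pool and device names are
# hypothetical and only serve to exercise the parser below.
SAMPLE_STATUS = """\
  pool: tank
 state: DEGRADED
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            sda     ONLINE       0     0     0
            sdb     OFFLINE      0     0     0
            sdc     ONLINE       0     0     3

errors: No known data errors
"""


def find_problems(status_text):
    """Return (name, state, read, write, cksum) for every suspicious table row.

    A row is suspicious if its state is not ONLINE or any error counter is
    not "0". Counters stay strings because zpool may abbreviate large values.
    """
    problems = []
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if not in_table:
            # The device table starts right after the column header line.
            in_table = fields[:2] == ["NAME", "STATE"]
            continue
        if not fields:
            # A blank line ends the device table.
            in_table = False
            continue
        if len(fields) < 5:
            # Rows without counters (e.g. section labels) are skipped.
            continue
        name, state, read, write, cksum = fields[:5]
        if state != "ONLINE" or (read, write, cksum) != ("0", "0", "0"):
            problems.append((name, state, read, write, cksum))
    return problems


if __name__ == "__main__":
    for name, state, read, write, cksum in find_problems(SAMPLE_STATUS):
        print(f"{name}: state={state} read={read} write={write} cksum={cksum}")
```

On a real system the text would come from running `zpool status` for the pool (for example via `subprocess.run` with `capture_output=True, text=True`) instead of the hard-coded sample, and the result could be mailed out after each weekly scrub.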