Not quite a horror story, but certainly frustrating.
Last week my backup server threw a disk. No big deal; the drives were a few years old.
Then a few hours later it threw ANOTHER disk.
I immediately suspected cabling, since I had recently opened the machine to tighten some screws on the drive cage that had worked loose and were driving me mad with their buzzing.
So I shut down the system and reseated all the cabling, making sure everything was snug and secure.
Booted it back up, both drives showed online again, and everything started to resilver. Unfortunately, the pool was a pair of RAIDZ1 vdevs, so some data corruption had occurred; I blew away the affected datasets and started replicating again from the “production” machine.
I went to bed that night, and when I woke up the next morning, both drives had faulted again.
At this point, I considered that perhaps I was just extremely unlucky and that both drives (recertified 8TB Exos drives I’d been using for a few years) were simply bad.
So, I ordered a full set of 3 replacement disks and waited 2 days for them to arrive. (I know, I should have spares on hand, but I’ve been bad about doing so.)
Replaced the drives, blew away the entire pool and started again.
The next day, the drives in the same two positions had faulted again.
As it turns out, the SAS cable was the issue.
I’m not sure how I damaged it, and even examining it closely I can’t see anything wrong with it, but replacing the cable fixed the issue. The pool has now been up for 36 hours with no faults and has survived replicating 30TB+ of data.
Everything is good now, but boy does that ever prove the point that RAID is not a backup: a single bad SAS cable corrupted my entire backup server’s pool and cost me nearly a week of frustration.