Proving the point that backups are important

Not quite a horror story, but certainly frustrating.

Last week my backup server threw a disk. No big deal; the drives were a few years old.

Then a few hours later it threw ANOTHER disk.

At that point I immediately suspected cabling, since I had recently opened the machine to tighten some screws on the drive cage that had worked loose and were driving me mad with their buzzing.

So I shut down the system and reseated all the cabling, making sure everything was snug and secure.

Booted it back up, both drives showed online again, and everything started to resilver. Unfortunately, this was a pair of RAIDZ1 vdevs, so some data corruption had occurred; I blew away the affected datasets and started replicating again from the “production” machine.

I went to bed that night, and when I woke up the next morning, both drives had faulted again.

At this point, I considered that perhaps I was just extremely unlucky and both drives (recertified 8TB Exos drives that I’d been using for a few years) were just bad.

So, I ordered a full set of 3 replacement disks and waited 2 days for them to arrive. (I know, I should have spares on hand, but I’ve been bad about doing so.)

Replaced the drives, blew away the entire pool and started again.

The next day, the drives in the same two positions had faulted.

As it turns out, the SAS cable was the issue.

I’m not sure how I damaged it, and even examining it closely I can’t find a problem with it, but replacing the cable fixed the issue. The pool has been up for 36 hours with no faults and survived replicating 30TB+ of data.

Everything is good now, but boy does that ever prove the point that RAID is not a backup.

A single bad SAS cable corrupted my entire backup server pool and cost me nearly a week of frustration.


For whatever reason, cables seem to be the weakest link in the whole chain.

I couldn’t tell you how MANY damn times I’ve had drives fault out, come back after a reboot and fault out again. Fixes range from unplugging/re-plugging the connector a few times (both ends) to just replacing the damn cable.

Over the years I’ve probably tossed out at least 5 sets of 8087 breakout cables. Once a drive starts faulting multiple times after the unplug/re-plug dance, that cable is trashed. Can’t be trusted. Once in a while it’s the actual drive that fails, but at least 10-to-1 it’s the damn cable.
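One pattern that gives the cable away is the same physical positions faulting across reboots while the other drives stay clean. Here’s a quick sketch of that tally; the pool and device names are hypothetical, and it just parses the usual `zpool status` layout:

```python
# Sketch: tally which device positions keep faulting across successive
# `zpool status` outputs, to spot a cable-side pattern.
import re
from collections import Counter

def faulted_devices(zpool_status: str) -> list[str]:
    """Return device names listed as FAULTED in `zpool status` output."""
    return re.findall(r"^\s+(\S+)\s+FAULTED", zpool_status, re.MULTILINE)

# Two status snapshots taken after successive reboots (made-up devices).
snapshots = [
    """
      NAME        STATE     READ WRITE CKSUM
      tank        DEGRADED     0     0     0
        raidz1-0  DEGRADED     0     0     0
          da1     FAULTED      0    12     0  too many errors
          da2     ONLINE       0     0     0
          da3     FAULTED      0     9     0  too many errors
    """,
    """
      NAME        STATE     READ WRITE CKSUM
      tank        DEGRADED     0     0     0
        raidz1-0  DEGRADED     0     0     0
          da1     FAULTED      0     4     0  too many errors
          da2     ONLINE       0     0     0
          da3     FAULTED      0     7     0  too many errors
    """,
]

repeat_offenders = Counter(d for s in snapshots for d in faulted_devices(s))
# da1 and da3 fault in every snapshot: shared cabling is the prime suspect.
print(repeat_offenders.most_common())
```

If the same two or three slots fault every time and they share a breakout cable, swap the cable before condemning the drives.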

My workstations (and laptops) all have a backup drive in them. I run zrepl to snapshot and replicate the home (and other important dirs) datasets to the backup. I have multiple backup systems that power on at night, and they run a zrepl job to suck those datasets into them.
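For anyone curious, a zrepl push job for that kind of workstation-to-backup replication looks roughly like this. This is a sketch only: the job name, address, dataset paths, and pruning grids are made-up placeholders, not my actual config.

```yaml
jobs:
  - name: backup_home
    type: push
    connect:
      type: tcp
      address: "backupbox.lan:8888"   # placeholder backup target
    filesystems:
      "rpool/home<": true             # home dataset and all children
    snapshotting:
      type: periodic
      interval: 15m
      prefix: zrepl_
    pruning:
      keep_sender:
        - type: not_replicated        # never prune what hasn't replicated yet
        - type: grid
          grid: 1x1h(keep=all) | 24x1h | 14x1d
          regex: "^zrepl_"
      keep_receiver:
        - type: grid
          grid: 24x1h | 30x1d | 12x30d
          regex: "^zrepl_"
```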

Heh, one of those backup boxes is an old tower with an ancient mobo in it. A single HBA with an expander and 24x old phased-out drives. Its only job in life is to power up at night and be a target.

For laptops you can set up a udev rule to kick off a replication to a plugged-in USB backup drive too.
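A sketch of such a rule, matching the backup drive by filesystem UUID (the UUID, unit name, and script path are placeholders). One caveat: udev kills long-running `RUN+=` commands after a short timeout, so hand the replication off to systemd rather than running the script directly:

```
# /etc/udev/rules.d/99-usb-backup.rules -- sketch; UUID and paths are placeholders
ACTION=="add", SUBSYSTEM=="block", ENV{ID_FS_UUID}=="1234-ABCD", \
    RUN+="/usr/bin/systemd-run --unit=usb-backup --no-block /usr/local/bin/usb-backup.sh"
```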


Can confirm. Back when I was still using cases without hot swap bays, I used to replace cables before even attempting to replace drives.

These days, rack mount chassis with proper front load hot swap bays make replacing the drives so much easier, I do generally try replacing the drive before the cable. But the cable is still about as likely to be the problem as the drive is.


Oh definitely. For non-rackmount systems I’m fond of these for hot-swap.

Regardless, vibration takes a toll, and the connectors either work just a tiny bit loose, or the connector plastic deforms ever so slightly, or the metal contacts shift a teensy bit, and then the damn cables become untrustworthy.

I’ve found that the Rosewill ones work quite well and are quite a bit cheaper than Icy Dock.

This particular system does have some 5.25" bays, but I haven’t wanted to spend the money or energy to add hot swap into the backup server.

This incident does have me considering it, though.


I’ve found Icy Dock to be a bit dubious, honestly. Like, in a one-off system that’s sitting on your desk and you just want a little bit of ease of drive swapping or whatever, probably fine. In a server… nope. Nope nope nope. Not for me personally.

The Rosewill cages are the exact same ones that they put in their 12-bay server chassis. And the server chassis do quite well. The only real issue is you have to be sure that none of the backplane cabling gets pinched when the cage is installed; that happens occasionally even on the server chassis. If it does, the rightmost bay in that cage will be non-functional until the wiring is fixed.

That happens often enough to be a known issue, but I’ve only seen it twice out of twenty-plus systems built on that 12-bay chassis.