So, the regular scrub on my Ubuntu 24.04 box revealed that one disk of a mirrored vdev had errors and that the pool was in a degraded state. The same disk had shown some errors in ZFS the day before, which ZFS had healed, so I already had a replacement disk in the post. While I waited for the disk to arrive I shut the machine down.
I rebooted the machine today and, after it had been up for a few minutes, went to look at the pool. To my surprise the pool was showing as healthy, as a resilvering event had taken place.
My question is: is this normal? Does ZFS always try to cover over disk errors like this? It feels a little hinky that a pool that was showing as degraded is suddenly healthy again.
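For reference, this is roughly what I looked at after the reboot (the pool name is a stand-in for mine):

    # per-device error counters plus the resilver summary
    zpool status -v tank

    # the pool history also records the resilver and any clear/replace operations
    zpool history tank | tail -n 20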
I’m not 100% certain what does or doesn’t trigger an automatic resilver specifically, but I have seen the behavior you’re describing (“one reboot later, it sorts itself out”) plenty of times over the years.
One of the more common scenarios is that a disk drops off the SATA bus and ZFS marks the pool degraded. Then you bounce the box, the drive shows up on the bus again, and ZFS finds it, notes that the pool has been through some TXGs since the disk was last available, and fast-resilvers just the data written in the meantime back onto the drive.
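If you want to confirm that’s what happened on your box, this is roughly where I’d look (pool name is a placeholder, and the exact event names vary a bit between platforms):

    # ZED event log: look for device removal/attach and resilver start/finish events
    zpool events -v | less

    # kernel log: did the disk fall off the bus and reappear at boot?
    dmesg | grep -iE 'ata|sas|link (up|down|reset)'

    # the status output also shows how much data the resilver actually touched
    zpool status tank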
I’ve experienced similar phantom degradation of my pool in the past. It typically happens due to a loose or flaky connector. The first thing I do is shut down, reconnect all drives, reseat the SATA controller, and boot up and scrub.
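Concretely, once everything is reseated, my usual sequence looks something like this (pool name is a placeholder; skip the clear if you’d rather keep the error counters around as evidence):

    # reset the error counters now that the hardware has been reseated
    zpool clear tank

    # run a full scrub and keep an eye on its progress
    zpool scrub tank
    zpool status tank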
I also check SMART like you’ve done. I run LibreNMS with alerts enabled for pending or reallocated sectors; as soon as either count is nonzero I preemptively replace the drive.
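If you don’t have a monitoring stack handy, a quick manual check with smartmontools covers the same attributes (the device path is just an example):

    # pending/reallocated/uncorrectable sectors are the usual early-warning attributes
    smartctl -A /dev/sda | grep -Ei 'Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable'

    # or kick off a short self-test and read the result afterwards
    smartctl -t short /dev/sda
    smartctl -l selftest /dev/sda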
If you have a spare bay in your case and a spare connector, go ahead and put that shiny new drive in and add it to the pool as a hot spare. It’ll make replacement faster (automatic, even? I can’t recall) when you have another failure in the future.
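For what it’s worth, adding it is a one-liner (pool name and device path are examples; by-id paths survive device renumbering):

    # attach the new disk to the pool as a hot spare
    zpool add tank spare /dev/disk/by-id/ata-EXAMPLE_SERIAL

    # optional: let a fresh disk inserted into the same physical slot take over automatically too
    zpool set autoreplace=on tank

As far as I remember it’s the ZFS event daemon (zed) that kicks a hot spare in automatically when a drive faults, so it’s worth checking that zed is actually running.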
I’m using two backplanes with drive caddies that go back via two SAS connectors to an HBA card, so it’s unlikely (but not impossible) to be a flaky connector; if it were, I’d expect the other drives on the shared backplane to be flaking out too. It’s possible that a caddy was loose tho.
No, more likely it was the drive itself: like I said, it was riddled with errors when I checked it!
Now, a spare in the pool? That would presume I’d had the foresight not to fill every drive slot with a drive. I have some vdev removal to do… (tho I need to figure out how much memory the remapping of a few TiB of data will cost me…)
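On the memory question, zpool remove has a dry-run flag that prints the estimated size of the indirect mapping table before committing to anything (pool and vdev names below are examples):

    # estimate the mapping-table memory cost without actually removing anything
    zpool remove -n tank mirror-1

    # if the estimate looks sane, run it for real and watch progress in the status output
    zpool remove tank mirror-1
    zpool status tank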
Don’t worry. When I have a failing drive, I add a spare. I put the spare on the floor of the case because I too have fully filled the case. You aren’t alone.