So, the regular scrub on my Ubuntu 24.04 box revealed that one disk of a mirrored vdev had errors and that the pool was in a degraded state. The same disk had shown some errors in ZFS the day before, which ZFS had healed, so I already had a replacement disk in the post. While I waited for the disk to arrive I shut the machine down.
I rebooted the machine today and, after it had been up for a few minutes, went to look at the pool. To my surprise the pool was showing as healthy, as a resilvering event had taken place.
My question is: is this normal? Does ZFS always try to cover over disk errors like this? It feels a little hinky that a pool that was showing as degraded is suddenly healthy again.
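For reference, this is roughly what I looked at after the reboot (the pool name is a stand-in for mine):

    # per-device error counters plus the resilver summary
    zpool status -v tank

    # the pool history also records the resilver and any clear/replace operations
    zpool history tank | tail -n 20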
I’m not 100% certain what does or doesn’t trigger an automatic resilver specifically, but I have seen the behavior you’re describing (“one reboot later, it sorts itself out”) plenty of times over the years.
One of the more common scenarios is that a disk drops off the SATA bus and ZFS marks the pool degraded. Then you bounce the box, the drive shows up on the bus again, and ZFS finds it, notes that the pool has been through some TXGs since the disk was last available, and fast-resilvers just the data written in the meantime back onto the drive.
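If you want to confirm that’s what happened on your box, this is roughly where I’d look (pool name is a placeholder, and the exact event names vary a bit between platforms):

    # ZED event log: look for device removal/attach and resilver start/finish events
    zpool events -v | less

    # kernel log: did the disk fall off the bus and reappear at boot?
    dmesg | grep -iE 'ata|sas|link (up|down|reset)'

    # the status output also shows how much data the resilver actually touched
    zpool status tank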
I’ve experienced similar phantom degradation of my pool in the past. It typically happens due to a loose or flaky connector. The first thing I do is shut down, reconnect all drives, reseat the SATA controller, and boot up and scrub.
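Concretely, once everything is reseated, my usual sequence looks something like this (pool name is a placeholder; skip the clear if you’d rather keep the error counters around as evidence):

    # reset the error counters now that the hardware has been reseated
    zpool clear tank

    # run a full scrub and keep an eye on its progress
    zpool scrub tank
    zpool status tank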
I also check SMART like you’ve done. I run LibreNMS with alerts enabled for pending or reallocated sectors; as soon as either count is nonzero I preemptively replace the drive.
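If you don’t have a monitoring stack handy, a quick manual check with smartmontools covers the same attributes (the device path is just an example):

    # pending/reallocated/uncorrectable sectors are the usual early-warning attributes
    smartctl -A /dev/sda | grep -Ei 'Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable'

    # or kick off a short self-test and read the result afterwards
    smartctl -t short /dev/sda
    smartctl -l selftest /dev/sda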
If you have a spare bay in your case and a spare connector, go ahead and put that shiny new drive in and add it to the pool as a hot spare. It’ll make replacement faster (automatic, even? I can’t recall) when you have another failure in the future.
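For what it’s worth, adding it is a one-liner (pool name and device path are examples; by-id paths survive device renumbering):

    # attach the new disk to the pool as a hot spare
    zpool add tank spare /dev/disk/by-id/ata-EXAMPLE_SERIAL

    # optional: let a fresh disk inserted into the same physical slot take over automatically too
    zpool set autoreplace=on tank

As far as I remember it’s the ZFS event daemon (zed) that kicks a hot spare in automatically when a drive faults, so it’s worth checking that zed is actually running.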
I’m using two backplanes with drive caddies that go back via two SAS connectors to an HBA card, so it’s unlikely (but not impossible) to be a flaky connector; if it were, I’d expect the other drives on the shared backplane to be flaking out too. It’s possible that a caddy was loose tho.
No, more likely it was the drive itself: like I said, it was riddled with errors when I checked it!
Now, a spare in the pool? That would presume I’d had the foresight not to fill every drive slot with a drive. I have some vdev removal to do… (tho I need to figure out how much memory the remapping of a few TiB of data will cost me…)
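On the memory question, zpool remove has a dry-run flag that prints the estimated size of the indirect mapping table before committing to anything (pool and vdev names below are examples):

    # estimate the mapping-table memory cost without actually removing anything
    zpool remove -n tank mirror-1

    # if the estimate looks sane, run it for real and watch progress in the status output
    zpool remove tank mirror-1
    zpool status tank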
Don’t worry. When I have a failing drive, I add a spare. I put the spare on the floor of the case because I too have fully filled the case. You aren’t alone.