A drive in my zpool had a "transient error". Do you agree with ChatGPT's assessment?

vortexsurfer · June 23, 2025, 4:46pm

Hi,
About a month ago I noticed read errors on one drive in my ZFS pool. The pool consists of 4 vdevs, each of which is a 2-drive mirror.

zpool status reported that the previous scrub (early in May) had corrected some errors (I believe it was a few megabytes worth).
So I ran extended SMART tests on both drives in the mirror, and one had some concerning results and a read failure. The other drive was fine, so I didn’t panic.
But, since I don’t have much experience with interpreting SMART errors, I asked ChatGPT for help.

Here’s the chat (with more details about the SMART errors):

As you can see near the bottom of the chat, I have since moved all the drives to a new computer/server, in a new case with better cooling, and I also replaced all SATA cables in the process. After doing all this I ran a new extended SMART test, and the drive now reports no errors! (I realize that making all these changes together makes it harder to pinpoint what the exact cause of the errors was.)

I have also ordered two new drives, which at first I intended to use to replace the failing drive + its mirror buddy.
But, since the “failing” drive now reports that it’s fine, I’m considering just adding the new drives to the pool as a new vdev, to gain a whole lot more free space (which I need).
ChatGPT thinks the old drives should be fine, and its assessment seems solid (to me), but I also don’t want to blindly trust AI… and I’ve never experienced this kind of “transient error” before.

Any input / thoughts on this situation? Do you agree with ChatGPT’s assessment? Should I replace the drives, or keep them a while longer? Can I trust the successful SMART test? The two drives were bought together and added to the pool at the same time, and have been running more or less 24/7 for almost 3 years. (And yes, I’m gonna set up automated regular SMART tests + alerts asap, which is something I’ve neglected for far too long…)

Thanks in advance for any advice or input

mercenary_sysadmin · June 23, 2025, 6:53pm

Transient errors aren’t all that uncommon, and frequently aren’t the fault of the drive itself: I find that I see CKSUMs caused by dodgy SATA cables more often than by failing drives.

Although I will note, back when I used cheaper drives (e.g. WD Blue or Green, Seagate Barracuda) I got a LOT more CKSUM errors, and serious drive failures early in the drives’ projected lifespan were a lot more frequent.
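
For keeping an eye on those counters, here is a minimal sketch in Python that picks out the devices whose READ, WRITE or CKSUM column in `zpool status` output is anything other than 0. The sample text, the pool name `tank`, the device names and the function name `devices_with_errors` are all made up for illustration, and the parser assumes the usual NAME / STATE / READ / WRITE / CKSUM table layout. It only tells you which device logged errors, not whether the drive or the cable was at fault.

```python
# Hypothetical sample of `zpool status` output; pool and device names are invented.
SAMPLE = """\
  pool: tank
 state: ONLINE
config:

\tNAME        STATE     READ WRITE CKSUM
\ttank        ONLINE       0     0     0
\t  mirror-0  ONLINE       0     0     0
\t    sda     ONLINE       0     0     0
\t    sdb     ONLINE       3     0    12

errors: No known data errors
"""


def devices_with_errors(status_text):
    """Return (name, read, write, cksum) for every row with a non-zero counter.

    Counters are kept as strings because zpool status may abbreviate
    large values (for example with a K suffix).
    """
    flagged = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        # The table starts right after the column header line.
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_config = True
            continue
        if not in_config:
            continue
        if not fields:
            # A blank line ends the config table.
            in_config = False
            continue
        if len(fields) < 5:
            # Section labels or rows without counters (e.g. spares).
            continue
        name, _state, read, write, cksum = fields[:5]
        if (read, write, cksum) != ("0", "0", "0"):
            flagged.append((name, read, write, cksum))
    return flagged


if __name__ == "__main__":
    for name, read, write, cksum in devices_with_errors(SAMPLE):
        print(f"{name}: READ={read} WRITE={write} CKSUM={cksum}")
```

On a real system the text would come from running `zpool status` (for example via `subprocess.run(["zpool", "status"], capture_output=True, text=True)`) instead of the hard-coded sample.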

vortexsurfer · June 24, 2025, 5:23pm

Good to know, thanks! The drive in question is a WD Ultrastar DC HC550.
The new drives I’ve ordered are Seagate Barracudas, since they are currently very cheap, but of course they’re not “NAS grade”. So maybe I’ll rethink how I’ll use those…
Looking at my pile of old dead drives, I see most of them are Barracudas or WD Greens. I had forgotten about those.