Can I trust the data from a degraded single drive?

I have a single drive that has started exhibiting SMART errors (32 Currently unreadable (pending) sectors, 32 Offline uncorrectable sectors, and just today the ATA error count increased from 0 to 3). I've bought 2 drives to replace it as a mirror; they've already been added to my existing pool, and I'm just running a scrub before removing the two smaller drives they are replacing. The plan was to syncoid the datasets from the failing drive over to the now-larger pool, but I'm now wondering whether the data on the dodgy drive can be trusted. Thoughts appreciated.
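For reference, the planned copy would be something along these lines (a sketch; the dataset and target pool names below are placeholders):

    syncoid --recursive trinity/data bigpool/data    # snapshot-based copy of a dataset and its children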
This is what I’m seeing after a scrub on the old drive:

--> zpool status -x
  pool: trinity
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Fri Feb 28 11:00:42 2025
	7.57T / 8.39T scanned at 282M/s, 4.78T / 8.39T issued at 178M/s
	176K repaired, 56.95% done, 05:54:21 to go
config:

	NAME                                STATE     READ WRITE CKSUM
	trinity                             DEGRADED     0     0     0
	  ata-ST16000NE000-2RW103_ZL217QRC  DEGRADED    20     0     3  too many errors

errors: No known data errors

The URL suggested above doesn't seem to apply, as it talks of raidz… The 'No known data errors' and the 176K repaired give me hope, but how can it repair from a single copy of the data?

Your zpool status shows three CKSUM errors on the second drive but none on the first. This means that three blocks read from the failing drive did not pass checksum validation, so those reads were retried on the second drive, where the same blocks passed their checksums.

Most likely, the pool also repaired the copies of those blocks on the failing drive while it was at it. But you won't know that for sure until you scrub the pool and hopefully come back with no errors.
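In sketch form, using your pool name from the output above:

    zpool scrub trinity     # re-read and checksum every block in the pool
    zpool status trinity    # then check the READ/WRITE/CKSUM counters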

You obviously have major problems on that side of the mirror, but keep in mind that the culprit may not be the drive; it might be the cable (or one of several other possible issues, but aside from the drive itself, the cable is the most common culprit).

Sorry, I mentioned a mirror but that’s the future state. Currently there is only one drive in this pool.

Oh, I see! Sorry, I didn’t look at your zpool status closely enough. Yeah, those errors are irreparable. If you’re lucky, they won’t be in anything you can’t afford to lose, but you need to get your data off that drive IMMEDIATELY.

The ideal way to do that would be to just zpool attach another drive to the one you've got, turning it into a mirror vdev. That won't repair the damage you've already taken, but it's probably the quickest way to get back to a healthy state.
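In sketch form, with a placeholder for the new disk's by-id path:

    zpool attach trinity ata-ST16000NE000-2RW103_ZL217QRC /dev/disk/by-id/<new-disk>
    zpool status trinity    # watch the resilver progress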

I have a one-disk pool (a pool in name only, with no redundancy) for mass storage on my server, but it backs up, via the awesomeness of zfs send, to my backup server with a 3-disk array. Should I get checksum errors, I would simply restore onto a new disk.
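The backup leg is roughly this shape (a sketch; the host and dataset names here are placeholders):

    zfs snapshot -r tank/data@nightly
    # -R includes descendants and snapshots; -F rolls the target back to match before receiving
    zfs send -R tank/data@nightly | ssh backupserver zfs receive -F backuppool/data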

This drive is toast. I am wondering: did the 10-hour scrub ever finish? Since you can see SMART data, I am guessing it's not a USB drive.

General question to anyone: will zpool status -v show the affected files? The output above says 3 checksum errors yet reports no known data errors, which confuses me.

The scrub did complete; the error counts stayed the same, and zpool status -v continued to show No known data errors… I guess the keyword there was 'known': I verified some of the data (a Time Machine backup) and it failed. I had hoped to salvage that, but I have another copy that's 2 weeks old, so it's fine. This drive was meant to replace my old 8TB that was doing Time Machine duties.
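For what it's worth, when ZFS can tie a checksum error to a specific file, zpool status -v names it at the bottom of the output, along these lines (illustrative, not from this system; the path is a placeholder):

    errors: Permanent errors have been detected in the following files:
            /trinity/path/to/damaged-file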
The dodgy drive is currently being zeroed and will be RMA'd tomorrow. The replacement mirror is up and running; we'll see if the old Time Machine backup passes validation.
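Zeroing it is just a straight overwrite; a sketch, where the device path is a placeholder to triple-check before running:

    # DANGER: irreversibly overwrites the target device
    dd if=/dev/zero of=/dev/sdX bs=1M status=progress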

Never a dull day with ZFS, and I mean that in the best possible way. I’ve had loads of fun the past few days sending/receiving datasets, destroying datasets, and zeroing a drive.

Consider ZFS native encryption for new datasets. If a drive is too damaged to be zeroed, at least you know your data is secure.
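It's a single property at dataset creation time; a sketch with a placeholder dataset name:

    # encryption must be enabled when the dataset is created; this prompts for a passphrase
    zfs create -o encryption=on -o keyformat=passphrase tank/secure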
