What could cause ZFS to kick a disk out of a pool?

Obviously, things like read/write errors… the drive dropping off the bus, but what else could be a possible cause for this? Are there any other pre-failure characteristics that ZFS looks for that wouldn’t show up in SMART data?

I have a disk that I recently replaced in my backup NAS at home. It was time for the disk to be replaced anyway, as it was a 6 year old WD Red near the end of its usable lifetime, but I'm not entirely clear as to what led to the disk being kicked out of the pool.

There were no recent hardware changes, so a loose or dodgy cable seems unlikely. Out of curiosity, I put the drive in an external dock that I have on hand so I could take a look at the SMART data, and didn't see anything that looked like a smoking gun.

Power-on hours were high, obviously, but there were no read errors, no spin retries, not even any recorded reallocated sectors. Without more extensive testing I'll probably never know for sure, but I'm curious whether anyone has any ideas.
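For reference, these are the attributes I was checking. A quick way to pull the usual pre-failure suspects out of a smartctl -A dump (the dump below is an abbreviated, made-up sample to show the filter, not the actual readout from my drive):

```shell
# Filter a SMART attribute dump down to the classic pre-failure indicators.
# The here-doc stands in for the output of:  sudo smartctl -A /dev/sdX
# (hypothetical values; UDMA_CRC errors in particular point at cabling, not the disk)
grep -E 'Reallocated_Sector|Spin_Retry|Current_Pending|Offline_Uncorrectable|UDMA_CRC' <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   119   103   000    Old_age   Always       -       31
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
EOF
```

All zeros in the raw value column, which is what I meant by no smoking gun.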

I've no interest in actually using a potentially dodgy drive; my interest is purely academic.

Did you check the system logs? I had a WD Red (WD30EFRX-68EUZN0) fail, but it was back when I was using MD RAID so I don't know how ZFS would handle this.

The SMART counters were unremarkable with the exception of 1 pending sector, but smartctl -a did list a bunch of logged errors that looked like this:

Error 251 occurred at disk power-on lifetime: 33237 hours (1384 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 b8 37 41 ee  Error: UNC 8 sectors at LBA = 0x0e4137b8 = 239155128

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 b8 37 41 ee 00   6d+16:52:09.166  READ DMA
  c8 00 08 b0 37 41 ee 00   6d+16:52:09.166  READ DMA
  ca 00 08 b0 37 41 ee 00   6d+16:52:09.166  WRITE DMA
  ef 10 02 00 00 00 a0 00   6d+16:52:09.165  SET FEATURES [Enable SATA feature]

There were also a lot of errors listed in dmesg output at the time.
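For anyone digging through their own logs, the libata errors I'm talking about look roughly like this. The excerpt below is a made-up illustration (not my actual log) piped through the same kind of grep you'd run against dmesg or journalctl -k:

```shell
# Filter kernel output for the ATA error patterns; the here-doc is a
# hypothetical excerpt standing in for:  dmesg  (or  journalctl -k)
grep -E 'exception|failed command|I/O error' <<'EOF'
ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata3.00: failed command: READ DMA
ata3.00: status: { DRDY ERR }
blk_update_request: I/O error, dev sdc, sector 239155128
EOF
```

Enough of those in a short window and ZFS (or MD) will give up on the device even though the SMART attribute counters still look clean.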

If you didn't see any errors in either of those, it's truly a mystery. If the HDD just stopped responding because its controller crashed, it might not log anything to SMART, but I would expect indications of that in the system logs.


Nope. It's also been months since the replacement; I just happened to still have the old drive on my desk and decided to look at it yesterday.

There are other drives using the same controller so I don’t think that’s it.

Next time I’ll have a closer look through the logs.