"48 offline uncorrectable sectors" - how much danger am I in?

TrueNAS emailed me overnight about one of my disks

New alerts:
* Device: /dev/da5 [SAT], 8 Currently unreadable (pending) sectors.

Current alerts:
* Device: /dev/da5 [SAT], 8 Currently unreadable (pending) sectors.

I kicked off a long SMART test about 6 hours ago, which has an ETA of 16 hours from now. Since then, the number of bad sectors has apparently risen to 48 according to the smartd entries in the logs:

May 24 09:52:39 freenas 1 2024-05-24T09:52:39.058105-05:00 freenas.home.lan smartd 1455 - - Device: /dev/da5 [SAT], 48 Currently unreadable (pending) sectors
May 24 09:52:39 freenas 1 2024-05-24T09:52:39.058140-05:00 freenas.home.lan smartd 1455 - - Device: /dev/da5 [SAT], 48 Offline uncorrectable sectors
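(For anyone curious, I've been keeping an eye on the test and the counters with something along these lines — smartctl -c includes the self-test execution status with a percent-remaining figure:)

smartctl -c /dev/da5 | grep -A1 'Self-test execution'          # long test progress
smartctl -A /dev/da5 | grep -E 'Pending|Offline_Unc|Realloc'   # the worrying counters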

This is a manufacturer recertified Seagate 16TB Exos X16 ST16000NM001G from serverpartdeals.com, and I am within the RMA period. The disk is part of a mirror vdev, so if it completely dies I should be okay unless I lose another drive before I replace it.

Am I reading the SMART output correctly that there are > 46k reallocated sectors? I’m used to running brand-new drives, so I’m not sure what the expected threshold of bad sectors is, especially on a drive this large.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   078   064   044    Pre-fail  Always       -       129039848
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       3
  5 Reallocated_Sector_Ct   0x0033   090   090   010    Pre-fail  Always       -       46768
  7 Seek_Error_Rate         0x000f   075   061   045    Pre-fail  Always       -       33807005
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1227
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       3
 18 Head_Health             0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   063   050   040    Old_age   Always       -       37 (Min/Max 30/38)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       1494
194 Temperature_Celsius     0x0022   037   045   000    Old_age   Always       -       37 (0 21 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Pressure_Limit          0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       492h+19m+10.080s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       1049009069
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       31299851767
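Side note: to pull just the reallocated count instead of the whole table, an awk one-liner like this seems to do it (attribute 5 is Reallocated_Sector_Ct, and the raw value is the tenth column):

smartctl -A /dev/da5 | awk '$1 == 5 {print $2, $10}'   # -> Reallocated_Sector_Ct 46768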

Yeah I would plan on replacing that drive ASAP.

The read error rate and seek error rate are also concerning; ideally both should be at 0.

Unfortunately this happens. I’ve recently gotten unlucky with a few of my refurbs as well, but I’ve generally had a good experience with serverpartdeals.

I think you know what I’m going to ask next… do you have a backup? :slight_smile:

The rest of the day will be spent verifying the replication task has been working properly!
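Probably by comparing the newest snapshots on both boxes, something along these lines (tank and backup-nas are stand-ins for my real pool and hostname):

zfs list -t snapshot -o name,creation -s creation -r tank | tail -n 5
ssh backup-nas zfs list -t snapshot -o name,creation -s creation -r tank | tail -n 5

If the most recent snapshot on the backup box matches the source, the replication task has been doing its job.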

Pretty sure that drive is toast, fam. Double-check that your backups are good immediately, and replace that drive ASAP.

Be warned that SMART data isn’t always what it appears to be. I would not necessarily take the read and seek errors as dealbreakers; those can and do occur on perfectly healthy drives. Different drive firmware also tends to store SMART data in different ways with different meanings, even when the attribute name is the same, so it’s difficult to know exactly what the values mean without technical documentation for your exact model of drive (which isn’t always available to consumers). That said, I definitely don’t like the looks of those raw values.


Yeah, if the seek and read errors were in the double digits or less, I’d probably brush those values off.

It’s a bit too high for me to write off as coincidence, especially with the offline sectors issue showing up at the same time.

Seagate read error rates always look high because of the way they’re calculated. You need to run them through this calculator:

Seagate Error Rate Calculator (i.wtf)

The OP’s SMART data (raw read value of 129039848) works out to zero actual errors.
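If you’d rather not trust a random website, the decode is simple to do yourself, assuming the usual Seagate layout (the top 16 bits of the 48-bit raw value are the error count, the bottom 32 bits are the operation count):

raw=129039848                                # OP's Raw_Read_Error_Rate
echo "errors: $(( raw >> 32 ))"              # top bits -> 0 actual errors
echo "operations: $(( raw & 0xFFFFFFFF ))"   # bottom bits -> total reads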

Luckily it’s in a ZFS pool on TrueNAS. ZFS has saved my data before :tada:

So then the next question is: how to safely remove the disk while keeping the pool running. Current plan is:

  • offline bad disk
  • shut down system
  • physically remove disk
  • power system back up

then when the replacement arrives:

  • power down system
  • install new disk
  • power up system
  • replace bad disk in pool with new disk
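If I understand the zpool side correctly, that boils down to something like this (tank is a stand-in for my actual pool name):

zpool offline tank da5   # take the failing disk out of service; the mirror runs degraded
# ...power down, swap hardware, power back up...
zpool replace tank da5   # one-argument form assumes the new disk appears at the same device node
zpool status tank        # watch the resilver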

Am I missing anything?

Nope, that’ll do her. Although I would probably recommend waiting to remove the old drive until after you have the new one; you never know when you’ll lose your “healthy” drive completely while your ailing drive is still limping along on its last legs… and if you’ve still got the ailing drive, you get to keep your pool, but if you’ve already removed it, welp.

(The downside to keeping a last-legs drive in the pool is that it might cause performance problems.)

Normally, yes, I would wait to remove the bad one until I have a replacement, but this will be an RMA with serverpartdeals, so I’ll have to ship the bad one back and wait.

I did verify that the pool is replicated to my “remote” truenas box which is unfortunately still located in my house. I’ll just have to be extra careful when using the stove until the new drive comes so I don’t burn down both copies of the data :wink:
