"48 offline uncorrectable sectors" - how much danger am I in?

TrueNAS emailed me overnight about one of my disks

New alerts:
* Device: /dev/da5 [SAT], 8 Currently unreadable (pending) sectors.

Current alerts:
* Device: /dev/da5 [SAT], 8 Currently unreadable (pending) sectors.

I kicked off a long SMART test about 6 hours ago which has an ETA of 16 hours from now. Since then, the number of bad sectors has apparently risen to 48 according to the dmesg logs:

May 24 09:52:39 freenas 1 2024-05-24T09:52:39.058105-05:00 freenas.home.lan smartd 1455 - - Device: /dev/da5 [SAT], 48 Currently unreadable (pending) sectors
May 24 09:52:39 freenas 1 2024-05-24T09:52:39.058140-05:00 freenas.home.lan smartd 1455 - - Device: /dev/da5 [SAT], 48 Offline uncorrectable sectors

This is a manufacturer recertified Seagate 16TB Exos X16 ST16000NM001G from serverpartdeals.com and I am within the RMA period. The disk is part of a mirror vdev, so if it completely dies I should be okay unless I lose another before I replace it.

Am I reading the SMART output correctly that there are > 46k reallocated sectors? I’m used to running brand new drives so I’m not sure what the expected threshold of bad sectors is, especially on a drive this large.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   078   064   044    Pre-fail  Always       -       129039848
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       3
  5 Reallocated_Sector_Ct   0x0033   090   090   010    Pre-fail  Always       -       46768
  7 Seek_Error_Rate         0x000f   075   061   045    Pre-fail  Always       -       33807005
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1227
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       3
 18 Head_Health             0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   063   050   040    Old_age   Always       -       37 (Min/Max 30/38)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       1494
194 Temperature_Celsius     0x0022   037   045   000    Old_age   Always       -       37 (0 21 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Pressure_Limit          0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       492h+19m+10.080s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       1049009069
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       31299851767
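For anyone following along, this is roughly how I plan to keep an eye on just the counters I care about until the replacement arrives. It's a quick sketch only: it assumes smartmontools 7 or newer (for the --json output and its attribute layout), and the device path is specific to my box.

# Sketch: poll the SMART counters of interest on /dev/da5.
# Assumes smartmontools 7+ for --json; adjust the device path as needed.
import json
import subprocess

WATCHED = {5: "Reallocated_Sector_Ct",
           197: "Current_Pending_Sector",
           198: "Offline_Uncorrectable"}

out = subprocess.run(
    ["smartctl", "--json", "-A", "/dev/da5"],
    capture_output=True, text=True, check=False,  # smartctl exits non-zero on ailing drives
).stdout

for attr in json.loads(out)["ata_smart_attributes"]["table"]:
    if attr["id"] in WATCHED:
        print(f'{attr["id"]:>3} {attr["name"]:<24} raw={attr["raw"]["value"]}')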
1 Like

Yeah I would plan on replacing that drive ASAP.

The read error rate and seek error rate are also concerning; ideally both of those should be at 0.

Unfortunately this happens; I’ve recently gotten unlucky with a few of my refurbs as well, but I’ve generally had a good experience with ServerPartDeals.

I think you know what I’m going to ask next… do you have a backup? :slight_smile:

1 Like

The rest of the day will be spent verifying the replication task has been working properly!
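My rough plan for that check (a sketch only; the dataset name and remote hostname are placeholders, and it assumes snapshot-based replication) is to compare the newest snapshot on each box:

# Sketch: compare the newest snapshot on the source pool and the replica.
# "tank/data" and "backup-nas" are placeholders for my actual dataset and remote box.
import subprocess

def newest_snapshot(dataset, host=None):
    cmd = ["zfs", "list", "-H", "-t", "snapshot", "-o", "name", "-s", "creation", "-r", dataset]
    if host:
        cmd = ["ssh", host] + cmd
    snaps = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout.split()
    return snaps[-1] if snaps else "(no snapshots found!)"

print("source :", newest_snapshot("tank/data"))
print("replica:", newest_snapshot("tank/data", host="backup-nas"))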

1 Like

Pretty sure that drive is toast, fam. Double-check that your backups are good immediately, replace that drive ASAP.

Be warned that SMART data isn’t always what it appears to be. I would not necessarily take the read and seek errors as dealbreakers; those can and do occur in perfectly healthy drives. Different drive firmware also tends to store SMART data in different ways with different meanings, even when the attribute name is the same, so it’s difficult to know exactly what the values mean without access to technical documentation for your exact model of drive (which isn’t always consumer-available). But I definitely don’t like the looks of those raw values.

4 Likes

Yeah, if it was double digits or less in the seek and read errors, I’d probably brush those values off.

It’s a bit too high for me to write off as coincidence, especially with the offline sectors issue showing up at the same time.

Seagate read and seek error rates always look high because of the way they’re calculated. You need to run the raw values through this calculator:

Seagate Error Rate Calculator (i.wtf)

Plugging in the OP’s raw read value (129039848) gives zero actual errors.

Luckily it’s in a ZFS pool on TrueNAS, which has saved my data before :tada:
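For anyone who would rather not rely on the website: my understanding (the commonly cited interpretation, not anything from official Seagate documentation) is that these raw values pack two counters into one number, with the total number of operations in the low 32 bits and the actual error count in the bits above them. A rough decode:

# Decode Seagate Raw_Read_Error_Rate / Seek_Error_Rate raw values.
# Assumed interpretation: low 32 bits = total operations, upper bits = error count.
def decode_seagate_error_rate(raw):
    errors = raw >> 32
    operations = raw & 0xFFFFFFFF
    return errors, operations

for raw in (129039848, 33807005):   # the OP's read and seek raw values
    errors, ops = decode_seagate_error_rate(raw)
    print(f"raw={raw}: {errors} errors over {ops} operations")
# Both raw values fit entirely in the low 32 bits, so the decoded error count is 0.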

2 Likes

So then the next question is: how to safely remove the disk while keeping the pool running. Current plan is:

  • offline bad disk
  • shut down system
  • physically remove disk
  • power system back up

then when the replacement arrives:

  • power down system
  • install new disk
  • power up system
  • replace bad disk in pool with new disk

Am I missing anything?
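In zpool terms, I’m assuming the two pool-facing steps boil down to roughly the following (a sketch only: “tank” and the disk identifiers are placeholders, and on TrueNAS the Storage UI’s offline/replace actions do the equivalent work):

# Sketch of the pool-side steps; pool name and disk identifiers are placeholders.
# Usage: python3 replace_disk.py offline   (before pulling the failing disk)
#        python3 replace_disk.py replace   (after the new disk is installed)
import subprocess
import sys

POOL = "tank"
OLD_DISK = "gptid/failing-disk"   # placeholder identifier
NEW_DISK = "da5"                  # placeholder identifier

def zpool(*args):
    subprocess.run(["zpool", *args], check=True)

if sys.argv[1] == "offline":
    zpool("offline", POOL, OLD_DISK)            # take the failing disk out of service
elif sys.argv[1] == "replace":
    zpool("replace", POOL, OLD_DISK, NEW_DISK)  # resilver onto the new disk
    zpool("status", POOL)                       # watch resilver progress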

2 Likes

Nope, that’ll do her. Although I would probably recommend waiting to remove the old drive until after you have the new one; you never know when you’ll lose your “healthy” drive completely while your ailing drive is still limping along on its last legs… and if you’ve still got the ailing drive, you get to keep your pool, but if you’ve already removed it, welp.

2 Likes

(The downside to keeping a last-legs drive in the pool is that it might cause performance problems.)

2 Likes

Normally, yes, I would wait to remove the bad one until I have a replacement, but this will be an RMA with ServerPartDeals, so I’ll have to ship back the bad one and wait.

I did verify that the pool is replicated to my “remote” truenas box which is unfortunately still located in my house. I’ll just have to be extra careful when using the stove until the new drive comes so I don’t burn down both copies of the data :wink:

3 Likes

Sorry to hear you’re having a not-great time, but thanks for posting this thread. I just learned some new stuff about reading SMART reports. … And I’ve also learned that apparently Seagates are overeager in their SMART reporting and you need the secret decoder ring @karl mentioned.

Glad to hear you’ve got a healthy replica. The built-in replication feature is one of the things that made me determined to get my head around ZFS when my head really didn’t want to be a team player and learn ZFS. :stuck_out_tongue:

“Am I reading the SMART output correctly that there are > 46k reallocated sectors? I’m used to running brand new drives so I’m not sure what the expected threshold of bad sectors is, especially on a drive this large.”

I’ve had excellent luck with used enterprise drives sold by reputable eBay dealers (server parts/corporate installation liquidators with tens of thousands of positive feedbacks plus a warranty policy provided by the seller). More often than not, it seems (from the SMART stats, at least) that a lot of the used enterprise drives on the market sat in a server for years being barely used for writes and moderately used for reads, spending the bulk of their time idling.

At the sizes and specs I want, used enterprise has been the only real economical way to go for the number of drives I have. It’s definitely not something to be scared of, but you should certainly purchase from the most reputable reseller that has the disk(s) you want.

Just as a point of comparison, the only disk I’ve had fail since building my first storage machine (or rather, buying my first QNAP) was a brand new WD Gold 14 TB that I managed to get sealed in box at “someone died under mysterious circumstances while in the room with this” pricing. And my HDD NAS lives in my bedroom where I sleep, so it’s got the best environment I can manage. So, brand new drives can go sideways, too.

And as a bonus, I now have my Ph.D. in The Western Digital 10,000 Point RMA Process.

What the actual f%$k.

Thanks for the link. Why do they insist on making this so needlessly complicated? What’s the point, and who is this benefiting?

This actually makes me irrationally angry.

I would like to use ServerPartDeals, but I haven’t found a similar site in the UK and shipping from the US was pricey, so I too buy from eBay. I have bought 3 enterprise drives from eBay sellers, applying a similar strategy: IT recyclers with strong feedback and lots of items listed (bonus if they have the Offer button). So far, all have been good. My two wins were 2020 drives with warranty remaining and 0 SMART errors. The drives I have show long run times and low start/stop counts (sometimes only 1 or 2); I believe it’s better to keep a drive running than to power cycle it frequently.

Prior to this I used to buy new 2TB drives, but drive prices have increased (thanks, AI) and my needs have grown to 4TB (got to be careful with SMR), which means I’m looking at £140 new. So I figure £37 for a used enterprise drive, mirrored in ZFS, with a backup (ZFS replication) and a copy in the cloud, is more than adequate.

My strategy for drive expansion/replacement is to attach a third drive to the mirror, then (if the old drive is still working) split it off and keep it as an archive copy, or otherwise simply detach it afterwards.

I have only used mirrors in ZFS.
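For anyone curious what that attach-then-split approach looks like at the command level, here is a rough sketch (the pool name and device identifiers are made-up placeholders; on TrueNAS you would normally drive this from the GUI):

# Sketch of the attach-then-split sequence; "tank", "tank-archive" and the
# gptid/... names are placeholders, not identifiers from this thread.
import subprocess

# Grow the two-way mirror into a three-way mirror with the new disk:
subprocess.run(["zpool", "attach", "tank", "gptid/existing-member", "gptid/new-disk"], check=True)

# Once the resilver has finished (watch "zpool status tank"), either split the
# old disk off into its own single-disk pool to shelve as an offline archive copy:
#   zpool split tank tank-archive gptid/old-disk
# or, if the old disk is being retired, simply detach it:
#   zpool detach tank gptid/old-disk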

1 Like

This is my usual experience: tens of thousands of hours of runtime and fewer than a dozen start/stops. I have indeed read from multiple sources that frequent power cycling is to be avoided if possible.

I’m still unclear on whether messing with the spin up/spin down settings for the drives is worthwhile, but I think that’s probably a topic for another thread. :slight_smile:

1 Like