I know shit happens but does it have to happen today?

So I got two new SSDs to place in front of my spinning rust as a special vdev for metadata and small blocks.
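For anyone wanting to do the same, a hedged sketch of what that looks like; pool name, dataset name, and device paths here are all made up, and the special vdev should be a mirror since losing it loses the pool:

```shell
# Add a mirrored special vdev of two SSDs to the pool (hypothetical names).
zpool add tank special mirror \
  /dev/disk/by-id/ata-SSD_A /dev/disk/by-id/ata-SSD_B

# Also send small file blocks (not just metadata) to the special vdev;
# the 64K cutoff is an illustrative choice, tune per dataset.
zfs set special_small_blocks=64K tank/data
```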

Then I notice this one drive in zpool status:

       NAME                                   STATE     READ WRITE CKSUM
           ata-HGST_HUH728080ALN600_ABCDWXYZ  ONLINE     26      0    12

and not long after:

       NAME                                   STATE     READ WRITE CKSUM
           ata-HGST_HUH728080ALN600_ABCDWXYZ  ONLINE     117     1    71
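If the counters had been a one-off (a knocked cable, say), they could be reset and the pool re-checked; a sketch, assuming the pool is named tank:

```shell
zpool clear tank          # reset the READ/WRITE/CKSUM counters
zpool scrub tank          # re-read everything to see if errors come back
zpool status -v tank      # watch the counters climb again (or not)
```

In this case the counters kept climbing, so on to SMART.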

I asked around over at STH and dug around myself some more.

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     14067         1468064631
# 2  Short offline       Completed: read failure       90%     14067         1468064631
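For reference, a log like the one above comes from kicking off the drive's self-tests and reading the results back, something along these lines (using the device path from above):

```shell
# Start the short and extended offline self-tests, then read the log.
smartctl -t short /dev/disk/by-id/ata-HGST_HUH728080ALN600_ABCDWXYZ
smartctl -t long  /dev/disk/by-id/ata-HGST_HUH728080ALN600_ABCDWXYZ
smartctl -l selftest /dev/disk/by-id/ata-HGST_HUH728080ALN600_ABCDWXYZ
```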

So I create a partition starting just before sector 1468064631 and I let this run for an hour or so:

# badblocks -b 4096 -c 1024 -s /dev/disk/by-id/ata-HGST_HUH728080ALN600_ABCDWXYZ-part1
Checking for bad blocks (read-only test): 9 0.00% done, 0:04 elapsed. (0/0/0 errors)
100.00% done, 0:07 elapsed. (1/0/0 errors)
110.00% done, 0:09 elapsed. (2/0/0 errors)
120.00% done, 0:11 elapsed. (3/0/0 errors)
130.00% done, 0:13 elapsed. (4/0/0 errors)

Nothing but errors, so I cancelled it.
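For what it's worth, the failing LBA translates into a byte offset for placing the partition; this assumes the LBA is counted in the drive's native 4096-byte sectors (it's a 4Kn model, which also matches the `-b 4096` above):

```shell
LBA=1468064631
SECTOR=4096   # 4Kn drive: logical sector size is 4096 bytes (assumption)

echo $((LBA * SECTOR))                # byte offset of the first failing sector
echo $((LBA * SECTOR / 1073741824))   # roughly how many GiB into the disk
```

That puts the bad spot around 5600 GiB in, i.e. about 70% of the way through the 8TB disk.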

Also this:

  7 Seek_Error_Rate         0x000b   066   066   067    Pre-fail  Always   FAILING_NOW 1966194

So yeah, it’s dead. Stuff happens and that’s okay, except I didn’t have an 8TB spare …

I have some 4TB disks tucked away somewhere, which together are just about big enough to hold all the data in a raidz1.

Threw those in, syncoid -r ...
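For anyone following along, the replication itself is a one-liner per pool; the pool names here are made up:

```shell
# -r recursively snapshots and sends every child dataset
# from the failing pool to the stopgap raidz1 pool.
syncoid -r tank rescue
```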

That was last night …

Playing around with the new SSDs, somewhat disappointed with the performance I am seeing and wondering what might be causing that … I get an alarm from one of the 4TB’s:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   083   083   016    Pre-fail  Always       -       5636444
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       3
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       3

Along with a bunch of ATA errors in the SMART error log.

So they are old (> 45000 hours) but really, did it have to happen today?

I am just done syncing one pool, and now I have to do it all over again.

I jumped the gun and ordered a box of SSDs to get rid of this rust once and for all.

I feel for you but hey, you are going to have SSDs whizzing around! If this hadn’t happened, that would have taken ages.

Separately, please check power + cables - I’m a bit suspicious.


Check your cables, and by check, I really mean “preemptively replace.”

I’ve seen ten times as many dodgy SATA cables as actual bad drives. Don’t get me wrong, drives ABSOLUTELY do fail… Just not as frequently as cables do, and particularly when it’s an issue of inconsistent errors.


Thanks for thinking along!

You’re absolutely right about those cables! I’ve had more troubles due to bad SATA cables than from bad drives.

I have 3 pairs of SFF-8643 to SATA breakout cables and an M1015 with 2 cables. I have swapped cables around and it’s the same disk that keeps getting errors.

With regard to the PSU, if that were the issue (and I’ve dealt with that before as well), there would be SATA errors in the kernel log and the issue would jump to a random disk after idle or a reboot. I have even had disks disappear altogether.

This issue is stuck to this drive. I even put it in another machine. It’s dead all right.

Just got confirmation my SSDs will be delivered tomorrow.
