How do you locate a failed drive?

What is your preferred method for identifying the physical location of failed drives in an array? Some of these may depend on the level of hardware support, while others will generally be applicable regardless of hardware (e.g., a manually prepared map of drives). Some examples:

  1. Using visible labels with serial numbers (prelabelled on the chassis/tray if necessary)
  2. Pre-catalogued map of drive serial numbers, etc., to locations in enclosure
  3. Using the bay/tray position within the enclosure
  4. Illuminating identification or fail lights (e.g., with ledctl) if supported on the hardware

I label the front, visible part of the drive, carrier, or sled with the serial number, or just its last 4 characters. Then I use /dev/disk/by-id to identify the physical device. It helps if the system supports hot-swap.

ls /dev/disk/by-id

ata-ST12000VN0007-2GS116_ZJV1T8AT

Of course, the serial number is easily identifiable on the device.

zpool status
  mirror-3                             ONLINE  0  0  0
    ata-ST12000VN0007-2GS116_ZJV1T8AT  ONLINE  0  0  0

The WWN seems to work equally well.
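For anyone who wants to script this lookup, here is a minimal sketch. It assumes Linux udev naming of the form bus-model_serial, as in the entry above. The function names `serial_from_id` and `find_by_serial_suffix` are made up for illustration and are not part of any tool.

```shell
#!/bin/sh
# Print the serial from a by-id name by stripping everything up to the
# last underscore. Assumes the "<bus>-<model>_<serial>" naming shown above.
serial_from_id() {
    printf '%s\n' "${1##*_}"
}

# List whole-disk by-id links (skipping -partN links) whose name ends in
# the given suffix, with the device node each link resolves to.
# The optional second argument overrides the directory to search.
find_by_serial_suffix() {
    suffix=$1
    dir=${2:-/dev/disk/by-id}
    for link in "$dir"/*; do
        [ -e "$link" ] || continue
        name=${link##*/}
        case $name in
            *-part[0-9]*) continue ;;
        esac
        case $name in
            *"$suffix") printf '%s -> %s\n' "$name" "$(readlink -f "$link")" ;;
        esac
    done
}
```

With the last 4 characters written on the sled, something like `find_by_serial_suffix T8AT` should print the matching by-id name and its kernel device, if one is present.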

Sorry about the formatting

My workplace has a mix of everything but #1. #2 we do with aliases in vdev_id.conf, and #3 with channels, e.g.

multipath yes
channel 03:00.0 0 shelf18c
channel 03:00.0 1 shelf18d
channel 04:00.0 0 shelf18c
channel 04:00.0 1 shelf18d

resulting in

$ ls /dev/disk/by-vdev
shelf18c24  shelf18c32  shelf18c40  shelf18c48
...
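For #2, an alias entry in vdev_id.conf has the form `alias <name> <devlink>`. Below is a rough sketch of generating those lines from a hand-made list of "name by-id-link" pairs. The `make_aliases` helper and any tray names fed to it are hypothetical, not something from our actual config.

```shell
#!/bin/sh
# Read "<alias-name> <by-id-link-name>" pairs from stdin and emit one
# vdev_id.conf alias line per pair. Blank lines and lines starting
# with '#' in the input are skipped.
make_aliases() {
    while read -r name id; do
        case $name in
            ''|'#'*) continue ;;
        esac
        printf 'alias %s /dev/disk/by-id/%s\n' "$name" "$id"
    done
}
```

The output can be appended to /etc/zfs/vdev_id.conf. The by-vdev links are normally regenerated with `udevadm trigger`, but check that against your own distribution before relying on it.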

My coworker likes to use ledctl to be extra certain. My strategy for extra certainty is to pull the disk from the identified slot and confirm, before replacing it, that the host reacts as expected: for a failing disk, the host should no longer see it; for a disk that has already failed, nothing should change. In the few cases where a coworker or I have pulled a mislabeled or misidentified disk, ZFS quickly fixed the mistake after the disk was reinserted.

I had a similar question, answered in another post.

Question about drive labels - OpenZFS - Practical ZFS

I also printed out the serial number on the front side of each drive, so I can find them more easily without squinting at small serial numbers on the top :slight_smile:

I partition my disks and put a label on the partition with the disk slot number.

If a drive fails it will be obvious in zpool status which slot the dead drive is in.
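A sketch of what that labelling could look like, assuming Linux, GPT partitions, and sgdisk (on other systems the tool will differ). The `slot_label_cmd` helper and the slotNN naming are made up for illustration. It only prints the command, so nothing is changed until you run the output yourself.

```shell
#!/bin/sh
# Print (do not run) an sgdisk command that names a partition after its slot.
# Arguments: slot number (no leading zeros), disk device, partition number
# (defaults to 1). The "slotNN" label format is an arbitrary choice.
slot_label_cmd() {
    slot=$1
    disk=$2
    part=${3:-1}
    label=$(printf 'slot%02d' "$slot")
    printf 'sgdisk --change-name=%s:%s %s\n' "$part" "$label" "$disk"
}
```

For example, `slot_label_cmd 3 /dev/sdX` prints `sgdisk --change-name=1:slot03 /dev/sdX`. With udev, the labelled partition then appears under /dev/disk/by-partlabel/, and a pool built from those paths shows the slot name in zpool status.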

I, uh, wrote a script that uses dd to read a couple of gigs off the disk in a loop with a short delay between reads. Then I just look for the steadily blinking activity light, and that’s usually the drive I’m looking for. And if it’s not, well, I get to test whether the resilver function is working correctly. :grinning:
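Not the actual script, but a minimal sketch of the same idea. The `blink_disk` name, the defaults, and the argument order are my own assumptions. It only reads from the device and discards the data.

```shell
#!/bin/sh
# Repeatedly read a chunk from the start of a device so its activity
# light blinks in a steady pattern.
# Arguments: device, number of passes, MiB per pass, pause in seconds.
blink_disk() {
    dev=$1
    passes=${2:-30}
    mib=${3:-2048}
    pause=${4:-2}
    i=0
    while [ "$i" -lt "$passes" ]; do
        # Read-only: output goes to /dev/null. Repeat reads may be served
        # from the page cache; with GNU dd, iflag=direct is one way around that.
        dd if="$dev" of=/dev/null bs=1048576 count="$mib" 2>/dev/null || return 1
        i=$((i + 1))
        sleep "$pause"
    done
}
```

Something like `blink_disk /dev/sdX 30 2048 2` (with the suspect device in place of the placeholder sdX) reads about 2 GiB, pauses 2 seconds, and repeats 30 times.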

This is a variation on what others have written. I think it is FreeBSD-specific and requires the controller to be in IT mode. I used to do the label thing, but there’s really no need.

I add disks by serial number (/dev/diskid/DISK-) so it’s clear what the serial is on the failing disk, then use sesutil. “sesutil show” shows which slot each “da” device is in, along with its serial number. Once you determine which da device you want, you can use “sesutil fault daXX on” or “sesutil locate daXX on”. On most enclosures, fault is solid red or amber, and locate makes the same light flash.

This doesn’t work so well if the drive has failed completely and dropped off the bus, but then “sesutil show” will show an empty slot. You can turn on the locate lights for the drives on either side of it and find it that way. Obviously it’s hard to get the serial number out of “zpool status” if the drive has dropped out completely, but in that case the empty slot in sesutil tells you where it is.

Home hobby user here. I made a drive spreadsheet when I set up the pool, and it’s already been very handy. The drive trays are labeled 1-8, and the spreadsheet ties the drive tray number to the serial number and WWN.

After the initial setup, I noticed that some of the locations ran hotter than others. The temperature and serial number are handily right there together in the SMART data, and the sheet translates the serial number to a drive tray. The cause turned out to be differences in the size of the air holes in the backplane, so I rearranged the 8 drives on the 24-slot backplane for best airflow.
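The same lookup can be done from the command line if the sheet is exported as CSV. This is a minimal sketch that assumes a header row and the column order tray,serial,wwn. The file layout and the `tray_for_serial` name are assumptions for illustration, not my actual sheet.

```shell
#!/bin/sh
# Print the tray number for a given serial from a CSV export of the sheet.
# Assumed layout: a header line, then rows of "tray,serial,wwn".
# Exits non-zero if the serial is not found.
tray_for_serial() {
    awk -F, -v s="$1" '
        NR > 1 && $2 == s { print $1; found = 1 }
        END { exit !found }
    ' "$2"
}
```

The serial to look up can be taken from the "Serial Number" line of `smartctl -i` for the drive in question, then passed in as `tray_for_serial SERIAL drives.csv`, where drives.csv is a placeholder file name.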

Had I tied the spreadsheet to a chassis location, moving things around would have required editing the sheet, a step that could be missed (human factors), so that may be a plus for marking the trays.

Marking the serial number on the face of the tray might be handiest of all, no sheet required.

What is your preferred method for identifying the physical location of failed drives in an array?

I use the purely visible labels method, which has worked out quite well in the past. For large installations, e.g. 19" racks or data centers, I prefer the blinking LED method by far.