But do you know of anything I could do to diagnose further when all IO seems to hang? Any command I could try that doesn’t just hang forever?
Well, you could bypass the filesystem layer entirely to see whether the device itself has permanently hung, or if it’s just a particular I/O thread that died (and your filesystem was absolutely not prepared for the possibility of that thread dying, and therefore has not recovered from its unexpected death).
Let’s pretend your drive’s WWN is “wwn-qvo”. Try the following while the pool is unresponsive:
root@yourbox:~# pv < /dev/disk/by-id/wwn-qvo > /dev/null
If pv hangs, you’re almost certainly looking at the entire drive dropping off the bus and staying there, as opposed to the “the drive is working but the filesystem stopped” scenario, which would imply that the drive is periodically dropping off the bus but then coming back on its own.
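If you’d like a version of that test which won’t normally leave you staring at a hung prompt forever, you can put timeout in front of it. This is just a sketch, using the same pretend wwn-qvo WWN:
root@yourbox:~# timeout 30 pv < /dev/disk/by-id/wwn-qvo > /dev/null
A healthy drive will stream data until timeout cuts the read off at the 30 second mark; a drive that has dropped off the bus will show no throughput at all, so either way you have your answer inside half a minute. (One caveat: if the read is wedged in uninterruptible kernel I/O, timeout may not actually manage to kill the pv process, but zero throughput on pv’s progress meter is itself the diagnosis.)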
In my experience, the kind of symptoms you are describing indicate a flaky drive that is dropping off the bus, and again in my experience, the failure recovery on that typically requires a system reboot. But hey, hardware is weird, and sometimes it does unusually weird things… so like I said, troubleshoot using pv directly against the metal, bypassing every last bit of the filesystem stack.
Obviously, do not target the drive with pv, or you will destroy your data. The pipe goes from the drive to /dev/null, which is guaranteed non-destructive. And as long as /dev/null is one end of the pipe, even screwing up the direction should be safe: targeting the drive with real data is absolutely a never-come-back scenario, but /dev/null won’t produce any data, so getting this particular pipe backwards won’t actually write anything to your drive.
If you ever start targeting a second drive instead of targeting /dev/null, though, you’d better be damn sure to measure twice and cut once!
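If the bare < and > redirection arrows make you nervous, a bounded read with dd works too, and its named if= and of= parameters are much harder to accidentally reverse. Again, just a sketch with the pretend wwn-qvo WWN, reading the first 1GiB and nothing more:
root@yourbox:~# dd if=/dev/disk/by-id/wwn-qvo of=/dev/null bs=1M count=1024 status=progress
The count= limit means it finishes on its own when the drive is healthy, and of=/dev/null keeps it every bit as non-destructive as the pv version.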
possibly finding some kind of error I could refer to in an RMA
Have you tried piping dmesg through grep after the pool hang occurs? If the pool has hung and it’s on the same drive as your system logs, your system won’t be able to actually save errors in the log… but they should still show up in dmesg while the system itself is running.
For this, you’ll need to know the bare devicename of the drive. We’ll pretend again that the WWN of your drive is wwn-qvo and that you’ve properly added it to the pool using that WWN, rather than the bare devicename. First, make sure you’ve got your WWN correct by checking zpool status, then refer to /dev/disk/by-id to get its bare devicename:
root@box:~# zpool status | grep wwn
wwn-qvo-part3 ONLINE 0 0 0
root@box:~# ls -l /dev/disk/by-id | grep wwn-qvo | grep -v part
lrwxrwxrwx 1 root root 10 Feb 6 04:11 wwn-qvo -> ../../sda
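If you’d rather not eyeball ls output, readlink will resolve that symlink for you directly (same pretend WWN):
root@box:~# readlink -f /dev/disk/by-id/wwn-qvo
/dev/sda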
Now that we know that the current bare devicename of wwn-qvo is sda, we can check dmesg for any information about it during the current system session:
root@box:~# dmesg | grep sda
[ 1.095439] sd 1:0:0:0: [sda] 500118192 512-byte logical blocks: (256 GB/238 GiB)
[ 1.095516] sd 1:0:0:0: [sda] Write Protect is off
[ 1.095521] sd 1:0:0:0: [sda] Mode Sense: 00 3a 00 00
[ 1.095557] sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 1.098319] sda: sda1 sda2 sda3
[ 1.098754] sd 1:0:0:0: [sda] Attached SCSI disk
[ 3.058381] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
[ 3.600933] EXT4-fs (sda2): re-mounted. Opts: errors=remount-ro. Quota mode: none.
In particular, notice the line where the system first brought the device online:
root@box:~# dmesg | grep sda | grep Attached
[ 1.098754] sd 1:0:0:0: [sda] Attached SCSI disk
This is an even lower level of “bare device identifier”: the SCSI (in this case, pseudo-SCSI) path layer. We can check dmesg again to see if it had anything else to say about that hardware path:
root@box:~# dmesg | grep "sd 1:0:0:0"
[ 1.094996] sd 1:0:0:0: Attached scsi generic sg0 type 0
[ 1.095439] sd 1:0:0:0: [sda] 500118192 512-byte logical blocks: (256 GB/238 GiB)
[ 1.095516] sd 1:0:0:0: [sda] Write Protect is off
[ 1.095521] sd 1:0:0:0: [sda] Mode Sense: 00 3a 00 00
[ 1.095557] sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 1.098754] sd 1:0:0:0: [sda] Attached SCSI disk
For the most part, you can expect hardware I/O errors to get logged with the normal bare devicename, e.g. /dev/sda. But as you can see above, there’s the occasional extremely low-level log message that only uses the SCSI path identifier: in the example above, it’s the very first Attached message (“Attached scsi generic sg0”), which happens before the system has even identified the capacity or capabilities of the device.
When you have a drive that’s dropping off the SATA bus, and a system that is still running except when you try to do anything at all that touches the bus, you can use dmesg like this to identify both the failure and when the failure happened.
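When the drive really has fallen off the bus, the kernel is usually pretty chatty about it. A quick way to fish the complaints out, assuming sda is still the bare devicename, is to grep for the usual error keywords:
root@box:~# dmesg | grep -E "sda|sd 1:0:0:0|ata[0-9]" | grep -iE "error|timeout|reset|offline|fail"
Anything that turns up there, messages along the lines of link resets, command timeouts, or “rejecting I/O to offline device”, is exactly the sort of concrete evidence you can quote in an RMA request.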
Note that the timestamps in raw dmesg output are in seconds since the running kernel booted, which may not be the most convenient thing to build a human-readable timeline from. You can either work out the kernel’s boot time from uptime and add those offsets to it, or you can just pass dmesg the -T argument to attempt to make it output in localtime format:
root@box:~# dmesg -T | grep "sd 1:0:0:0" | head -n1
[Tue Feb 6 04:11:48 2024] sd 1:0:0:0: Attached scsi generic sg0 type 0
However, note that these human-readable timestamps may be inaccurate, since they do not account for time spent in suspend/resume states. The only thing that dmesg knows for sure is how many seconds of runtime the kernel had on the odometer at the time the event happened; it does not know what the actual localtime() was at the time of the event. That gets looked up separately by the syslog facility as the message is headed to disk from dmesg, as I understand it, so you can rely on human-readable timestamps from syslog, but from dmesg they’re essentially just a best guess.
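If you do want to turn a raw dmesg offset into wall-clock time by hand, the arithmetic is just boot time plus offset. A rough sketch in plain shell, using the 1.098754 second offset from the Attached line above (the same suspend/resume caveat applies, since the kernel’s log clock and your wall clock don’t track suspended time the same way):
root@box:~# boot=$(( $(date +%s) - $(cut -d. -f1 /proc/uptime) ))
root@box:~# date -d @$(( boot + 1 ))
That lands you within a second or so of when the kernel attached the disk, which is plenty close enough for building a failure timeline.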