Troubleshooting non-responsive pool

Hello!
A few weeks ago a single-disk SSD pool became non-responsive. Running ls /datapool/ hangs and cannot be killed even with SIGKILL. As far as I can tell, every command that reads from the pool hangs indefinitely, and zpool export -f datapool hangs as well. dmesg gives no clue to any error.

The disk (Samsung 870 QVO 8 TB) has been in use for about 6 months and started acting up a few weeks ago; I’m guessing something is wrong with the disk. It runs fine for a couple of days after a reboot.

I’m not super worried about the data - the pool is backed up and only contains ephemeral stuff anyway. I’m mostly curious how to troubleshoot this at all, since it’s quite tedious to trial-and-error when any command that touches the pool hangs and cannot be killed.

Edit:
zfs --version yields

zfs-2.1.5-1ubuntu6~22.04.2
zfs-kmod-2.1.5-1ubuntu6~22.04.1

I’m mostly curious how to actually troubleshoot when all I/O seems to hang indefinitely, possibly finding some kind of error I could refer to in an RMA if the issue is a faulty disk.

QVO drives are already kinda living on the ragged edge, just due to needing to manage 16 discrete voltage levels per cell as a QLC drive (as opposed to 2 for SLC, 4 for MLC, and 8 for TLC). I strongly suspect the issue is hardware failure of one kind or another, most likely in the drive itself.

If you want to eliminate ZFS as a factor, make sure your backups are good, then nuke the drive, reformat ext4, and play with it a bit. I’d recommend not entirely trusting QLC drives to the degree you would HDDs or TLC SSDs, though. I mean, you always need backups, but sometimes you NEED backups, you know what I mean?
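Roughly speaking, and assuming the bare devicename really is sda (triple-check that before wiping anything), the nuke-and-repave could look something like this after a reboot:

root@yourbox:~# wipefs -a /dev/sda         # clear the ZFS labels and any partition table
root@yourbox:~# mkfs.ext4 /dev/sda         # format the whole disk as ext4
root@yourbox:~# mount /dev/sda /mnt
root@yourbox:~# dd if=/dev/urandom of=/mnt/bigfile bs=1M count=4096 status=progress ; sync

Then beat on it for a few days and see whether it still wedges with ZFS completely out of the picture.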

I had a feeling you would say that; I’ve heard you talking about QVO drives on 2.5 Admins. :slight_smile: I bought it cheap in a sale, thinking I would use it for ephemeral storage.

Good idea to nuke it and switch to another file system. But do you know of anything I could do to diagnose further when all IO seems to hang? Any command I could try that doesn’t just hang forever?

Right now I’m thinking of unplugging the SATA cable and plugging it back in. It should be hot-swappable.
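If reseating the cable by hand isn’t enough, maybe the same thing can be done from software; a rough sketch, assuming the drive is sda and sits on SATA port host1 (I’d have to double-check both before trying it):

root@yourbox:~# echo 1 > /sys/block/sda/device/delete           # ask the kernel to drop the device
root@yourbox:~# echo "- - -" > /sys/class/scsi_host/host1/scan  # rescan that port so the drive gets re-attached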

I just realized that strace might give some clue as to where things get stuck.
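Something like this is what I have in mind, assuming the stuck command is the ls from earlier (and that the kernel exposes /proc/<pid>/stack):

root@yourbox:~# ps -o pid,stat,wchan:30,cmd -p $(pgrep -xn ls)   # STAT "D" = uninterruptible sleep, which is why SIGKILL is ignored
root@yourbox:~# strace -p $(pgrep -xn ls)                        # probably just shows it parked in one blocked syscall
root@yourbox:~# cat /proc/$(pgrep -xn ls)/stack                  # kernel-side stack: roughly where in the I/O path it is stuck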

But do you know of anything I could do to diagnose further when all IO seems to hang? Any command I could try that doesn’t just hang forever?

Well, you could bypass the filesystem layer entirely to see whether the device itself has permanently hung, or if it’s just a particular I/O thread that died (and your filesystem was absolutely not prepared for the possibility of that thread dying, and therefore has not recovered from its unexpected death).

Let’s pretend your drive’s WWN is “wwn-qvo”. Try the following while the pool is unresponsive:

root@yourbox:~# pv < /dev/disk/by-id/wwn-qvo > /dev/null

If pv hangs, you’re almost certainly looking at the entire drive dropping off the bus and staying there–as opposed to the “the drive is working but the filesystem stopped” scenario, which would imply that the drive is periodically dropping off the bus, but is then coming back on its own.

In my experience, the kind of symptoms you are describing indicate a flaky drive that is dropping off the bus, and again in my experience, recovering from that typically requires a system reboot. But hey, hardware is weird, and sometimes it does unusually weird things… so like I said, troubleshoot using pv directly against the metal, bypassing every last bit of the filesystem stack.

Obviously, do not target the drive with pv, or you will destroy your data. The pipe goes from the drive, to /dev/null. That’s guaranteed non-destructive. And as long as /dev/null is one end of the pipe, it should be safe even if you swap the source and target–because while writing data to the drive is absolutely a never-come-back scenario, /dev/null won’t produce any data, so getting this backwards won’t actually screw up your drive.

If you ever start targeting a second drive instead of targeting /dev/null, though, you’d better be damn sure to measure twice and cut once! :slight_smile:
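And since you asked for something that doesn’t just hang forever: you can ask the kernel for its opinion of the device without issuing any I/O at all, and you can cap how much pv reads so a healthy drive finishes the test in seconds instead of grinding through all 8 TB. (If the drive really is wedged, the pv test will still hang; there’s no way around that.) Again pretending the bare devicename is sda:

root@yourbox:~# cat /sys/block/sda/device/state                     # "running" is normal; "offline" or "blocked" means the kernel has a problem with it
root@yourbox:~# pv -S -s 1g < /dev/disk/by-id/wwn-qvo > /dev/null   # same read test, but stop after the first 1 GiB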

possibly finding some kind of error I could refer to in an RMA

Have you tried piping dmesg through grep after the pool hang occurs? If the pool has hung and it’s on the same drive as your system logs, your system won’t be able to actually save errors in the log… but they should still show up in dmesg while the system itself is running.

For this, you’ll need to know the bare devicename of the drive. We’ll pretend again that the WWN of your drive is wwn-qvo and that you’ve properly added it to the pool using that WWN, rather than the bare devicename. First, make sure you’ve got your WWN correct by checking zpool status, then refer to /dev/disk/by-id to get its bare devicename:

root@box:~# zpool status | grep wwn
	  wwn-qvo-part3  ONLINE       0     0     0
root@box:~# ls -l /dev/disk/by-id | grep wwn-qvo | grep -v part
lrwxrwxrwx 1 root root 10 Feb  6 04:11 wwn-qvo -> ../../sda

Now that we know that the current bare devicename of wwn-qvo is sda, we can check dmesg for any information about it during the current system session:

root@box:~# dmesg  | grep sda
[    1.095439] sd 1:0:0:0: [sda] 500118192 512-byte logical blocks: (256 GB/238 GiB)
[    1.095516] sd 1:0:0:0: [sda] Write Protect is off
[    1.095521] sd 1:0:0:0: [sda] Mode Sense: 00 3a 00 00
[    1.095557] sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    1.098319]  sda: sda1 sda2 sda3
[    1.098754] sd 1:0:0:0: [sda] Attached SCSI disk
[    3.058381] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
[    3.600933] EXT4-fs (sda2): re-mounted. Opts: errors=remount-ro. Quota mode: none.

In particular, notice the line where the system first brought the device online:

root@box:~# dmesg  | grep sda | grep Attached
[    1.098754] sd 1:0:0:0: [sda] Attached SCSI disk

This is an even lower level of “bare device identifier”: the SCSI (in this case, pseudo-SCSI) path layer. We can check dmesg again to see if it had anything else to say about that hardware path:

root@box:~# dmesg | grep "sd 1:0:0:0"
[    1.094996] sd 1:0:0:0: Attached scsi generic sg0 type 0
[    1.095439] sd 1:0:0:0: [sda] 500118192 512-byte logical blocks: (256 GB/238 GiB)
[    1.095516] sd 1:0:0:0: [sda] Write Protect is off
[    1.095521] sd 1:0:0:0: [sda] Mode Sense: 00 3a 00 00
[    1.095557] sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    1.098754] sd 1:0:0:0: [sda] Attached SCSI disk

For the most part, you can expect hardware I/O errors to get logged with the normal bare device name, e.g. /dev/sda. But as you can see above, there is the occasional extremely low level log message that only uses the SCSI path identifier: in the example above, it’s the very first Attached message, which you’ll note happens first, before the system has even identified the capacity or capabilities of the device.

When you have a drive that’s dropping off the SATA bus, and a system that is still running except when you try to do anything at all that touches the bus, you can use dmesg like this to identify both the failure, and when the failure happened.
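One more grep worth trying: the libata layer logs link-level trouble under an ataN identifier rather than the sdX name, so messages about link resets or failed commands may only show up there (the exact wording varies by kernel). Something along these lines:

root@box:~# dmesg | grep -E 'ata[0-9]+(\.[0-9]+)?:' | tail -n 30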

Note that the timestamps in raw dmesg output are in seconds since the running kernel booted, which may not be the most convenient thing to build a timeline from in human perspective rather than machine perspective. You can either add those timestamps to the system boot time (which you can work out from uptime) to get a human-readable time in localtime format, or you can just pass dmesg the -T argument to attempt the conversion for you:

root@box:~# dmesg -T | grep "sd 1:0:0:0" | head -n1
[Tue Feb  6 04:11:48 2024] sd 1:0:0:0: Attached scsi generic sg0 type 0

However, note that these human-readable timestamps may be inaccurate as they do not account for time spent in suspend/resume states. The only thing that dmesg knows for sure is how many seconds of runtime the kernel had on the odometer at the time that the event happened; it does not know what the actual localtime() was at the time of the event. That gets looked up separately by the syslog facility as the message is headed to disk from dmesg, as I understand it–so you can rely on human-readable timestamps from syslog, but from dmesg they’re essentially just a best-guess.
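If you do want to do that arithmetic by hand, a quick way (assuming GNU date and procps uptime, and ignoring any time the box spent suspended) is to add the dmesg offset to the boot time:

root@box:~# boot_epoch=$(date -d "$(uptime -s)" +%s)   # boot time as a Unix epoch
root@box:~# date -d "@$((boot_epoch + 1))"             # plus the integer part of the dmesg timestamp; 1 second in the example above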
