I’m running ZFS on Ubuntu 22.04 on several systems, which currently means these versions:
$ zfs --version
zfs-2.1.5-1ubuntu6~22.04.4
zfs-kmod-2.2.2-0ubuntu9.4
I recently started getting kernel panics when importing a zpool. Googling around, I found that I’m not completely alone in this. I also found, for example, this comment; it’s an old one, but it indicates that the basic operational concept of ZFS is to fail loudly and completely when it detects errors, and to tell the user to recreate the zpool from backups.
Is that still the case (i.e., that zpool re-creation from scratch is the only way), or is there some standard way to try to repair a zpool that apparently has metadata problems (depending on what type of metadata problem, obviously)? (Bad data that is found by a scrub has never caused me problems, but ZFS seems to be lousy at handling bad metadata.)
I also have a specific question: How do I mount an encrypted filesystem from a read-only zpool at a temporary mountpoint? When I import a zpool as read-only, all unencrypted filesystems appear instantly, and I can load the encryption keys. But when I try to mount the encrypted filesystems, I get the “cannot mount '/.../.../...': failed to create mountpoint: Read-only file system” error message, even when using “zfs mount -o readonly=on -o mountpoint=/tmp/.....”.
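The sequence described above can be sketched roughly as the following shell script. This is only an illustration under stated assumptions: the pool name, dataset name and alternate root are hypothetical placeholders, and the -N (do not mount anything on import) and -R (alternate root) options are additions that are not part of what was tried above. Whether they avoid the “failed to create mountpoint” error is not verified here. By default the script only prints the commands.

```shell
#!/bin/sh
# Hypothetical placeholder names; replace with the real pool and dataset.
POOL="backuppool"
DATASET="$POOL/encrypted"
ALTROOT="/tmp/zfs-recovery"   # assumption: a writable directory outside the pool

# Dry run by default: set RUN=1 to actually execute the commands.
RUN=0

run() {
    if [ "$RUN" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

import_readonly() {
    # readonly=on is given to "zpool import", so it applies to the pool as a
    # whole; it is not the per-dataset "readonly" property.
    # -N skips mounting datasets, -R sets an alternate root for mountpoints.
    run zpool import -o readonly=on -N -R "$ALTROOT" "$POOL"
}

load_key() {
    # Load the encryption key for the dataset (prompts if keylocation=prompt).
    run zfs load-key "$DATASET"
}

try_mount() {
    # Mount the dataset at its mountpoint, which is relative to the altroot.
    run zfs mount "$DATASET"
}

import_readonly
load_key
try_mount
```

Run as is, the script just prints the three commands so they can be checked before anything touches the pool.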
Some more details:
One of the systems is a single-board computer with a couple of SATA interfaces, one native and some others on an M.2 PCIe-to-SATA expander board. The zpools on that one recently stopped working completely: first the pool was suspended due to a lot of I/O errors, and then the system did not boot up at all after a reboot (it got stuck on an “import zpools based on cache” step). I don’t know exactly what went wrong when, but I did replace the SATA controller board because the old one broke. However, that should not have affected the zpool connected directly to the mainboard. According to the kernel log, the problem on import is this:
VERIFY0(0 == dmu_buf_hold_array(os, object, offset, size, FALSE, FTAG, &numbufs, &dbp)) failed (0 == 5)
PANIC at dmu.c:1144:dmu_write()
Showing stack for process 9595
CPU: 0 PID: 9595 Comm: txg_sync Tainted: P O 6.8.0-90-generic #91~22.04.1-Ubuntu
… (and something relating to “space_map_write” → “dmu_write”, which I haven’t looked into)
I can trace this to the dmu_write() function in dmu.c (line 1144, according to the panic message above).
At first I suspected the SBC and/or the SATA expander board, but when I moved the disk to my main desktop system, it gave me the same error.
I can, however, import the zpool in read-only mode (the zpool itself must be read-only, not the filesystems contained on it; this is an easy syntax error to make), so the data on it is probably not completely gone.
(But I’m not going to, since this is a backup disk with almost no “original” data on it, and all primary sources for the backups are still up and running. And repairing a zpool, even if possible, seems more risky than recreating it from scratch.)