I have a Proxmox system that uses ZFS. (The reason for not posting in the Proxmox forum is that I have since concluded that my issue is not caused by or related to Proxmox itself.) Proxmox sets it up like this: a single disk (an NVMe SSD in my case) with 3 partitions: a 1 MiB BIOS boot partition, a 1 GiB EFI partition, and the rest as a partition for the ZFS pool named rpool. The fdisk -l output is:
Disk model: SSD 980 PRO 1TB
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 33553920 bytes
Disklabel type: gpt
Disk identifier: ...
Device Start End Sectors Size Type
/dev/sdc1 34 2047 2014 1007K BIOS boot
/dev/sdc2 2048 2099199 2097152 1G EFI System
/dev/sdc3 2099200 1953517568 1951418369 930.5G Solaris /usr & Apple ZFS
The SSD is a Samsung 980 Pro 1TB, but it has had the latest firmware (5B2QGXA7) from the beginning, so it should not be affected by the Samsung firmware bugs of the past few years.
Earlier this week, the machine froze up for no apparent reason, and I started investigating. There was a kernel panic shown on the physically attached screen:
PANIC: zfs: adding existent segment to range tree (offset=835bb23000 size=b000)
We could not get the system to respond in any way, so we turned it off using the Magic SysRq key sequence REISUO.
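(For anyone unfamiliar with REISUO: it is the usual "safe-ish emergency shutdown" SysRq sequence: unRaw, tErminate, kIll, Sync, remount read-only, power Off. We used the keyboard combination Alt+SysRq+<key>; on a machine that still has a working shell, the same actions can be triggered through procfs. A minimal sketch, for illustration only:)

# r = unRaw keyboard, e = tErminate tasks, i = kIll tasks,
# s = Sync disks, u = remount filesystems read-only, o = power Off
for key in r e i s u o; do
    echo "$key" | sudo tee /proc/sysrq-trigger
    sleep 2
done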
Then I booted from a LinuxMint 22.1 Live USB and started investigating:
- The SSD seemed healthy, according to the nvme command line tool (the exact query is sketched below, after this list). The percentage_used is at 41, but that should be fine, as far as I know.
- I could still mount and read the EFI filesystem without issues.
- zpool import detected the pool, but told me about errors:
   pool: rpool
     id: 1864023278626120
  state: UNAVAIL
 status: The pool was last accessed by another system.
 action: The pool cannot be imported due to damaged devices or data.
    see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-EY
 config:

        rpool                              UNAVAIL  unsupported feature(s)
          nvme-eui.002538bc21b5192c-part3  ONLINE
- I solved the "unsupported feature(s)" by upgrading ZFS on the LinuxMint 22.1 recovery system using the "Latest OpenZFS for Ubuntu" packages by Juhyung Park.
- I tried running zpool import -F -f rpool to recover the data by throwing away the bad transactions. I left the process running for just under 24 hours, after which htop showed that it had only consumed 0.32 seconds of CPU time and that the process was in "Disk Sleep", which I found odd. I could not kill the process, even with SIGKILL, so I had to forcibly reboot again.
- To make sure it's not the motherboard's NVMe slot that is the problem, I put the SSD into a USB-NVMe enclosure and tried again, which yielded the same result: a hanging process. I left that running for some time, too.
- To make sure it’s not the SSD itself that is broken, I used ddrescue to copy the entire SSD (now at /dev/sdc due to the USB enclosure) to a file called dump.img on my healthy 10TB backup drive:
ddrescue -v -r3 /dev/sdc dump.img dump.log
- I attached that file as a loop device /dev/loop1 using losetup (the -P flag makes the partitions show up as /dev/loop1p1, /dev/loop1p2, /dev/loop1p3):
losetup -fP dump.img
- I tried running the recovery on the loop device, but it is hanging just like before, with no CPU usage, and in Disk Sleep:
zpool import -d /dev/loop1p3 -F -N -n -f rpool
- I opened another terminal, and interestingly, I can do zfs list and zfs mount there, and browse the mounted datasets.
- When I try to run zfs unmount or zpool export from new terminals, those also hang indefinitely, in Disk Sleep.
- Only at this point did I think to look in dmesg (facepalm), where I saw this kind of message repeatedly (one way to inspect such blocked tasks further is sketched below, after this list):
[Wed Jun 11 11:38:01 2025] INFO: task z_wr_int_3:7586 blocked for more than 122 seconds.
[Wed Jun 11 11:38:01 2025] Tainted: P OE 6.8.0-51-generic #52-Ubuntu
[Wed Jun 11 11:38:01 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Wed Jun 11 11:38:01 2025] task:z_wr_int_3 state:D stack:0 pid:7586 tgid:7586 ppid:2 flags:0x00004000
[Wed Jun 11 11:38:01 2025] Call Trace:
[Wed Jun 11 11:38:01 2025] <TASK>
[Wed Jun 11 11:38:01 2025] __schedule+0x27c/0x6b0
[Wed Jun 11 11:38:01 2025] schedule+0x33/0x110
[Wed Jun 11 11:38:01 2025] schedule_preempt_disabled+0x15/0x30
[Wed Jun 11 11:38:01 2025] __mutex_lock.constprop.0+0x42f/0x740
[Wed Jun 11 11:38:01 2025] __mutex_lock_slowpath+0x13/0x20
[Wed Jun 11 11:38:01 2025] mutex_lock+0x3c/0x50
[Wed Jun 11 11:38:01 2025] metaslab_free_concrete+0xdd/0x2d0 [zfs]
[Wed Jun 11 11:38:01 2025] metaslab_free_impl+0xc1/0x110 [zfs]
[Wed Jun 11 11:38:01 2025] metaslab_free_dva+0x61/0x90 [zfs]
[Wed Jun 11 11:38:01 2025] metaslab_free+0xe0/0x1d0 [zfs]
[Wed Jun 11 11:38:01 2025] zio_free_sync+0x11d/0x130 [zfs]
[Wed Jun 11 11:38:01 2025] zio_free+0xcf/0x100 [zfs]
[Wed Jun 11 11:38:01 2025] dsl_free+0x11/0x20 [zfs]
[Wed Jun 11 11:38:01 2025] dsl_dataset_block_kill+0x42d/0x630 [zfs]
[Wed Jun 11 11:38:01 2025] dbuf_write_done+0x193/0x1d0 [zfs]
[Wed Jun 11 11:38:01 2025] arc_write_done+0xa7/0x550 [zfs]
[Wed Jun 11 11:38:01 2025] zio_done+0x26e/0x1260 [zfs]
[Wed Jun 11 11:38:01 2025] ? spl_kmem_free+0x31/0x40 [spl]
[Wed Jun 11 11:38:01 2025] zio_execute+0x94/0x170 [zfs]
[Wed Jun 11 11:38:01 2025] taskq_thread+0x333/0x6f0 [spl]
[Wed Jun 11 11:38:01 2025] ? __pfx_default_wake_function+0x10/0x10
[Wed Jun 11 11:38:01 2025] ? __pfx_zio_execute+0x10/0x10 [zfs]
[Wed Jun 11 11:38:01 2025] ? __pfx_taskq_thread+0x10/0x10 [spl]
[Wed Jun 11 11:38:01 2025] kthread+0xef/0x120
[Wed Jun 11 11:38:01 2025] ? __pfx_kthread+0x10/0x10
[Wed Jun 11 11:38:01 2025] ret_from_fork+0x44/0x70
[Wed Jun 11 11:38:01 2025] ? __pfx_kthread+0x10/0x10
[Wed Jun 11 11:38:01 2025] ret_from_fork_asm+0x1b/0x30
[Wed Jun 11 11:38:01 2025] </TASK>
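Since zpool import, zfs unmount and zpool export all ended up in uninterruptible sleep, here is one way to look at where exactly they are stuck. This is a sketch only; the PID 7586 is just the z_wr_int_3 thread from the dmesg excerpt above, substitute the PID of whichever task is hung:

# list all tasks currently in uninterruptible sleep (D) and the kernel function they are waiting in
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'
# dump the kernel stack of one stuck task
sudo cat /proc/7586/stack
# or ask the kernel to log the stacks of all blocked tasks to dmesg (SysRq "w")
echo w | sudo tee /proc/sysrq-trigger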
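For completeness, the SSD health check mentioned in the first bullet above was based on the NVMe SMART log; a minimal sketch, assuming the drive is attached directly as /dev/nvme0 (it only shows up as /dev/sdc while it sits in the USB enclosure):

sudo nvme smart-log /dev/nvme0    # critical_warning, percentage_used, media_errors, num_err_log_entries
sudo nvme error-log /dev/nvme0    # per-command error log entries, if any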
I am a bit out of my depth here, and hoping that someone can shine some light on what is going on, and what I should try next.
Update at 2025/06/11 16:54 (SAST)
I started reading about debugging with the zdb command.
I can still see the datasets without importing the pool. This does not freeze:
zdb -d rpool -e -p /dev/loop1p3
Dataset mos [META], ID 0, cr_txg 4, 322M, 420 objects
Dataset rpool/ROOT/pve-1 [ZPL], ID 643, cr_txg 9, 13.8G, 383359 objects
Dataset rpool/ROOT [ZPL], ID 516, cr_txg 8, 96K, 7 objects
Dataset rpool/data/base-100-disk-0@__base__ [ZVOL], ID 3732, cr_txg 211707, 7.39G, 2 objects
Dataset rpool/data/base-100-disk-0 [ZVOL], ID 3717, cr_txg 211542, 7.39G, 2 objects
Dataset rpool/data/base-102-disk-0@__base__ [ZVOL], ID 2804, cr_txg 213162, 18.9G, 2 objects
Dataset rpool/data/base-102-disk-0 [ZVOL], ID 4154, cr_txg 212733, 18.9G, 2 objects
Dataset rpool/data/vm-107-disk-0@init-20240723 [ZVOL], ID 4951, cr_txg 214705, 16.5G, 2 objects
Dataset rpool/data/vm-107-disk-0 [ZVOL], ID 1906, cr_txg 214454, 21.6G, 2 objects
Dataset rpool/data/vm-105-disk-0@init-20240723 [ZVOL], ID 5566, cr_txg 214703, 16.5G, 2 objects
Dataset rpool/data/vm-105-disk-0 [ZVOL], ID 6401, cr_txg 214450, 23.5G, 2 objects
Dataset rpool/data/vm-108-disk-0@a-20241211-1005 [ZVOL], ID 57929, cr_txg 2592686, 20.5G, 2 objects
Dataset rpool/data/vm-108-disk-0@a-20250331-1501 [ZVOL], ID 57247, cr_txg 4470092, 21.1G, 2 objects
Dataset rpool/data/vm-108-disk-0@a-20250109-1156 [ZVOL], ID 125617, cr_txg 3084383, 21.0G, 2 objects
Dataset rpool/data/vm-108-disk-0@a-20241128-1441 [ZVOL], ID 21503, cr_txg 2376151, 18.1G, 2 objects
Dataset rpool/data/vm-108-disk-0@init-20240807-1925 [ZVOL], ID 139938, cr_txg 470249, 17.9G, 2 objects
Dataset rpool/data/vm-108-disk-0 [ZVOL], ID 137418, cr_txg 468510, 20.1G, 2 objects
Dataset rpool/data/vm-106-disk-0@init-20240723 [ZVOL], ID 1919, cr_txg 214704, 16.5G, 2 objects
Dataset rpool/data/vm-106-disk-0 [ZVOL], ID 5190, cr_txg 214452, 25.2G, 2 objects
Dataset rpool/data/vm-104-disk-0@init-20240723 [ZVOL], ID 5722, cr_txg 214702, 16.5G, 2 objects
Dataset rpool/data/vm-104-disk-0 [ZVOL], ID 4812, cr_txg 214448, 25.3G, 2 objects
Dataset rpool/data/base-101-disk-0@__base__ [ZVOL], ID 3705, cr_txg 212230, 8.66G, 2 objects
Dataset rpool/data/base-101-disk-0 [ZVOL], ID 2428, cr_txg 212033, 8.66G, 2 objects
Dataset rpool/data/base-103-disk-0@__base__ [ZVOL], ID 5172, cr_txg 214421, 16.4G, 2 objects
Dataset rpool/data/base-103-disk-0 [ZVOL], ID 5147, cr_txg 213925, 16.4G, 2 objects
Dataset rpool/data [ZPL], ID 650, cr_txg 10, 96K, 6 objects
Dataset rpool/var-lib-vz [ZPL], ID 771, cr_txg 11, 96K, 11 objects
Dataset rpool [ZPL], ID 54, cr_txg 1, 104K, 8 objects
Verified large_blocks feature refcount of 0 is correct
Verified large_dnode feature refcount of 0 is correct
Verified sha512 feature refcount of 0 is correct
Verified skein feature refcount of 0 is correct
Verified edonr feature refcount of 0 is correct
Verified userobj_accounting feature refcount of 5 is correct
Verified encryption feature refcount of 0 is correct
Verified project_quota feature refcount of 5 is correct
Verified redaction_bookmarks feature refcount of 0 is correct
Verified redacted_datasets feature refcount of 0 is correct
Verified bookmark_written feature refcount of 0 is correct
Verified livelist feature refcount of 0 is correct
Verified zstd_compress feature refcount of 0 is correct
Verified zilsaxattr feature refcount of 1 is correct
Verified blake3 feature refcount of 0 is correct
Verified device_removal feature refcount of 0 is correct
Verified indirect_refcount feature refcount of 0 is correct
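Side note: according to the zdb man page, the same -e / -p export-mode flags should also work for the lower-level views, so the vdev labels, the active uberblock and the metaslab layout can be inspected without importing the pool either. A sketch of further commands of this kind, using the same loop device (not yet verified against this pool):

zdb -l /dev/loop1p3                  # vdev labels: pool GUID, txg, feature flags
zdb -e -p /dev/loop1p3 -u rpool      # the currently active uberblock
zdb -e -p /dev/loop1p3 -m rpool      # per-metaslab statistics (of interest given the range-tree panic)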