`task z_wr_int_N:N blocked for more than N seconds`

I have a Proxmox system that uses ZFS. (The reason for not posting in the Proxmox forum is that I have since concluded that my issue is not caused by or related to Proxmox itself.) Proxmox sets the disk up like this: a single disk (an NVMe SSD in my case) with three partitions: a 1 MiB BIOS boot partition, a 1 GiB EFI partition, and the rest as a partition for the ZFS pool named rpool. The fdisk -l output is:

Disk model: SSD 980 PRO 1TB
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 33553920 bytes
Disklabel type: gpt
Disk identifier: ...

Device       Start        End    Sectors   Size Type
/dev/sdc1       34       2047       2014  1007K BIOS boot
/dev/sdc2     2048    2099199    2097152     1G EFI System
/dev/sdc3  2099200 1953517568 1951418369 930.5G Solaris /usr & Apple ZFS

The SSD is a Samsung 980 Pro 1TB, but it has had the latest firmware (5B2QGXA7) from the beginning, so it should not be affected by the Samsung firmware bugs of the past few years.

Earlier this week, the machine froze for no apparent reason, and I started investigating. The physically attached screen showed a kernel panic:

PANIC: zfs: adding existent segment to range tree (offset=835bb23000 size=b000)

We could not get the system to respond in any way, so we turned it off using the Magic SysRq key sequence REISUO.
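
We had to do this from the attached keyboard, since no shell was responsive; for reference, the same sync / remount-read-only / power-off steps can also be triggered from a root shell via /proc when SysRq is enabled. A minimal sketch:

    # enable all Magic SysRq functions
    echo 1 > /proc/sys/kernel/sysrq
    # the S, U and O steps of REISUO: sync, remount read-only, power off
    echo s > /proc/sysrq-trigger
    echo u > /proc/sysrq-trigger
    echo o > /proc/sysrq-trigger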

Then I booted from a Linux Mint 22.1 live USB and started investigating:

  • The SSD seemed healthy according to the nvme command-line tool (a sketch of the check follows after this list). The percentage_used attribute is at 41%, but that should be fine, as far as I know.
  • I could still mount and read the EFI filesystem without issues.
  • zpool import detected the pool, but told me about errors:
      pool: rpool
        id: 1864023278626120
     state: UNAVAIL
    status: The pool was last accessed by another system.
    action: The pool cannot be imported due to damaged devices or data.
       see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-EY
    config:
    
            rpool                              UNAVAIL  unsupported feature(s)
              nvme-eui.002538bc21b5192c-part3  ONLINE
    
  • I solved the “unsupported feature(s)” issue by upgrading ZFS on the Linux Mint 22.1 recovery system, following the guide “Latest OpenZFS for Ubuntu” by Juhyung Park (a rough sketch of the steps follows after this list).
  • I tried running zpool import -F -f rpool to recover the data by throwing away the bad transactions. I left the process running for just under 24 hours, after which htop showed that it had only consumed 0.32 seconds of CPU time and that it was in “Disk Sleep” (state D), which I found odd. I could not kill the process, even with SIGKILL, so I had to forcibly reboot again.
  • To make sure the motherboard’s NVMe slot is not the problem, I put the SSD into a USB-to-NVMe enclosure and tried again, which yielded the same result: a hanging process. I left that running for some time, too.
  • To make sure the SSD itself is not broken, I used ddrescue to copy the entire SSD (now at /dev/sdc due to the USB enclosure) to a file called dump.img on my healthy 10 TB backup drive:
    ddrescue -v -r3 /dev/sdc dump.img dump.log
    
  • I attached that file as a loop device using losetup (-f picks the first free loop device, which in my case was /dev/loop1, and -P scans its partition table):
    losetup -fP dump.img
    
  • I tried running the recovery on the loop device, but it hangs just like before, with no CPU usage and in Disk Sleep:
    zpool import -d /dev/loop1p3 -F -N -n -f rpool
    
  • I opened another terminal, and interestingly, I can do zfs list and zfs mount there, and browse the mounted datasets.
  • When I try to run zfs unmount or zpool export from new terminals, those also hang indefinitely, in Disk Sleep.
  • Only at this point did I think to look in dmesg (facepalm), where I saw this kind of message repeatedly:
    [Wed Jun 11 11:38:01 2025] INFO: task z_wr_int_3:7586 blocked for more than 122 seconds.
    [Wed Jun 11 11:38:01 2025]       Tainted: P           OE      6.8.0-51-generic #52-Ubuntu
    [Wed Jun 11 11:38:01 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [Wed Jun 11 11:38:01 2025] task:z_wr_int_3      state:D stack:0     pid:7586  tgid:7586  ppid:2      flags:0x00004000
    [Wed Jun 11 11:38:01 2025] Call Trace:
    [Wed Jun 11 11:38:01 2025]  <TASK>
    [Wed Jun 11 11:38:01 2025]  __schedule+0x27c/0x6b0
    [Wed Jun 11 11:38:01 2025]  schedule+0x33/0x110
    [Wed Jun 11 11:38:01 2025]  schedule_preempt_disabled+0x15/0x30
    [Wed Jun 11 11:38:01 2025]  __mutex_lock.constprop.0+0x42f/0x740
    [Wed Jun 11 11:38:01 2025]  __mutex_lock_slowpath+0x13/0x20
    [Wed Jun 11 11:38:01 2025]  mutex_lock+0x3c/0x50
    [Wed Jun 11 11:38:01 2025]  metaslab_free_concrete+0xdd/0x2d0 [zfs]
    [Wed Jun 11 11:38:01 2025]  metaslab_free_impl+0xc1/0x110 [zfs]
    [Wed Jun 11 11:38:01 2025]  metaslab_free_dva+0x61/0x90 [zfs]
    [Wed Jun 11 11:38:01 2025]  metaslab_free+0xe0/0x1d0 [zfs]
    [Wed Jun 11 11:38:01 2025]  zio_free_sync+0x11d/0x130 [zfs]
    [Wed Jun 11 11:38:01 2025]  zio_free+0xcf/0x100 [zfs]
    [Wed Jun 11 11:38:01 2025]  dsl_free+0x11/0x20 [zfs]
    [Wed Jun 11 11:38:01 2025]  dsl_dataset_block_kill+0x42d/0x630 [zfs]
    [Wed Jun 11 11:38:01 2025]  dbuf_write_done+0x193/0x1d0 [zfs]
    [Wed Jun 11 11:38:01 2025]  arc_write_done+0xa7/0x550 [zfs]
    [Wed Jun 11 11:38:01 2025]  zio_done+0x26e/0x1260 [zfs]
    [Wed Jun 11 11:38:01 2025]  ? spl_kmem_free+0x31/0x40 [spl]
    [Wed Jun 11 11:38:01 2025]  zio_execute+0x94/0x170 [zfs]
    [Wed Jun 11 11:38:01 2025]  taskq_thread+0x333/0x6f0 [spl]
    [Wed Jun 11 11:38:01 2025]  ? __pfx_default_wake_function+0x10/0x10
    [Wed Jun 11 11:38:01 2025]  ? __pfx_zio_execute+0x10/0x10 [zfs]
    [Wed Jun 11 11:38:01 2025]  ? __pfx_taskq_thread+0x10/0x10 [spl]
    [Wed Jun 11 11:38:01 2025]  kthread+0xef/0x120
    [Wed Jun 11 11:38:01 2025]  ? __pfx_kthread+0x10/0x10
    [Wed Jun 11 11:38:01 2025]  ret_from_fork+0x44/0x70
    [Wed Jun 11 11:38:01 2025]  ? __pfx_kthread+0x10/0x10
    [Wed Jun 11 11:38:01 2025]  ret_from_fork_asm+0x1b/0x30
    [Wed Jun 11 11:38:01 2025]  </TASK>
    
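For reference, the drive health check mentioned in the first bullet was roughly the following (a sketch; it assumes the nvme-cli package, and the controller name nvme0 is from my setup and may differ, especially behind the USB enclosure):

    # read the controller's SMART log and pick out the interesting fields
    nvme smart-log /dev/nvme0 | grep -Ei 'critical_warning|media_errors|percentage_used'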

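The ZFS upgrade from the guide boiled down to roughly the following (a sketch from memory; the PPA name is what I believe the guide uses, so double-check it against the article):

    # add the OpenZFS PPA from the guide (name assumed, verify against the article)
    sudo add-apt-repository ppa:arter97/zfs
    sudo apt update
    # install the newer user-space tools and the DKMS kernel module
    sudo apt install zfs-dkms zfsutils-linux
    # confirm which user-space and kernel-module versions are active
    zfs version
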
I am a bit out of my depth here, and I am hoping that someone can shed some light on what is going on and what I should try next.


Update at 2025/06/11 16:54 (SAST)

I started reading about debugging with the zdb command.

I can still see the datasets without importing the pool; this command does not freeze:

zdb -d rpool -e -p /dev/loop1p3
Dataset mos [META], ID 0, cr_txg 4, 322M, 420 objects
Dataset rpool/ROOT/pve-1 [ZPL], ID 643, cr_txg 9, 13.8G, 383359 objects
Dataset rpool/ROOT [ZPL], ID 516, cr_txg 8, 96K, 7 objects
Dataset rpool/data/base-100-disk-0@__base__ [ZVOL], ID 3732, cr_txg 211707, 7.39G, 2 objects
Dataset rpool/data/base-100-disk-0 [ZVOL], ID 3717, cr_txg 211542, 7.39G, 2 objects
Dataset rpool/data/base-102-disk-0@__base__ [ZVOL], ID 2804, cr_txg 213162, 18.9G, 2 objects
Dataset rpool/data/base-102-disk-0 [ZVOL], ID 4154, cr_txg 212733, 18.9G, 2 objects
Dataset rpool/data/vm-107-disk-0@init-20240723 [ZVOL], ID 4951, cr_txg 214705, 16.5G, 2 objects
Dataset rpool/data/vm-107-disk-0 [ZVOL], ID 1906, cr_txg 214454, 21.6G, 2 objects
Dataset rpool/data/vm-105-disk-0@init-20240723 [ZVOL], ID 5566, cr_txg 214703, 16.5G, 2 objects
Dataset rpool/data/vm-105-disk-0 [ZVOL], ID 6401, cr_txg 214450, 23.5G, 2 objects
Dataset rpool/data/vm-108-disk-0@a-20241211-1005 [ZVOL], ID 57929, cr_txg 2592686, 20.5G, 2 objects
Dataset rpool/data/vm-108-disk-0@a-20250331-1501 [ZVOL], ID 57247, cr_txg 4470092, 21.1G, 2 objects
Dataset rpool/data/vm-108-disk-0@a-20250109-1156 [ZVOL], ID 125617, cr_txg 3084383, 21.0G, 2 objects
Dataset rpool/data/vm-108-disk-0@a-20241128-1441 [ZVOL], ID 21503, cr_txg 2376151, 18.1G, 2 objects
Dataset rpool/data/vm-108-disk-0@init-20240807-1925 [ZVOL], ID 139938, cr_txg 470249, 17.9G, 2 objects
Dataset rpool/data/vm-108-disk-0 [ZVOL], ID 137418, cr_txg 468510, 20.1G, 2 objects
Dataset rpool/data/vm-106-disk-0@init-20240723 [ZVOL], ID 1919, cr_txg 214704, 16.5G, 2 objects
Dataset rpool/data/vm-106-disk-0 [ZVOL], ID 5190, cr_txg 214452, 25.2G, 2 objects
Dataset rpool/data/vm-104-disk-0@init-20240723 [ZVOL], ID 5722, cr_txg 214702, 16.5G, 2 objects
Dataset rpool/data/vm-104-disk-0 [ZVOL], ID 4812, cr_txg 214448, 25.3G, 2 objects
Dataset rpool/data/base-101-disk-0@__base__ [ZVOL], ID 3705, cr_txg 212230, 8.66G, 2 objects
Dataset rpool/data/base-101-disk-0 [ZVOL], ID 2428, cr_txg 212033, 8.66G, 2 objects
Dataset rpool/data/base-103-disk-0@__base__ [ZVOL], ID 5172, cr_txg 214421, 16.4G, 2 objects
Dataset rpool/data/base-103-disk-0 [ZVOL], ID 5147, cr_txg 213925, 16.4G, 2 objects
Dataset rpool/data [ZPL], ID 650, cr_txg 10, 96K, 6 objects
Dataset rpool/var-lib-vz [ZPL], ID 771, cr_txg 11, 96K, 11 objects
Dataset rpool [ZPL], ID 54, cr_txg 1, 104K, 8 objects
Verified large_blocks feature refcount of 0 is correct
Verified large_dnode feature refcount of 0 is correct
Verified sha512 feature refcount of 0 is correct
Verified skein feature refcount of 0 is correct
Verified edonr feature refcount of 0 is correct
Verified userobj_accounting feature refcount of 5 is correct
Verified encryption feature refcount of 0 is correct
Verified project_quota feature refcount of 5 is correct
Verified redaction_bookmarks feature refcount of 0 is correct
Verified redacted_datasets feature refcount of 0 is correct
Verified bookmark_written feature refcount of 0 is correct
Verified livelist feature refcount of 0 is correct
Verified zstd_compress feature refcount of 0 is correct
Verified zilsaxattr feature refcount of 1 is correct
Verified blake3 feature refcount of 0 is correct
Verified device_removal feature refcount of 0 is correct
Verified indirect_refcount feature refcount of 0 is correct
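
Since the panic message (“adding existent segment to range tree”) points at space map / metaslab metadata, the further read-only zdb checks I am considering look roughly like this (a sketch based on the zdb man page):

    # dump the current uberblock of the un-imported pool
    zdb -e -p /dev/loop1p3 -u rpool
    # per-metaslab space map information, which is where the panic points
    zdb -e -p /dev/loop1p3 -m rpool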