ZFS I/O Error, Kernel Panic during import

Jealous_Donut_7128 · July 1, 2023, 1:10pm

I’m running a raidz1-0 (RAID5-like) setup with four 2TB data SSDs on CentOS.

Around midnight, two of my data disks somehow hit I/O errors (according to /var/log/messages).

When I investigated in the morning, zpool status showed the following:

 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://zfsonlinux.org/msg/ZFS-8000-HC
  scan: resilvered 1.36T in 0 days 04:23:23 with 0 errors on Thu Apr 20 21:40:48 2023
config:

        NAME        STATE     READ WRITE CKSUM
        zfs51       UNAVAIL      0     0     0  insufficient replicas
          raidz1-0  UNAVAIL     36     0     0  insufficient replicas
            sdc     FAULTED     57     0     0  too many errors
            sdd     ONLINE       0     0     0
            sde     UNAVAIL      0     0     0
            sdf     ONLINE       0     0     0

errors: List of errors unavailable: pool I/O is currently suspended
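To make the "insufficient replicas" line concrete, here is a minimal sketch that reads the config table above and counts how many members of raidz1-0 are not ONLINE. The helper name member_states and the constant RAIDZ1_PARITY are made up for illustration, and treating every non-ONLINE state as a lost disk is a simplification.

```python
# Hypothetical helper: list the leaf devices under a vdev in `zpool status`
# text and compare the number that are not ONLINE with raidz1's tolerance.
# The sample text is copied from the status output in this post.
STATUS_TEXT = """\
        NAME        STATE     READ WRITE CKSUM
        zfs51       UNAVAIL      0     0     0  insufficient replicas
          raidz1-0  UNAVAIL     36     0     0  insufficient replicas
            sdc     FAULTED     57     0     0  too many errors
            sdd     ONLINE       0     0     0
            sde     UNAVAIL      0     0     0
            sdf     ONLINE       0     0     0
"""

RAIDZ1_PARITY = 1  # raidz1 survives the loss of one member per vdev


def member_states(status_text, vdev_name="raidz1-0"):
    """Return {device: state} for the devices indented under vdev_name."""
    states = {}
    vdev_indent = None
    for line in status_text.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        fields = line.split()
        if vdev_indent is None:
            # Still looking for the vdev's own row.
            if fields[0] == vdev_name:
                vdev_indent = indent
            continue
        if indent <= vdev_indent:
            break  # indentation dropped back: we left the vdev's subtree
        states[fields[0]] = fields[1]
    return states


states = member_states(STATUS_TEXT)
unavailable = sorted(dev for dev, st in states.items() if st != "ONLINE")
print("not ONLINE:", unavailable)
print("within raidz1 tolerance:", len(unavailable) <= RAIDZ1_PARITY)
```

With sdc FAULTED and sde UNAVAIL, two members are out while raidz1 only covers one, which matches the pool being reported as UNAVAIL.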

I tried running zpool clear, but I keep getting the error message: cannot clear errors for zfs51: I/O error

Next, I tried rebooting to see if that would resolve it, but the system had trouble shutting down.

As a result, I had to do a hard reset. When the system booted back up, the pool was not imported.

Running zpool import zfs51 now returns:

        Destroy and re-create the pool from
        a backup source.

Even with -f or -F, I get the same error. Strangely, when I run zpool import -F, it shows the pool and all the disks as online:

   pool: zfs51
     id: 12204763083768531851
  state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:

        zfs51       ONLINE
          raidz1-0  ONLINE
            sdc     ONLINE
            sdd     ONLINE
            sde     ONLINE
            sdf     ONLINE

However, when I import by pool name, the same error appears.

I even tried -fF, and that doesn’t work either.

After trawling through Google and reading up on various ZFS issues, I stumbled upon the -X flag (which has reportedly solved the problem for users facing similar issues).

I went ahead and ran zpool import -fFX zfs51, and the command seems to be taking a long time. However, I noticed high read activity on all 4 data disks, which I assume is due to ZFS reading the entire data pool. But after 7 hours, all read activity on the disks stopped.

I also noticed a ZFS kernel panic message:

 kernel:PANIC: zfs: allocating allocated segment(offset=6859281825792 size=49152) of (offset=6859281825792 size=49152)
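For anyone trying to read that message, here is a small sketch that pulls the numbers out of it. My reading, which is an assumption rather than something confirmed in this thread, is that the first pair is the segment being allocated and the second is a segment already recorded as allocated. The variable names are made up for illustration.

```python
import re

# Panic text copied from the kernel message in this post.
PANIC = ("PANIC: zfs: allocating allocated segment"
         "(offset=6859281825792 size=49152) of "
         "(offset=6859281825792 size=49152)")

# Extract every (offset, size) pair as integers.
pairs = [(int(off), int(size))
         for off, size in re.findall(r"offset=(\d+) size=(\d+)", PANIC)]
new_segment, existing_segment = pairs

offset_tib = new_segment[0] / 2**40   # bytes -> TiB
size_kib = new_segment[1] / 2**10     # bytes -> KiB

print("segments identical:", new_segment == existing_segment)
print(f"offset is about {offset_tib:.2f} TiB into the allocation space")
print(f"size is {size_kib:.0f} KiB")
```

Both pairs are identical here, so the same 48 KiB segment appears to be getting allocated twice, which would be consistent with damaged allocation metadata rather than a simple timeout.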

Currently, the zpool import -fFX zfs51 command seems to be still running (the terminal has not returned the prompt to me). However, there doesn’t seem to be any activity on the disks. Running zpool status in another terminal also seems to hang.

I’m not sure what to do at the moment. Should I continue waiting (it has been almost 14 hours since I started the import command), or should I do another hard reset/reboot?
Also, I read that I can potentially import the pool as read-only (zpool import -o readonly=on -f POOLNAME) and salvage the data. Can anyone advise on that?
I’m guessing both of my data disks failed (somehow at the same time). How likely is that, or could it be due to a ZFS issue?
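
For reference, the read-only import mentioned above can be wrapped in a small script so the exact command is printed and reviewed before anything touches the pool. This is only a sketch: the function names readonly_import_cmd and run are hypothetical, the flags are the ones quoted in this post, and it defaults to a dry run.

```python
import shlex
import subprocess


def readonly_import_cmd(pool, force=True):
    """Build the read-only import command quoted in the post."""
    cmd = ["zpool", "import", "-o", "readonly=on"]
    if force:
        cmd.append("-f")
    cmd.append(pool)
    return cmd


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if dry_run:
        return None
    # Needs root and a host where the zpool binary is installed.
    return subprocess.run(cmd, capture_output=True, text=True)


cmd = readonly_import_cmd("zfs51")
run(cmd)  # prints: zpool import -o readonly=on -f zfs51
```

Calling run(cmd, dry_run=False) would actually attempt the import. Whether it succeeds on this pool is a separate question that the script cannot answer.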

mercenary_sysadmin · July 1, 2023, 1:15pm

Yes, it’s entirely possible to have lost the pool by losing both disks overnight (which you said was confirmed by the log messages), since raidz1 can only survive the loss of one disk. Losing two disks simultaneously is fairly uncommon, but it can happen through a single point of failure, such as a power event or a disk controller issue.

I’m very sorry, but that pool is probably toast.