Can this pool be saved?

TL;DR - No. The pool could not be saved in the face of both drives malfunctioning. However, the contents have been restored from recent (previous-day) backups. Details below for your viewing pleasure.

Not a rhetorical question, unfortunately. A couple of months ago one of the drives in the mirror started playing up. When I looked into the warranty, the drive was one day out of warranty. I contacted WD anyway and they provided an RMA number (props to them!). Before I sent it back, I put it in another host and ran diskroaster https://github.com/favoritelotus/diskroaster/ on it, and it performed without error. I put it back in, added it back to the mirror, and watched it resilver and scrub without any issue. I concluded I had a bad cable connection and didn't return it.

Weeks later (and while I was out of town) it stopped responding to SATA commands. On my return, I revived the RMA (which had expired by several days), but before I could pull the drive, the other drive in the mirror started developing reallocated/pending sectors at an alarming rate. The situation was:

hbarta@oak:~$ zpool status tank
  pool: tank
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 06:24:10 with 0 errors on Fri Mar 13 02:27:30 2026
  scan: resilvered (mirror-0) 4.29T in 10:48:18 with 0 errors on Thu Mar 12 20:03:20 2026
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            wwn-0x5000cca278d16d38  FAULTED     71   167   538  too many errors
            wwn-0x5000cca291ea5db6  ONLINE       0     0     0

errors: No known data errors
hbarta@oak:~$ 

The first drive on the list is the one with reallocated sectors and the second one is the one that occasionally goes AWOL.
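(If you want to tie the wwn names in that output to the physical drives, mapping them to serial numbers is a quick check — a sketch, assuming smartmontools is installed:)

ls -l /dev/disk/by-id/ | grep wwn                     # map wwn-* names to sdX devices
smartctl -i /dev/disk/by-id/wwn-0x5000cca278d16d38    # prints the model and serial number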

The situation progressed to:

root@oak:/home/hbarta/Programming/Ansible/Pi# zpool status tank -v
  pool: tank
 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-HC
  scan: resilvered 20.2G in 00:03:44 with 0 errors on Fri Apr  3 13:16:29 2026
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        DEGRADED     0     0     0
          mirror-0                  DEGRADED     6    40     0
            wwn-0x5000cca278d16d38  FAULTED     65   144   193  too many errors
            wwn-0x5000cca291ea5db6  ONLINE       3    44     0

errors: List of errors unavailable: pool I/O is currently suspended
root@oak:/home/hbarta/Programming/Ansible/Pi# 

After a couple of reboots, ZFS is recovering beyond my expectations:

root@oak:~# zpool status tank -v
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 6.07G in 00:38:49 with 0 errors on Mon Apr  6 15:28:41 2026
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            wwn-0x5000cca278d16d38  DEGRADED     5     0     0  too many errors
            wwn-0x5000cca291ea5db6  ONLINE       0     0     0

errors: No known data errors
root@oak:~# 

At present I have another pool on this host that is a copy of tank. I've stopped the processes that use tank (an unexpected advantage of dockerized services) and plan to perform one more backup of tank to drago_standin, export tank, rename drago_standin to tank, and proceed as if everything is normal. Once everything is confirmed working, I'll probably bring up a spare host with sufficient drive capacity and make yet another copy of tank.
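The swap itself should just be an export and a rename-on-import — a sketch using the pool names above (the syncoid invocation is illustrative):

syncoid -r tank drago_standin      # one more backup of tank to the stand-in
zpool export tank                  # take the failing pool offline
zpool export drago_standin         # a pool must be exported to be renamed
zpool import drago_standin tank    # re-import the stand-in under the name tank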

This is more excitement than I really want on a Monday morning.

Note: the drive that originally played up is the second one in the status output.

Before going down the disk replacement path, it may be worthwhile to check the cables and reseat them (after a good round of flexing and dust cleaning). Also, a full long SMART test on the disks (export the pool first and leave the disks running until the tests complete) should help confirm whether it is a disk issue (including aging) or something else.
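Something like this, for each disk (a sketch; substitute your actual device names):

zpool export tank                # quiesce the pool first
smartctl -t long /dev/sda        # start the extended self-test; it runs inside the drive
smartctl -l selftest /dev/sda    # check progress and the final result
smartctl -A /dev/sda             # review reallocated/pending sector counts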

The relatively few errors reported (and the fact that you're mostly able to use the pool) suggest a more transient/external issue (like cabling) rather than a core disk issue, at least at this point.

Hopefully things will end well for you.


Concur. Between the “dead” disk performing fine in a different system, the “not dead” disk dying a few days later, and the status going from not even being able to count IO errors to “no data errors detected”… I’m pretty sure something north of the actual drives in that system is deranged.

Possible culprits include cables, controller, and power supply. RAM or even CPU are possible but strike me as unlikely given the pattern of failures experienced (and the fact that you haven't reported the system, or the apps running on it, locking up or crashing).
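One cheap data point along those lines: the kernel log usually distinguishes link-level problems (which point at cables/controller) from media errors reported by the drive itself. A sketch:

journalctl -k | grep -iE 'ata[0-9]+|hard resetting link'    # link resets implicate cables/controller
smartctl -A /dev/sda | grep -i udma_crc                     # a rising CRC error count also points at cabling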


Thanks both for the suggestions. I cannot rule out cabling or other problems at this point. Here’s a little more back story. The topology was:

Local hosts → local backup server → remote backup server

IOW, the local hosts back up (using syncoid) to a local backup server. The local backup server's pool was then replicated (again using syncoid) to a remote server at my son's house. Then my son moved and the remote server was taken offline. As the offline period stretched on, I created another pool in the local server to stand in for the remote backup. My plan was that when the remote server was back in service, I could simply transport the drives used for the local stand-in to the remote server and continue backups as before. Prior to this issue the topology was:

Local hosts → local backup server → local (stand-in) backup pool
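Each hop in that chain is a plain syncoid replication — a sketch with illustrative host and dataset names (tank and drago_standin are the real pool names from above):

# local host -> local backup server (run on the backup server; "somehost" is illustrative)
syncoid -r root@somehost:rpool/data tank/somehost
# local backup server -> the stand-in pool (formerly -> the remote server over ssh)
syncoid -r tank/somehost drago_standin/somehost
syncoid -r tank/somehost root@remote:tank/somehost    # the original remote leg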

When I started to run into issues with the second HDD in the local backup server, I made one last-ditch attempt to back up to the stand-in, which was unsuccessful. Next I exported both the faulted pool and the stand-in pool and imported the stand-in under the faulted pool's name. After tweaking the mount points to match the original pool, everything seems to work. I was able to back up the local hosts to this pool, and services such as Forgejo and Checkmk are back up and running. I still need to look at NFS, as those attributes seem not to have transferred.
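For the mount points and the NFS attributes, the relevant knobs are dataset properties — a sketch, with an illustrative dataset name:

zfs get -r mountpoint,sharenfs tank    # see what did (and didn't) carry over
zfs set mountpoint=/tank tank          # match the original pool's layout
zfs set sharenfs=on tank/exports       # re-apply NFS sharing where missing ("exports" is illustrative)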

Other things:

  • I have an “experimental” Pi 4B with two HDDs that also has a copy of the pool in question along with other stuff. Though I consider it experimental since it uses USB to connect to the drives, it’s been solid for several years.
  • Earlier this year I copied the pool in question to an HDD and left it at my son’s place. I prodded him a bit and he has brought up the server and slotted that drive in place. I have remote access and plan to update the remote backup using that before I start backing up again over the Internet.
  • I have enough drives to create a pool of sufficient size to backup this pool and plan to do so to provide another local backup.

Once all of these are in place, I will turn my attention back to the failed drives and try to determine if something other than the drives resulted in the issues.

Thanks again. I will follow up.

Edit.0: status update:

  • To the best of my knowledge the local backup pool is working with the stand-in devices.
  • The remote server is back up and the pool has been caught up with local backups as of earlier this year. This morning I will test local → remote backup.
  • I think I will create another pool in the local host using the same power and SATA connectors and back up the main pool to that, in order to determine if there is a problem with those; see the sketch after this list. I will keep that backup running on a schedule similar to the main backup.
  • And then (finally) move on to other home lab things that are more fun and interesting.
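
A minimal sketch of that test pool, assuming two spare drives ("testpool" and the device names are placeholders):

# reuse the suspect power and SATA connectors with known-good spare drives
zpool create -o ashift=12 testpool mirror \
    /dev/disk/by-id/wwn-0xAAAAAAAAAAAAAAAA \
    /dev/disk/by-id/wwn-0xBBBBBBBBBBBBBBBB
syncoid -r tank/somehost testpool/somehost    # then back up on the usual schedule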