Can this pool be saved?

TL;DR - No. The pool could not be saved in the face of both drives malfunctioning. However, the contents have been restored from recent (previous-day) backups. Details below for your viewing pleasure.

Not a rhetorical question, unfortunately. A couple of months ago one of the drives in the mirror started playing up. When I looked into the warranty, the drive was one day out of warranty. I contacted WD anyway and they provided an RMA number (props to them!). Before I sent it back, I put it in another host and ran diskroaster https://github.com/favoritelotus/diskroaster/ on it, and it performed without error. I put it back in, added it back to the mirror, and watched it resilver and scrub without any issue. I concluded I had a bad cable connection and didn't return it.

Weeks later (and while I was out of town) it stopped responding to SATA commands. On my return, I revived the RMA (which had expired by several days), but before I could pull the drive, the other drive in the mirror started developing reallocated/pending sectors at an alarming rate. The situation was:

hbarta@oak:~$ zpool status tank
  pool: tank
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 06:24:10 with 0 errors on Fri Mar 13 02:27:30 2026
  scan: resilvered (mirror-0) 4.29T in 10:48:18 with 0 errors on Thu Mar 12 20:03:20 2026
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            wwn-0x5000cca278d16d38  FAULTED     71   167   538  too many errors
            wwn-0x5000cca291ea5db6  ONLINE       0     0     0

errors: No known data errors
hbarta@oak:~$ 

The first drive on the list is the one with reallocated sectors and the second one is the one that occasionally goes AWOL.
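(If you want to tie the wwn names in that output to the physical drives, mapping them to serial numbers is a quick check — a sketch, assuming smartmontools is installed:)

ls -l /dev/disk/by-id/ | grep wwn                     # map wwn-* names to sdX devices
smartctl -i /dev/disk/by-id/wwn-0x5000cca278d16d38    # prints the model and serial number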

The situation progressed to:

root@oak:/home/hbarta/Programming/Ansible/Pi# zpool status tank -v
  pool: tank
 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-HC
  scan: resilvered 20.2G in 00:03:44 with 0 errors on Fri Apr  3 13:16:29 2026
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        DEGRADED     0     0     0
          mirror-0                  DEGRADED     6    40     0
            wwn-0x5000cca278d16d38  FAULTED     65   144   193  too many errors
            wwn-0x5000cca291ea5db6  ONLINE       3    44     0

errors: List of errors unavailable: pool I/O is currently suspended
root@oak:/home/hbarta/Programming/Ansible/Pi# 

After a couple of reboots, ZFS is recovering beyond my expectations:

root@oak:~# zpool status tank -v
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 6.07G in 00:38:49 with 0 errors on Mon Apr  6 15:28:41 2026
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            wwn-0x5000cca278d16d38  DEGRADED     5     0     0  too many errors
            wwn-0x5000cca291ea5db6  ONLINE       0     0     0

errors: No known data errors
root@oak:~# 

At present I have another pool on this host that is a copy of tank. I've stopped the processes that use tank (an unexpected advantage of dockerized services) and plan to perform one more backup of tank to drago_standin, export tank, rename drago_standin to tank, and proceed as if everything is normal. Once everything is confirmed working, I'll probably bring up a spare host with sufficient drive capacity and make yet another copy of tank.
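The swap itself should just be an export and a rename-on-import — a sketch using the pool names above (the syncoid invocation is illustrative):

syncoid -r tank drago_standin      # one more backup of tank to the stand-in
zpool export tank                  # take the failing pool offline
zpool export drago_standin         # a pool must be exported to be renamed
zpool import drago_standin tank    # re-import the stand-in under the name tank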

This is more excitement than I really want on a Monday morning.

Note: the drive that originally played up is the second one in the status output.

Before going down the disk replacement path, it may be worthwhile to check the cables and reseat them (after a good round of flexing and dust cleaning). Also, a full long SMART test on the disks (export the pool first and leave the disks running until the tests complete) should help confirm whether it is a disk issue (including aging) or something else.
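Something like this, for each disk (a sketch; substitute your actual device names):

zpool export tank                # quiesce the pool first
smartctl -t long /dev/sda        # start the extended self-test; it runs inside the drive
smartctl -l selftest /dev/sda    # check progress and the final result
smartctl -A /dev/sda             # review reallocated/pending sector counts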

The relatively few errors reported (and the fact that you're mostly able to use the pool) suggest a more transient/external issue (like cabling) rather than a core disk issue, at least at this point.

Hopefully things will end well for you.


Concur. Between the “dead” disk performing fine in a different system, the “not dead” disk dying a few days later, and the status going from not even being able to count IO errors to “no data errors detected”… I’m pretty sure something north of the actual drives in that system is deranged.

Possible culprits include cables, controller, and power supply. RAM or even CPU are possible but strike me as unlikely given the pattern of failures experienced (and the fact that you haven't reported the system, or the apps running on it, locking up or crashing).
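One cheap data point along those lines: the kernel log usually distinguishes link-level problems (which point at cables/controller) from media errors reported by the drive itself. A sketch:

journalctl -k | grep -iE 'ata[0-9]+|hard resetting link'    # link resets implicate cables/controller
smartctl -A /dev/sda | grep -i udma_crc                     # a rising CRC error count also points at cabling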


Thanks both for the suggestions. I cannot rule out cabling or other problems at this point. Here’s a little more back story. The topology was:

Local hosts → local backup server → remote backup server

IOW, the local hosts back up (using syncoid) to a local backup server. The local backup server's pool was then replicated (again using syncoid) to a remote server at my son's house. Then my son moved and the remote server was taken offline. As the offline period stretched on, I created another pool in the local server to stand in for the remote backup. My plan was that when the remote server was back in service, I could simply transport the drives used for the local stand-in to the remote server and continue backups as before. Prior to this issue the topology was:

Local hosts → local backup server → local (stand-in) backup pool
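Each hop in that chain is a plain syncoid replication — a sketch with illustrative host and dataset names (tank and drago_standin are the real pool names from above):

# local host -> local backup server (run on the backup server; "somehost" is illustrative)
syncoid -r root@somehost:rpool/data tank/somehost
# local backup server -> the stand-in pool (formerly -> the remote server over ssh)
syncoid -r tank/somehost drago_standin/somehost
syncoid -r tank/somehost root@remote:tank/somehost    # the original remote leg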

When I started to run into issues with the second HDD in the local backup server, I made one last-ditch attempt to back up to the stand-in, which was unsuccessful. Next I exported both the faulted pool and the stand-in pool and imported the stand-in under the faulted pool's name. After tweaking the mount points to match the original pool, everything seems to work. I was able to back up the local hosts to this pool, and services such as Forgejo and Checkmk are back up and running. I still need to look at NFS, as those attributes seem not to have transferred.
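For the mount points and the NFS attributes, the relevant knobs are dataset properties — a sketch, with an illustrative dataset name:

zfs get -r mountpoint,sharenfs tank    # see what did (and didn't) carry over
zfs set mountpoint=/tank tank          # match the original pool's layout
zfs set sharenfs=on tank/exports       # re-apply NFS sharing where missing ("exports" is illustrative)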

Other things:

  • I have an “experimental” Pi 4B with two HDDs that also has a copy of the pool in question along with other stuff. Though I consider it experimental since it uses USB to connect to the drives, it’s been solid for several years.
  • Earlier this year I copied the pool in question to an HDD and left it at my son’s place. I prodded him a bit and he has brought up the server and slotted that drive in place. I have remote access and plan to update the remote backup using that before I start backing up again over the Internet.
  • I have enough drives to create a pool of sufficient size to backup this pool and plan to do so to provide another local backup.

Once all of these are in place, I will turn my attention back to the failed drives and try to determine if something other than the drives resulted in the issues.

Thanks again. I will follow up.

Edit.0: status update:

  • To the best of my knowledge the local backup pool is working with the stand-in devices.
  • The remote server is back up and the pool has been caught up with local backups as of earlier this year. This morning I will test local → remote backup.
  • I think I will create another pool in the local host using the same power and SATA connectors and back up the main pool to that, in order to determine if there is a problem with those; see the sketch after this list. I will keep that backup running on a schedule similar to the main backup.
  • And then (finally) move on to other home lab things that are more fun and interesting.
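
A minimal sketch of that test pool, assuming two spare drives ("testpool" and the device names are placeholders):

# reuse the suspect power and SATA connectors with known-good spare drives
zpool create -o ashift=12 testpool mirror \
    /dev/disk/by-id/wwn-0xAAAAAAAAAAAAAAAA \
    /dev/disk/by-id/wwn-0xBBBBBBBBBBBBBBBB
syncoid -r tank/somehost testpool/somehost    # then back up on the usual schedule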