Pool in FAULTED state after power cycle due to one UNAVAIL disk - cannot offline it, thus cannot replace it

I have a raidz1 pool with 3 disks. I powered down this server to do some physical maintenance unrelated to the disks, and on power-up the pool no longer existed. “zpool import” shows:

   pool: platters
     id: 9995614438679844999
  state: FAULTED
 status: One or more devices contains corrupted data.
 action: The pool cannot be imported due to damaged devices or data.
	The pool may be active on another system, but can be imported using
	the '-f' flag.
   see: http://zfsonlinux.org/msg/ZFS-8000-5E
 config:

	platters                               FAULTED  corrupted data
	  raidz1-0                             FAULTED  corrupted data
	    ata-WDC_WD80EDAZ-11TA3A0_VGG3KDVG  UNAVAIL
	    ata-WDC_WD80EDAZ-11TA3A0_VGGEMKNG  ONLINE
	    sdd                                ONLINE

The UNAVAIL disk appears to be completely dead physically (it no longer even enumerates on the bus). Attempting “zpool import -f platters” gives me:

internal error: Invalid exchange
Aborted

Checking this against the dbgmsg log:

1689207278   spa.c:6242:spa_tryimport(): spa_tryimport: importing platters
1689207278   spa_misc.c:418:spa_load_note(): spa_load($import, config trusted): LOADING
1689207279   vdev.c:152:vdev_dbgmsg(): disk vdev '/dev/disk/by-id/ata-WDC_WD80EDAZ-11TA3A0_VGG3KDVG-part1': open error=2 timeout=1000000881/1000000000
1689207279   vdev.c:152:vdev_dbgmsg(): disk vdev '/dev/sdd1': best uberblock found for spa $import. txg 4332344
1689207279   spa_misc.c:418:spa_load_note(): spa_load($import, config untrusted): using uberblock with txg=4332344
1689207280   vdev.c:152:vdev_dbgmsg(): disk vdev '/dev/disk/by-id/ata-WDC_WD80EDAZ-11TA3A0_VGG3KDVG-part1': open error=2 timeout=1000001090/1000000000
1689207280   vdev.c:155:vdev_dbgmsg(): raidz-0 vdev (guid 17603172039307171397): unable to read the metaslab array [error=52]
1689207280   vdev.c:155:vdev_dbgmsg(): raidz-0 vdev (guid 17603172039307171397): vdev_load: metaslab_init failed [error=52]
1689207280   spa_misc.c:403:spa_load_failed(): spa_load($import, config trusted): FAILED: vdev_load failed [error=52]
1689207280   spa_misc.c:418:spa_load_note(): spa_load($import, config trusted): UNLOADING
1689207284   spa.c:6242:spa_tryimport(): spa_tryimport: importing platters
1689207284   spa_misc.c:418:spa_load_note(): spa_load($import, config trusted): LOADING
1689207285   vdev.c:152:vdev_dbgmsg(): disk vdev '/dev/disk/by-id/ata-WDC_WD80EDAZ-11TA3A0_VGG3KDVG-part1': open error=2 timeout=1000000144/1000000000
1689207285   vdev.c:152:vdev_dbgmsg(): disk vdev '/dev/sdd1': best uberblock found for spa $import. txg 4332344
1689207285   spa_misc.c:418:spa_load_note(): spa_load($import, config untrusted): using uberblock with txg=4332344
1689207286   vdev.c:152:vdev_dbgmsg(): disk vdev '/dev/disk/by-id/ata-WDC_WD80EDAZ-11TA3A0_VGG3KDVG-part1': open error=2 timeout=1000631147/1000000000
1689207286   vdev.c:155:vdev_dbgmsg(): raidz-0 vdev (guid 17603172039307171397): unable to read the metaslab array [error=52]
1689207286   vdev.c:155:vdev_dbgmsg(): raidz-0 vdev (guid 17603172039307171397): vdev_load: metaslab_init failed [error=52]
1689207286   spa_misc.c:403:spa_load_failed(): spa_load($import, config trusted): FAILED: vdev_load failed [error=52]
1689207286   spa_misc.c:418:spa_load_note(): spa_load($import, config trusted): UNLOADING
1689207286   spa.c:6098:spa_import(): spa_import: importing platters
1689207286   spa_misc.c:418:spa_load_note(): spa_load(platters, config trusted): LOADING
1689207287   vdev.c:152:vdev_dbgmsg(): disk vdev '/dev/disk/by-id/ata-WDC_WD80EDAZ-11TA3A0_VGG3KDVG-part1': open error=2 timeout=1000000201/1000000000
1689207287   vdev.c:152:vdev_dbgmsg(): disk vdev '/dev/sdd1': best uberblock found for spa platters. txg 4332344
1689207287   spa_misc.c:418:spa_load_note(): spa_load(platters, config untrusted): using uberblock with txg=4332344
1689207288   vdev.c:152:vdev_dbgmsg(): disk vdev '/dev/disk/by-id/ata-WDC_WD80EDAZ-11TA3A0_VGG3KDVG-part1': open error=2 timeout=1000000921/1000000000
1689207288   vdev.c:155:vdev_dbgmsg(): raidz-0 vdev (guid 17603172039307171397): unable to read the metaslab array [error=52]
1689207288   vdev.c:155:vdev_dbgmsg(): raidz-0 vdev (guid 17603172039307171397): vdev_load: metaslab_init failed [error=52]
1689207288   spa_misc.c:403:spa_load_failed(): spa_load(platters, config trusted): FAILED: vdev_load failed [error=52]
1689207288   spa_misc.c:418:spa_load_note(): spa_load(platters, config trusted): UNLOADING
1689207288   spa_misc.c:418:spa_load_note(): spa_load(platters, config trusted): spa_load_retry: rewind, max txg: 4332343
1689207288   spa_misc.c:418:spa_load_note(): spa_load(platters, config trusted): LOADING
1689207289   vdev.c:152:vdev_dbgmsg(): disk vdev '/dev/disk/by-id/ata-WDC_WD80EDAZ-11TA3A0_VGG3KDVG-part1': open error=2 timeout=1000000796/1000000000
1689207289   vdev.c:152:vdev_dbgmsg(): disk vdev '/dev/sdd1': best uberblock found for spa platters. txg 4332331
1689207289   spa_misc.c:418:spa_load_note(): spa_load(platters, config untrusted): using uberblock with txg=4332331
1689207290   vdev.c:152:vdev_dbgmsg(): disk vdev '/dev/disk/by-id/ata-WDC_WD80EDAZ-11TA3A0_VGG3KDVG-part1': open error=2 timeout=1000000408/1000000000
1689207290   vdev.c:152:vdev_dbgmsg(): disk vdev '/dev/disk/by-id/ata-WDC_WD80EDAZ-11TA3A0_VGGEMKNG-part1': vdev_load: vdev_dtl_load failed [error=52]
1689207290   spa_misc.c:403:spa_load_failed(): spa_load(platters, config trusted): FAILED: vdev_load failed [error=52]
1689207290   spa_misc.c:418:spa_load_note(): spa_load(platters, config trusted): UNLOADING

This seems to suggest that ZFS reads the config/label from one of the good drives, learns that this is a 3-disk array with <dead drive> as a member, and then tries to read <dead drive> and hits a timeout.

Any other zpool command that references the pool fails with “no such pool”, e.g.:

zpool offline platters ata-WDC_WD80EDAZ-11TA3A0_VGG3KDVG
cannot open 'platters': no such pool

All the instructions I can find for replacing a disk seem to assume that the pool is imported and working. I cannot import this pool, so I cannot “offline” the disk and replace it. So, how do I tell the remaining two disks to give up on the third?

I do have a physical replacement disk available, so I am open to either:
- Replace the failed drive and resilver the pool.
- Bring up the two remaining drives by themselves as read-only (this should be possible in raidz1, no?), transfer the data to a backup, and then build a new pool (see my guess at the command below).
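
For the read-only route, I assume the incantation would be something like the following, though I have no idea how it behaves with a member disk missing:

zpool import -o readonly=on -f platters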

zdb’s output:

WARNING: ignoring tunable zfs_arc_max (using 4038935552 instead)
platters:
    version: 5000
    name: 'platters'
    state: 0
    txg: 4331450
    pool_guid: 9995614438679844999
    errata: 0
    hostid: 2088271309
    hostname: 'htpc'
    com.delphix:has_per_vdev_zaps
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 9995614438679844999
        create_txg: 4
        children[0]:
            type: 'raidz'
            id: 0
            guid: 17603172039307171397
            nparity: 1
            metaslab_array: 64
            metaslab_shift: 34
            ashift: 12
            asize: 24004646141952
            is_log: 0
            create_txg: 4
            com.delphix:vdev_zap_top: 129
            children[0]:
                type: 'disk'
                id: 0
                guid: 13660960398252877464
                path: '/dev/disk/by-id/ata-WDC_WD80EDAZ-11TA3A0_VGG3KDVG-part1'
                devid: 'ata-WDC_WD80EDAZ-11TA3A0_VGG3KDVG-part1'
                phys_path: 'pci-0000:00:1f.2-ata-2'
                whole_disk: 1
                DTL: 1402
                create_txg: 4
                com.delphix:vdev_zap_leaf: 130
            children[1]:
                type: 'disk'
                id: 1
                guid: 9598384422309144048
                path: '/dev/disk/by-id/ata-WDC_WD80EDAZ-11TA3A0_VGGEMKNG-part1'
                devid: 'ata-WDC_WD80EDAZ-11TA3A0_VGGEMKNG-part1'
                phys_path: 'pci-0000:00:1f.2-ata-3'
                whole_disk: 1
                DTL: 1400
                create_txg: 4
                com.delphix:vdev_zap_leaf: 131
                faulted: 1
            children[2]:
                type: 'disk'
                id: 2
                guid: 15626403881568574750
                path: '/dev/sdd1'
                devid: 'ata-WDC_WD80EDAZ-11TA3A0_VGG2SPZG-part1'
                phys_path: 'pci-0000:00:1f.2-ata-4'
                whole_disk: 1
                DTL: 1399
                create_txg: 4
                com.delphix:vdev_zap_leaf: 132
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data

The easiest solution is recovery from backup, if that is an option. Either way, I would recommend taking a full block-level backup of the surviving disks (e.g., with ddrescue) before you proceed.
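
Something along these lines for each surviving disk (the image paths are just placeholders; the map file lets ddrescue resume and retry bad sectors):

ddrescue /dev/disk/by-id/ata-WDC_WD80EDAZ-11TA3A0_VGGEMKNG /mnt/backup/VGGEMKNG.img /mnt/backup/VGGEMKNG.map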

Try zpool import with the -f and/or -F flags if you have not already (force import and recovery/rewind mode, respectively).
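
For example, in increasing order of aggressiveness:

zpool import -f platters
zpool import -f -F -n platters   # dry run: reports whether rewind recovery would succeed, without modifying anything
zpool import -f -F platters      # actually attempt the rewind recovery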

If it were just a matter of one bad disk, the pool would be showing DEGRADED, not FAULTED. It looks like there may be metadata corruption on the other disk(s) too. This solution may help: essentially, disable metadata verification and import the pool. Have a pool ready to accept the data (e.g., a mirror, not a single disk) and offload to it.
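
From memory, that workaround amounts to something like this (the spa_load_verify_* module parameters exist in current OpenZFS, but verify them against your version before relying on them):

echo 0 > /sys/module/zfs/parameters/spa_load_verify_metadata
echo 0 > /sys/module/zfs/parameters/spa_load_verify_data
zpool import -o readonly=on -f platters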

In the future, I would recommend you avoid RAID-Z1. If one disk fails and then another disk has any issues/corruption, you have no redundancy to heal that damage.

First, I’d try manually importing with -d /dev/disk/by-id/, since it’s not impossible it’s just making nasty faces at the /dev/sdd name and getting confused. I don’t think it’s especially likely, but it’s a thing I’d try.
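
That is:

zpool import -d /dev/disk/by-id/ platters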

Second, I’d examine the labels on the two remaining disks and see whether they’re wildly out of sync, using zdb -lu [path to partition 1 on disk] on both disks.
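
Something like:

zdb -lu /dev/disk/by-id/ata-WDC_WD80EDAZ-11TA3A0_VGGEMKNG-part1
zdb -lu /dev/sdd1

and then compare the uberblock txg values between the two.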

Your problem is not that you have a missing disk; your problem is that it’s throwing checksum errors while trying to import from the remaining two. (The error=52 in your dbgmsg is EBADE, which OpenZFS uses internally for ECKSUM, and which is also why your import aborted with “internal error: Invalid exchange”.)

You could try bringing it up with zpool import -o readonly=on as a start, but I don’t think I’d necessarily expect any better out of that without knowing more about why it’s sad (I’d turn up the verbosity on zfs_flags before trying again).
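
That would look roughly like this (zfs_flags is a bitmask; 1 should turn on the dprintf output, but check zfs-module-parameters(5) for your version):

echo 1 > /sys/module/zfs/parameters/zfs_flags
zpool import -o readonly=on -f platters
cat /proc/spl/kstat/zfs/dbgmsg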

You could also pull the allegedly entirely dead disk and put it in some other machine or enclosure and see if the problem isn’t the disk at all.

Finally, you could try using -F or -X or -T, but I’d usually take full-disk images first or use -o readonly=on with those, because they can step on things in trying to rewind aggressively, and you’d hate to have stepped on your recovery chances.
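
Roughly, and only after imaging the disks (the -T value below is just the rewind txg from your own dbgmsg output; -T is undocumented and unsupported, so treat it as a last resort):

zpool import -o readonly=on -f -F platters
zpool import -o readonly=on -f -F -X platters
zpool import -o readonly=on -f -T 4332331 platters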

You don’t mention which ZFS version you’re running, or on what platform, so I can’t guess whether some known-since-fixed bug bit you or whether it’s something stranger.