Zpool status unrecoverable error diag ZFS-8000-9P

SirGeorge · March 30, 2025, 11:52am

A single-disk pool used as a remote backup began reporting errors in zpool status:

  pool: rent
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 11:12:15 with 0 errors on Sat Mar 29 21:40:37 2025
config:

        NAME                          STATE     READ WRITE CKSUM
        rent                          ONLINE       0     0     0
          ata-QEMU_HARDDISK_ZVT6TCH0  ONLINE     160     6     0

errors: No known data errors

I ran a scrub, which reported no errors repaired.

This is the SECOND time this has happened. The first was about two weeks ago: I ran a scrub then, which didn’t repair any errors, and then I ran zpool clear on the device.

Output of smartctl:

$ sudo smartctl -a /dev/sda

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-31-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     QEMU HARDDISK
Serial Number:    ZVT6TCH0
Firmware Version: 2.5+
User Capacity:    18,000,207,937,536 bytes [18.0 TB]
Sector Size:      512 bytes logical/physical
TRIM Command:     Available, deterministic
Device is:        Not in smartctl database 7.3/5319
ATA Version is:   ATA/ATAPI-7, ATA/ATAPI-5 published, ANSI NCITS 340-2000
Local Time is:    Sun Mar 30 04:45:11 2025 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  288) seconds.
Offline data collection
capabilities:                    (0x19) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  54) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0003   100   100   006    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       16
  4 Start_Stop_Count        0x0002   100   100   020    Old_age   Always       -       100
  5 Reallocated_Sector_Ct   0x0003   100   100   036    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0003   100   100   000    Pre-fail  Always       -       1
 12 Power_Cycle_Count       0x0003   100   100   000    Pre-fail  Always       -       0
190 Airflow_Temperature_Cel 0x0003   069   069   050    Pre-fail  Always       -       31 (Min/Max 31/31)

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

Selective Self-tests/Logging not supported

What next steps could I take to solve this? Could this be a cabling issue? The errors are in the READ and WRITE columns but not CKSUM. What is that telling me? Do read/write failures get logged, but then retry and succeed?

Or is this disk, which I bought as a decertified drive from ServerPartsDeals last fall, maybe on its way out?

Thanks!

mercenary_sysadmin · March 30, 2025, 1:48pm

Could be the drive, the cabling, or the controller. But what you’re seeing there is the system itself experiencing hard I/O errors when attempting to access the drive. It’s not getting bad data back, it’s throwing literal hardware I/O errors, and it would do so irrespective of filesystem.
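
To make the READ/WRITE versus CKSUM distinction easier to spot, here is a minimal sketch that reads zpool status text on standard input and prints any device whose counters are nonzero. The function name nonzero_counters is hypothetical, and the sketch assumes the five-column layout (NAME STATE READ WRITE CKSUM) shown in the output above.

```shell
#!/bin/sh
# Hypothetical helper: list devices with nonzero error counters.
# Assumes the "NAME STATE READ WRITE CKSUM" column layout of `zpool status`.
nonzero_counters() {
    awk '
        # The device table starts right after the header row.
        $1 == "NAME" && $3 == "READ" { in_table = 1; next }
        # The table ends at the "errors:" summary line.
        $1 == "errors:" { in_table = 0 }
        # Print any device row where at least one counter is not "0".
        in_table && NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0") {
            printf "%s read=%s write=%s cksum=%s\n", $1, $3, $4, $5
        }
    '
}
```

It could be used as zpool status rent | nonzero_counters. Fed the status output posted above, it would print the single line ata-QEMU_HARDDISK_ZVT6TCH0 read=160 write=6 cksum=0.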

I would advise replacing the cable first. If you see the error again, try moving the drive to a different port on the controller. If you’re still seeing issues after both changes, it’s most likely the drive itself that’s the problem.
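
Between each change, it helps to reset the counters with zpool clear (the command the status output itself suggests) so that any new errors clearly belong to the new configuration. As a rough sketch, the filter below keeps kernel log lines that commonly accompany SATA link or block-layer I/O trouble. The function name disk_io_errors and the pattern list are assumptions for illustration, not an exhaustive set, so adjust them to what your kernel actually logs.

```shell
#!/bin/sh
# Hypothetical filter: keep kernel log lines that often accompany
# SATA link resets or block-layer I/O errors.
# The pattern list is an assumption and is not exhaustive.
disk_io_errors() {
    grep -E -i 'I/O error|hard resetting link|failed command|SError|exception Emask'
}
```

It could be used as sudo dmesg | disk_io_errors, or journalctl -k | disk_io_errors on systems with systemd. No output after a cable or port change, together with counters that stay at zero, would suggest the change addressed the problem.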

It is possible to see hardware I/O errors from any number of issues, including power (a bad PSU, or bad power delivered to the PSU from the wall), RAM, and more. In practice, though, you tend to see them most with bad SATA cables, followed by bad drives, followed by bad SATA/SAS controllers/ports, with “everything else” trailing pretty distantly behind those.

karl · April 1, 2025, 10:22am

I had a similar error. It was the result of a damaged SATA cable.
