Updated, now all of my data is corrupt

My pool's current status, for context:

  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 185M in 00:00:11 with 0 errors on Thu Jul 10 21:13:21 2025
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz1-0                  ONLINE       1     0     0
            wwn-0x5000c50066a3d6d3  ONLINE       1     0     0
            wwn-0x50014ee1ac3fa6d5  ONLINE       1     0     0
            wwn-0x50014ee257b8a520  ONLINE       0     0     0

I have a scrub set to run on the first of each month. I don't have notifications set up, so I just manually check, when I have time, that everything went well. When I took a peek last night, everything looked great, and I saw that I had some pending updates as well. I applied those:

================================================================================
 Package               Arch      Version                     Repository    Size
================================================================================
Installing:
 kernel                x86_64    4.18.0-553.58.1.el8_10      baseos        10 M
 kernel-core           x86_64    4.18.0-553.58.1.el8_10      baseos        44 M
 kernel-devel          x86_64    4.18.0-553.58.1.el8_10      baseos        24 M
 kernel-modules        x86_64    4.18.0-553.58.1.el8_10      baseos        36 M
Upgrading:
 bpftool               x86_64    4.18.0-553.58.1.el8_10      baseos        11 M
 kernel-headers        x86_64    4.18.0-553.58.1.el8_10      baseos        12 M
 kernel-tools          x86_64    4.18.0-553.58.1.el8_10      baseos        11 M
 kernel-tools-libs     x86_64    4.18.0-553.58.1.el8_10      baseos        10 M
 libblockdev           x86_64    2.28-7.el8_10               appstream    132 k
 libblockdev-crypto    x86_64    2.28-7.el8_10               appstream     81 k
 libblockdev-fs        x86_64    2.28-7.el8_10               appstream     87 k
 libblockdev-loop      x86_64    2.28-7.el8_10               appstream     70 k
 libblockdev-lvm       x86_64    2.28-7.el8_10               appstream     87 k
 libblockdev-mdraid    x86_64    2.28-7.el8_10               appstream     77 k
 libblockdev-part      x86_64    2.28-7.el8_10               appstream     80 k
 libblockdev-swap      x86_64    2.28-7.el8_10               appstream     72 k
 libblockdev-utils     x86_64    2.28-7.el8_10               appstream     80 k
 pam                   x86_64    1.3.1-37.el8_10             baseos       747 k
 platform-python       x86_64    3.6.8-70.el8_10.rocky.0     baseos        88 k
 python3-libs          x86_64    3.6.8-70.el8_10.rocky.0     baseos       7.8 M
 python3-perf          x86_64    4.18.0-553.58.1.el8_10      baseos        11 M
 sos                   noarch    4.9.2-1.el8_10              baseos       986 k
Removing:
 kernel                x86_64    4.18.0-553.53.1.el8_10      @baseos        0  
 kernel-core           x86_64    4.18.0-553.53.1.el8_10      @baseos       71 M
 kernel-devel          x86_64    4.18.0-553.53.1.el8_10      @baseos       53 M
 kernel-modules        x86_64    4.18.0-553.53.1.el8_10      @baseos       25 M

After applying those, I saw that the update triggered a DKMS rebuild and upgraded ZFS from 2.1.x to 2.2.8-1. I got excited about that, since 2.2 added ntfy support to zed. zpool status said that a pool upgrade was available, so I applied it. I added my credentials for ntfy in zed.rc and tried triggering an alert using sudo zinject -d wwn-0x5000c50066a3d6d3 -e io -T all -f 100 tank, which worked great: I got my ntfy notification. I tried clearing the error by running

sudo zinject -c all
sudo zpool clear tank wwn-0x5000c50066a3d6d3

but immediately after, I started getting more error notifications. They said files were corrupt, in every snapshot… I checked those files and, sure enough, I couldn't open or copy them. I figured that stinks, but it's not the end of the world. Then I checked another file, one not listed as corrupted, and it wouldn't open either; then another. Now I am scared to touch anything. SMART says the drives are fine. Everything was working until I did the above.
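
For reference, the ntfy hookup is just a few lines in /etc/zfs/zed.rc. Mine looks roughly like this, with placeholders below instead of my real topic and token:

ZED_NTFY_TOPIC="my-zfs-alerts"
ZED_NTFY_ACCESS_TOKEN="tk_xxxxxxxx"
ZED_NTFY_URL="https://ntfy.sh"

followed by a sudo systemctl restart zfs-zed to pick up the changes.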

My questions: what did I do wrong, and how can I fix it?

Thank you for any help

I don't have a backup; I'm unfortunately operating on a shoestring budget. Actually, no budget…

What zpool status is showing you is hardware read I/O errors on two of your three disks. And unless you issued a zpool upgrade command yourself after the software upgrade, the upgrade itself did not change any of the data on disk.
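
If you want to confirm those are genuine hardware errors, the kernel log should have matching complaints from the ATA/SCSI layer at around the same timestamps. Something along these lines (the grep pattern is just a starting point):

journalctl -k | grep -iE 'i/o error|blk_update_request|ata[0-9]'

Errors counted in the READ column of zpool status come up from the block layer, so they essentially always leave a trace there.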

Definitely stop using zpool clear; it never fixes anything, it just destroys the record of what's already gone wrong. You never, ever issue a zpool clear until after you're confident that the underlying problem has been successfully resolved.
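
What you should do instead, before anything else, is capture the list of files the pool currently knows are damaged:

sudo zpool status -v tank

The -v flag prints a "Permanent errors have been detected in the following files:" section; save that list somewhere off the pool, because it's your map of the damage.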

Sorry about your data, I know how miserable an experience data loss is. I'd advise you at this point to try booting from a FreeBSD installer thumb drive and importing your pool read-only in that environment. If the pool looks fine and your files are accessible there, then you can try booting back into your regular OS and investigating potential issues there. In particular, double-check that your ZFS kernel module and your ZFS userspace libraries and tools are all the SAME version of ZFS; a mismatch, where one or the other gets updated to a newer version than its counterpart, is one of the more common issues with DKMS-based installations.
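
From the FreeBSD live environment, the read-only import looks something like this (the /mnt altroot is just an example):

zpool import -o readonly=on -R /mnt tank

With readonly=on, nothing gets written back to the pool while you poke around, so you can't make anything worse from there.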

Thank you. I'm guessing that when I injected an error, I may have triggered a resilver? And then the resilver caused two of my drives to fail at once…

Have you checked yet to see whether there is a newer version of the userspace ZFS libraries and commands to match the newer version of the kernel module that DKMS built?

I hate to admit it, but I don't know how to check the userspace tooling version. zfs version returns the same version number for both, if that's what I should be looking at.

It should be; can you post the result here?
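
For reference, that is the right check: the first line of zfs version is the userland tools and the second is the kernel module. On Linux you can also read the loaded module's version straight from the running kernel, which is handy in scripts:

cat /sys/module/zfs/version

If the two lines from zfs version ever disagree, that's the userland/kmod mismatch I was talking about.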

That's interesting. I had an Alma 8 machine fall over like that after I upgraded it. I assumed it was a disk issue, and rebuilt it under Alma 9 with new drives.

Have you tried booting from a live image of something else (Ubuntu, BSD, whatever) to see if the issue persists?

$> zfs version
zfs-2.2.8-1
zfs-kmod-2.2.8-1

I haven't had a chance. I don't have a USB drive, so I'll have to find one to borrow.

I was also looking at moving over to AlmaLinux 9. I use it for all my servers, but I am always real timid about doing big things with my storage OS. smartctl still shows the drives to be good (old, but good), so I will probably use those for the time being, then build a wishlist around serverpartdeals.com if there are no other hardware issues. Once I get a USB drive I'll try a live OS, then probably go with Alma 9, since I don't have to be as cautious at the moment.
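
For the record, this is roughly how I've been checking them (using one of my drives as the example):

sudo smartctl -t long /dev/disk/by-id/wwn-0x5000c50066a3d6d3
sudo smartctl -a /dev/disk/by-id/wwn-0x5000c50066a3d6d3

The long self-test takes a few hours; -a afterwards shows the self-test log plus the reallocated and pending sector counts.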

Yeah, your kernel module and userland utilities are in sync, so that’s not the problem (assuming, of course, you’ve rebooted since you picked up the new kernel).
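
A quick way to double-check the reboot part, in case it's useful: compare the running kernel against the newest one installed, e.g.

uname -r
rpm -q --last kernel | head -1

If uname -r doesn't show the newest installed kernel, the freshly built module isn't the one you're actually running.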

You really need to try importing the pool into a clean installer environment to see if you can access the data.

Booted FreeBSD and it's showing the same thing. It imported the pool, but couldn't mount the primary dataset. With no bad sectors on the drives, I'll call this a loss and reformat.

Sorry about your data, friend.

Please, please, pretty please: start backing up! The solution here isn't "make sure it's a mirror next time", it's "make sure you're doing regular backups next time." I say this with love, not to yell at you. <3
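
Even a single spare disk in another machine gets you most of the way there. A bare-bones sketch of the idea; the snapshot name, hostname, and destination pool below are all placeholders:

zfs snapshot -r tank@2025-07-11
zfs send -R tank@2025-07-11 | ssh backupbox zfs receive -u backup/tank

After the first full send, incrementals (zfs send -I old new) only move the blocks that changed, or you can let a tool like syncoid handle the bookkeeping for you.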

No worries. “The road to data loss is paved with good intentions” or something like that
