Unable to run fsck on zfs disk

xrd · December 1, 2024, 1:42pm

I have a disk that is used inside an Incus (a fork of LXD) VM (qemu). I can no longer get the VM to boot properly; it drops me into an initramfs shell. Running fsck /dev/sda2 -y completes and reports that it found things to fix, but then it errors out and says it cannot write.

11 ref count is 156, should be 135.  Fix? yes

Inode 1584842 ref count is 69, should be 56.  Fix? yes

Pass 5: Checking group summary information
Free blocks count wrong (11599040, counted=11602313).
Fix? yes

Free inodes count wrong (6998947, counted=7000323).
Fix? yes

Error writing file system info: Input/output error

rootfs: ***** FILE SYSTEM WAS MODIFIED *****
(initramfs)

Are there ways I can approach fixing this drive outside of incus?

I can see my pool. It was in a degraded state, but I ran zpool scrub and then zpool clear, and now zpool list indicates things are “good”.

xrd@biggpu:~$ zpool list
NAME            SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
default2        186G  64.0G   122G        -      372G    62%    34%  1.00x    ONLINE  -
incus-default  29.5G   904K  29.5G        -         -     0%     0%  1.00x    ONLINE  -

However, the disk itself isn’t fixed, which isn’t surprising.

Are there steps I can use to get at this disk outside of incus/lxc?

Could I copy the disk to another disk and then try to repair the copy? I saw some other discussions where this error indicated a full disk, but I don’t see that here, and I don’t know how to enlarge that specific disk inside the zpool.
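
The copy-then-repair idea could look roughly like the sketch below. It is a minimal outline, not a tested recovery procedure: `copy_device` is a hypothetical helper name, all paths are placeholders, and it assumes GNU dd and that the VM's second partition is an ext2/3/4 filesystem (as the e2fsck-style output suggests).

```shell
#!/bin/sh
# Sketch: image a possibly failing disk so that repairs run on a copy.
# copy_device is a hypothetical helper, not part of any existing tool.
copy_device() {
    src=$1   # source block device or disk image (placeholder path)
    dst=$2   # destination image file, ideally on a different, healthy disk
    # conv=noerror,sync keeps going after read errors and pads unreadable
    # blocks with zeros so offsets in the copy stay aligned with the source.
    # status=none (GNU dd) hides transfer statistics but still prints errors.
    dd if="$src" of="$dst" bs=65536 conv=noerror,sync status=none
}

# Example use (placeholders; always repair the copy, never the original):
#   copy_device /path/to/vm-disk /mnt/healthy/vm-copy.img
#   loopdev=$(sudo losetup -fP --show /mnt/healthy/vm-copy.img)
#   sudo e2fsck -f "${loopdev}p2"   # assumes partition 2, as with /dev/sda2
```

Working on a copy means a failed repair cannot make the original worse, and the copy lives on storage that is not reporting I/O errors.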

Thanks

xrd · December 1, 2024, 4:57pm

This sounds discouraging.

xrd@biggpu:~$ sudo zpool status default2 -v
  pool: default2
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 02:56:15 with 133 errors on Sun Dec  1 00:13:09 2024
config:

        NAME                                                                  STATE     READ WRITE CKSUM
        default2                                                              ONLINE       0     0     0
          /media/xrd/734204/737204/var/snap/lxd/common/lxd/disks/default.img  ONLINE       0     0 2.40K

errors: Permanent errors have been detected in the following files:
        <0xfc0d>:<0x1>
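
Note that this pool reports state ONLINE while the scan line still records 133 errors, so the HEALTH column of zpool list alone is not enough to judge the data. A small sketch for pulling that count out of saved `zpool status` output follows; `scrub_errors` is a hypothetical helper name, and the pattern assumes the scan line is worded exactly as in the output above.

```shell
#!/bin/sh
# Sketch: extract the scrub error count from `zpool status` text on stdin.
# scrub_errors is a hypothetical helper; it prints nothing if no scan line
# matching the wording "scrub repaired ... with N errors" is found.
scrub_errors() {
    sed -n 's/.*scrub repaired .* with \([0-9][0-9]*\) errors.*/\1/p'
}

# Example use:
#   sudo zpool status default2 | scrub_errors
```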

mercenary_sysadmin · December 1, 2024, 9:22pm

Fsck is for conventional filesystems such as ext4, not for ZFS. ZFS isn’t vulnerable to the type of inconsistency that fsck can detect and fix. I don’t know anything about your distro, so I can’t help any further there. You could maybe try booting from a thumb drive, and running fsck against your OS from there?

Permanent corruption detected in zpool status means exactly what it says it does. If you don’t have a backup to restore from, you’re stuck with whatever you can copy off of the pool as-is.

bladewdr · December 3, 2024, 5:25pm

It’s also concerning that it’s only showing the hex code rather than the actual name of the file that’s corrupted.

What does that normally imply again? That the metadata block for the file is what got corrupted?

mercenary_sysadmin · December 3, 2024, 6:15pm

Yeah, a hex address rather than a file name means that it’s either a metadata block, or possibly a block in a zvol. Not sure about the latter; I don’t use zvols in production so I haven’t had as much chance to observe their behavior in rare corruption events.
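
For anyone trying to track such an entry down, the two hex values in `<0xfc0d>:<0x1>` are a dataset ID and an object number. The sketch below only converts them to decimal; `decode_zfs_error_ref` is a hypothetical helper name. The commented zdb lookup is an assumption about how zdb prints dataset IDs and may not work on every pool (for example, if the dataset no longer exists).

```shell
#!/bin/sh
# Sketch: turn a '<0xDATASET>:<0xOBJECT>' entry from `zpool status -v`
# into decimal "dataset_id object_id". Hypothetical helper, not a ZFS tool.
decode_zfs_error_ref() {
    ref=$1
    ds_hex=${ref%%>*}        # keep everything before the first '>'
    ds_hex=${ds_hex#<}       # drop the leading '<'
    obj_hex=${ref##*<}       # keep everything after the last '<'
    obj_hex=${obj_hex%>}     # drop the trailing '>'
    printf '%d %d\n' "$ds_hex" "$obj_hex"
}

# Example use:
#   decode_zfs_error_ref '<0xfc0d>:<0x1>'
# Possible follow-up (assumes zdb -d lists datasets with an "ID <n>," field):
#   sudo zdb -d default2 | grep 'ID 64525,'
```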

xrd · December 3, 2024, 6:41pm

Not sure what I did, but I was able to run fsck again and the drive came up. Thanks for your help. I backed up my data but it seems to be running fine now. Very strange.