Is it safe to use a replicated copy of a dataset with permanent errors?

McMonster · July 1, 2025, 12:36pm

I’ve recently started getting weird slowdowns and occasional replication errors on one dataset (named ephemeral) on my encrypted main pool (named hdd). The pool overall was still usable, even the affected dataset. More than that, I successfully rsynced the entire pool to a different machine onto a BTRFS filesystem. I also tried to raw send it to another box in the meantime, but that started failing too. Most of the failures reported mismatched snapshots even when names and GUIDs matched. I’ve run scrubs and none of them repaired any data.

My first idea was that it was caused by a dodgy cable or hot-swap bay, so I replaced the cables and plugged the drives directly into a new HBA. The pool still reports permanent errors, but I did sync (not raw this time) the affected dataset to a second, also encrypted pool (named oldhdd) that already had an out-of-sync copy.

Is it safe to use the replicated copy on oldhdd (that’s where I want to keep that dataset) or should I rather recreate the dataset by rsyncing the original or my BTRFS copy? Snapshots are not important. And is it safe to continue using other datasets on hdd?

Current state is below. Yes, I know I need to upgrade the OS, but I want to solve the current issue first.

# zfs --version
zfs-2.3.2-1
zfs-kmod-2.3.2-1

# uname -a
Linux menator 6.14.5-100.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Fri May  2 14:22:13 UTC 2025 x86_64 GNU/Linux

# zpool status -vx oldhdd
pool 'oldhdd' is healthy

# zpool status -vx hdd
  pool: hdd
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Tue Jul  1 13:32:05 2025
        5.50T / 16.2T scanned at 3.07G/s, 30.8G / 16.2T issued at 17.2M/s
        0B repaired, 0.19% done, 11 days 09:49:18 to go
config:

        NAME                        STATE     READ WRITE CKSUM
        hdd                         ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x5000c500c9815a15  ONLINE       0     0     0
            wwn-0x5000039d68d9e476  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        hdd/ephemeral/vtorrent@syncoid_tao_2025-06-27:15:32:47-GMT02:00:<0x1>
        hdd/ephemeral/vtorrent@autosnap_2025-07-01_13:30:35_hourly:<0x1>
        hdd/ephemeral/vtorrent@autosnap_2025-07-01_12:03:46_hourly:<0x1>
        hdd/ephemeral/vtorrent@autosnap_2025-07-01_14:01:05_hourly:<0x1>
        hdd/ephemeral/vtorrent@syncoid_menator_2025-06-30:00:06:27-GMT02:00:<0x1>
        hdd/ephemeral/vtorrent@syncoid_ubuntu-cinnamon_2025-06-30:18:19:53-GMT00:00:<0x1>
        hdd/ephemeral/vtorrent:<0x1>
        hdd/ephemeral/vtorrent@autosnap_2025-07-01_00:00:46_daily:<0x1>
        hdd/ephemeral/vtorrent@autosnap_2025-06-30_00:01:02_daily:<0x1>
        hdd/ephemeral/vtorrent@autosnap_2025-07-01_11:00:17_hourly:<0x1>

mercenary_sysadmin · July 1, 2025, 6:02pm

From the looks of it, all you really need to do is destroy the affected snapshots and the rest of the dataset should be okay.
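
As a rough sketch of what that could look like: the helper below (list_corrupt_snapshots is a made-up name, not a ZFS tool) pulls the snapshot names out of the `zpool status -v` error list so they can be fed to `zfs destroy`. It assumes the entries have the `dataset@snapshot:<0x...>` form shown in the output above, and it deliberately skips the entry for the live dataset, which has no `@`. The error list may only clear after a later scrub completes, so treat the re-check at the end as something to repeat rather than a guarantee.

```shell
# Hypothetical helper: reads `zpool status -v` output on stdin and prints
# the snapshot names (dataset@snapshot) found in the permanent-errors list.
# Assumes entries look like "<dataset>@<snapshot>:<0x1>"; entries that name
# a file path instead of an object ID are not matched.
list_corrupt_snapshots() {
    # Strip leading whitespace and the trailing ":<0x...>" object ID,
    # keep only lines containing "@", and de-duplicate the result.
    sed -n 's/^[[:space:]]*\([^[:space:]]*@[^[:space:]]*\):<0x[0-9a-fA-F]*>$/\1/p' | sort -u
}

# Dry run first: -n makes no changes, -v prints what would be destroyed.
# zpool status -v hdd | list_corrupt_snapshots | while read -r snap; do
#     zfs destroy -nv "$snap"
# done
#
# After destroying for real (drop the -n), scrub again and re-check:
# zpool scrub hdd
# zpool status -v hdd
```

The usage lines are left commented out on purpose, since destroying snapshots cannot be undone; review the dry-run output before removing the `-n`.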

McMonster · July 1, 2025, 8:04pm

Thank you. I assume the replica of this dataset on the other pool should be fine as well. Everything seems to be there when mounted.
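
One way to check that assumption beyond "it mounts" would be to compare the replica against the rsynced copy file by file. This is only a sketch: compare_trees is a made-up helper name and the mountpoints in the example are placeholders, not paths from this thread. Files that changed after the rsync copy was taken will show up as differences without meaning anything is corrupt.

```shell
# Hypothetical helper: compare two mounted directory trees file by file.
# Returns 0 when every file matches, non-zero when any file differs or
# exists on only one side; differing paths are printed by diff.
compare_trees() {
    diff -rq "$1" "$2"
}

# Example with assumed mountpoints (adjust to the real ones):
# compare_trees /oldhdd/ephemeral/vtorrent /mnt/btrfs-copy/ephemeral/vtorrent
#
# A scrub of the receiving pool verifies its on-disk checksums as well:
# zpool scrub oldhdd
# zpool status -v oldhdd
```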