Recurring permanent errors in healthy zpool

Hello everyone,

Since setting up a new server, I have been experiencing some strange behavior.

zpool status shows permanent errors in my snapshots, but no errors are detected for the drives.

  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:06:10 with 0 errors on Sun Oct  1 00:51:49 2023
config:

        NAME                                                         STATE     READ WRITE CKSUM
        rpool                                                        ONLINE       0     0     0
          mirror-0                                                   ONLINE       0     0     0
            nvme-Samsung_SSD_970_EVO_Plus_2TB_S4J4NX0W800973M-part4  ONLINE       0     0     0
            nvme-Samsung_SSD_970_EVO_Plus_2TB_S4J4NX0W820198A-part4  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        rpool/vm/vzsmb01@autosnap_2023-10-01_05:00:14_hourly:<0x0>
        rpool/ROOT/ubuntu/opt/unifi-controller@autosnap_2023-10-01_19:00:04_hourly:<0x0>
        rpool/ROOT/ubuntu/opt/nginx@autosnap_2023-10-01_05:00:15_hourly:<0x0>
        rpool/ROOT/ubuntu@autosnap_2023-10-01_05:00:12_hourly:<0x0>
        rpool/ROOT/ubuntu/opt/thelounge@autosnap_2023-10-01_11:00:12_hourly:<0x0>
        rpool/ROOT/ubuntu/opt/nginx@autosnap_2023-10-01_04:00:04_hourly:<0x0>

The snapshots do indeed have some kind of error, as replication via syncoid fails with either broken pipe or I/O error.

This is the third time this has happened.

The first two times I fixed it by destroying the offending snapshots and scrubbing the pool twice, because the errors remained after the first scrub. zpool clear did nothing, even after deleting the offending snapshots and scrubbing once. Both times the scrub repaired 0B of data.
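
For reference, the recovery sequence looked roughly like this (the snapshot name is just one example from the list above); my understanding is that the error list only drops entries once the offending blocks are gone and a subsequent scrub no longer finds them, which would explain why a single scrub was not enough:

zpool scrub rpool       # first scrub: 0B repaired, permanent errors still listed
zfs destroy rpool/ROOT/ubuntu/opt/nginx@autosnap_2023-10-01_05:00:15_hourly   # one destroy per affected snapshot
zpool clear rpool       # no visible effect on the error list
zpool scrub rpool       # second scrub: 0B repaired, error list finally empty
zpool status -v rpool   # confirm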

I want to find out what is going on, but my google-fu is failing me.

Here are the specs and logs for the machine in question:

System: Lenovo ThinkStation P330 Tiny (latest firmware)
CPU: Intel Core i7-9700T
RAM: G.SKILL 2x 32GB DDR4-2666 CL19 | F4-2666C19D-64GRS (tested extensively with MemTest86 & MemTest86+ before deploying at the start of September)
Disks: 2x Samsung 970 EVO Plus 2TB | MZ-V7S2T0BW (latest firmware)

OS: Ubuntu 22.04.3 LTS
Kernel: Ubuntu HWE 6.2.0-33-generic
zfs: 2.1.9-2ubuntu1.1
sanoid: 2.2.0 (Getopt::Long::GetOptions version 2.52; Perl version 5.34.0 | grabbed the .deb from mantic to get the new version)

The system is a docker and kvm host using root on zfs with zfsbootmenu and native encryption.

Exact command to create the pool:

zpool create -o ashift=12 -o autotrim=on -o autoexpand=on -o compatibility=openzfs-2.1-linux -O encryption=on -O keylocation=file:///dev/disk/by-partlabel/rpool.key -O keyformat=passphrase -O xattr=sa -O acltype=posixacl  -O compression=lz4 -O dnodesize=auto -O normalization=formD -O atime=off -O canmount=off -O mountpoint=/ -R /mnt rpool mirror $DISK0-part3 $DISK1-part3

My disk layout:

GPT fdisk (gdisk) version 1.0.8

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/nvme0n1: 3907029168 sectors, 1.8 TiB
Model: Samsung SSD 970 EVO Plus 2TB            
Sector size (logical/physical): 512/512 bytes
Disk identifier (GUID): B5F94C51-45D6-4C9B-8D26-B325ABAEBB34
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 3907029134
Partitions will be aligned on 2048-sector boundaries
Total free space is 4061 sectors (2.0 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048            2048   512 bytes   8300  rpool.key
   2            4096         4198399   2.0 GiB     EF00  
   3         4198400        71307263   32.0 GiB    FD00  
   4        71307264      3907029134   1.8 TiB     BF00  
NAME        MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINTS
nvme0n1     259:0    0  1.8T  0 disk  
|-nvme0n1p1 259:2    0  512B  0 part  
|-nvme0n1p2 259:3    0    2G  0 part  /boot/efi1
|-nvme0n1p3 259:4    0   32G  0 part  
| `-md0       9:0    0 63.9G  0 raid0 [SWAP]
`-nvme0n1p4 259:5    0  1.8T  0 part  
nvme1n1     259:1    0  1.8T  0 disk  
|-nvme1n1p1 259:6    0  512B  0 part  
|-nvme1n1p2 259:7    0    2G  0 part  /boot/efi
|-nvme1n1p3 259:8    0   32G  0 part  
| `-md0       9:0    0 63.9G  0 raid0 [SWAP]
`-nvme1n1p4 259:9    0  1.8T  0 part 

dmesg
journalctl

Any input is highly appreciated. Thank you in advance for your time.

zpool clear did nothing

No, zpool clear did exactly what it was supposed to do: it cleared the error counts in zpool status, which is (at least one reason) why you see zeroes in the error count columns.
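
To make that concrete, a minimal sketch (assuming your pool name):

zpool clear rpool        # resets the per-vdev READ/WRITE/CKSUM counters shown in zpool status
zpool status -v rpool    # the "Permanent errors" list is tracked separately and only shrinks
                         # once the affected blocks are gone and a scrub no longer finds them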

native encryption

There is at least one open, unsolved bug with native encryption and (raw send) replication, which results in mysterious errors much like the ones you're seeing here. This bug is in ZFS replication, not syncoid itself; syncoid literally can't introduce errors during replication, because it's not in a position to. Once actual replication begins, it's 100% a ZFS operation; all syncoid does is orchestrate building the commands for you.
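
To make the division of labor concrete, the replication syncoid sets up boils down to roughly this (dataset and host names here are placeholders, and the real invocation adds extras like resume support and a pv progress meter); a raw send would simply add -w to the zfs send:

zfs send -I rpool/data@syncoid_old rpool/data@syncoid_new | \
  ssh bkpusr@backuphost zfs receive -s dpool/backup/data

Everything between the send and the receive is pure ZFS; if that stream carries bad blocks, nothing on syncoid's side can fix it.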

Edit: Sorry, I’m still learning Discourse and the best way to respond properly.

@mercenary_sysadmin Sorry, I should have been more clear about what I did.

  1. Noticed syncoid errors in logs
  2. Checked zpool status → no READ/WRITE/CKSUM errors in the output, just permanent errors in snapshots as shown above.
  3. zpool scrub rpool → 0B repaired, still no errors in READ/WRITE/CKSUM columns, permanent errors still there
  4. zfs destroy affected snapshots → permanent errors now show only numbers
  5. zpool clear rpool → permanent errors still there
  6. second zpool scrub rpool → 0B repaired, still no READ/WRITE/CKSUM errors, permanent errors also gone.

Thanks for clearing up my misunderstanding about how syncoid works. After reading the documentation again, the only reference to a (non) raw send is the --preserve-properties argument, which I am not using. Does invoking syncoid with syncoid rpool bkpusr@vzqnap01:dpool/backup/vzsrv02 --recursive --skip-parent --no-sync-snap --preserve-recordsize --no-privilege-elevation --sshkey /home/bkpusr/.ssh/backup --quiet constitute a raw send?
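
In case it helps, I assume one could check this along these lines (not sure it's the canonical method): a raw send passes -w to zfs send, so the dataset should arrive on the target still encrypted under the source's keys, whereas a non-raw send is decrypted in transit.

syncoid --debug <same arguments as above> 2>&1 | grep 'zfs send'    # shows the send command syncoid builds; look for -w
zfs get encryption,encryptionroot,keystatus dpool/backup/vzsrv02    # run on vzqnap01; keystatus "unavailable" without the source key loaded would point to a raw send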

Also, is this the bug you are referring to? cannot send raw incremental send after creation of snapshot at remote end · Issue #8758 · openzfs/zfs · GitHub

Was just about to make a post but this issue seems rather similar to mine.

I’ve been running ZFS on my boot drive for about 2 years now on Ubuntu. I have 2 NVMe drives (both Samsung 980 Pro) mounted directly on the motherboard (ASUS Prime Pro X570). After running for a couple of weeks, I’ve always started to get permanent errors after scrubs. My current install is approximately one month old from a fresh install, and I have 1136 data errors, most of which are just hex codes (<0x1e500>:<0x0>), but a few of them have snapshot names (computername-zroot/ROOT/ubuntu@autosnap_2023-10-05_11:00:08_hourly:<0x0>).

When I started seeing this error a long time ago, I bought the second NVMe (i.e., not from the same batch) and ran a mirrored pool, hoping it would at least add some redundancy and the ability to correct errors. But there were still permanent errors that couldn’t be recovered.

I just ignored the errors for a while, as all important data was backed up on another pool (without any errors), but now I’ve decided to try to get to the bottom of this issue. Currently I’m running a single NVMe disk again for root, with zfsbootmenu.
I use sanoid and syncoid to send snapshots first to a local SSD multiple times a day, and then to a spinning-rust pool twice a day. Native encryption is enabled on all pools, but I’m not using raw sends (or at least I don’t think I am; my command looks similar to @vladnik’s). This setup works for a while until I start getting the same I/O errors described in this post. It can be temporarily resolved by removing or relocating the backup at the destination and starting over, which works until the errors come back after a couple of days or weeks.
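
For context, the schedule is just driven from cron and looks more or less like this (pool names and times here are stand-ins, not my literal crontab):

# /etc/cron.d/zfs-replication (approximate)
0 */4 * * *    root  syncoid --recursive --no-sync-snap zroot ssdpool/backup/zroot
30 6,18 * * *  root  syncoid --recursive --no-sync-snap ssdpool/backup/zroot rustpool/backup/zroot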

I only get errors on my root pool. I’m thinking this might be a hardware problem, possibly due to the NVMe getting hot from heavy writes. I’ve previously managed to knock the NVMe offline entirely by running I/O-heavy tasks (computer completely frozen; on reboot the NVMe didn’t even show up in UEFI until I powered the machine off for ~30 seconds).
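
If anyone wants to check the thermal angle on their own hardware, these are the kinds of commands I mean (a sketch; the device name /dev/nvme0 is assumed):

nvme smart-log /dev/nvme0 | grep -i temp      # nvme-cli: current temperature plus warning/critical temperature time counters
smartctl -a /dev/nvme0 | grep -i temperature  # smartmontools view of the same SMART data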

So to summarize: I’ve tried each of the NVMe drives as a single-disk root pool, and I’ve tried using them mirrored, and I’ve always had the same (or a similar?) issue.

I’m thinking of buying a PCIe NVMe card with proper cooling to see if it solves anything.

As others have mentioned, there is a known issue with native encryption and syncoid replication. I’ve personally experienced the issue on TrueNAS Core, Debian 11, and Ubuntu 22.04. The only solution I have found is to abandon native encryption and create the pool on top of LUKS-encrypted disks. I have seen no errors on the same disks when using LUKS.
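
For anyone considering the same route, the rough shape of it looks like this (a sketch only; device and mapper names are placeholders reusing the partition layout from earlier in the thread, and the zpool options are just examples):

cryptsetup luksFormat /dev/nvme0n1p4
cryptsetup luksFormat /dev/nvme1n1p4
cryptsetup open /dev/nvme0n1p4 luks-rpool0
cryptsetup open /dev/nvme1n1p4 luks-rpool1
# note: no -O encryption=on here; LUKS handles encryption below ZFS
zpool create -o ashift=12 -O compression=lz4 -O xattr=sa -O acltype=posixacl rpool mirror /dev/mapper/luks-rpool0 /dev/mapper/luks-rpool1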

There are two OpenZFS issues I follow hoping to see a fix some day:

Thanks, everyone, for chiming in on this; much appreciated.
I have bitten the bullet and re-set up all my affected servers (3 in total, different hardware) with LUKS instead of native encryption. So far all is well: no more permanent errors :slight_smile: