Good evening,
I’d like to draw Jim’s attention to the discussion of corruption at “ZFS corruption related to snapshots post-2.0.x upgrade” (Issue #12014, openzfs/zfs on GitHub). A number of commenters have identified that the issue seems to be triggered by syncoid, and that is my experience as well. My situation is as follows:
I perform single-filesystem transfers from an encrypted pool to a filesystem on a daily basis. The target filesystem is not encrypted and I do not use raw sends. The commands are:
for f in Programming Documents Archive
do
    time -p /sbin/syncoid --no-privilege-elevation --recursive \
        "hbarta@rocinante:rpool/home/hbarta/$f" \
        "tank/srv/rocinante/$f"
done
These backups never seem to provoke any problems.
From time to time I enable full pool backups to my desktop using the following command:
/bin/time -p /sbin/syncoid --recursive --no-privilege-elevation rocinante:rpool olive:ST8TB-ZA20HR7B/rocinante-backup/rpool
I’ve enabled this at various times over about a year, and eventually I see “permanent errors”, which prompts me to disable the full pool backup. I tried this recently with ZFS 2.2.6 and the problem still happens. Between snapshots rolling off and scrubs, these permanent errors always go away. There doesn’t seem to be any issue with the receiving end, where no errors are ever reported.
ISTR that syncoid doesn’t use a recursive send but walks the filesystem tree and sends each filesystem in a separate operation. My suspicion is that something about the way syncoid operates exposes a race condition in the underlying ZFS code. I also haven’t done anything to prevent concurrent operation of these scripts and/or sanoid snapshots, so that could be leading to the issue as well.
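One way to rule out overlap between these jobs would be to wrap each invocation in flock(1). This is only a minimal sketch, assuming the util-linux flock utility is installed; the lock file path and the run_locked helper name are hypothetical. It serializes only jobs that are started on the same machine and use the same lock file, so by itself it would not stop sanoid on another host from taking snapshots mid-transfer.

```shell
#!/bin/sh
# Hypothetical lock file path; any location writable by the backup user would do.
LOCKFILE="${LOCKFILE:-/tmp/syncoid-backup.lock}"

# run_locked CMD [ARGS...]
# Runs CMD while holding an exclusive lock on $LOCKFILE. A second caller
# blocks until the first finishes, so the wrapped commands never overlap.
run_locked() {
    flock "$LOCKFILE" "$@"
}

# Example use, commented out so the sketch has no side effects:
# run_locked /sbin/syncoid --recursive --no-privilege-elevation \
#     rocinante:rpool olive:ST8TB-ZA20HR7B/rocinante-backup/rpool
```

If both the daily loop and the full pool backup were launched through run_locked with the same lock file, whichever started second would wait for the first to finish. Whether that makes the permanent errors go away is untested; it would only help narrow down whether concurrency is a factor.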
Anyway, if there is anything you can add to the conversation, please do so! This corruption bug has been around for too long.