Troubleshooting zfs recv errors mid-copy

furicle · December 1, 2023, 4:58pm

I have one particular daily syncoid send|recv job between two machines that often fails at some random point after it starts.

The job is usually several hundred gigabytes. Restarting the job works fine, although sometimes the resumed transfer fails after a while too, and I have to resume it again, occasionally several times. I do eventually get a complete copy.
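The restart-until-it-finishes routine described here could be automated with a small wrapper. This is only a sketch: it assumes that re-running the same command resumes the interrupted transfer (as the post describes), and the function name, attempt limit, delay, and dataset names are all hypothetical.

```python
import subprocess
import sys
import time


def run_until_success(cmd, max_attempts=5, delay_seconds=30):
    """Re-run cmd until it exits with status 0.

    Returns the attempt number that succeeded, or None if every
    attempt failed. max_attempts and delay_seconds are arbitrary
    defaults chosen for illustration.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt} failed with exit code {result.returncode}",
              file=sys.stderr)
        # Pause before retrying, but not after the final attempt.
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    return None


# Hypothetical usage; the dataset and host names are placeholders:
# run_until_success(["syncoid", "tank/data", "root@backuphost:tank/data"])
```

This only hides the symptom. The underlying receive failure still needs to be diagnosed, so it is worth keeping the stderr output from each failed attempt.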

The machines are not quality server gear, but they are connected by a wired gigabit network via a couple of decent switches.

The output just reads “cannot receive incremental stream: checksum mismatch or incomplete stream.”

I thought maybe slower would be more reliable. I’ve experimented with the source bandwidth limit flag, the mbuffer size, and with and without compression, without any real change.

Suggestions?

furicle · December 4, 2023, 4:47pm

Oh this is looking, um, fun…

[Mon Dec  4 11:32:48 2023] INFO: task txg_sync:1469 blocked for more than 120 seconds.
[Mon Dec  4 11:32:48 2023]       Tainted: P           OE    --------- -  - 4.18.0-513.9.1.el8_9.x86_64 #1
[Mon Dec  4 11:32:48 2023] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Mon Dec  4 11:32:48 2023] task:txg_sync        state:D stack:0     pid:1469  ppid:2      flags:0x80004000
[Mon Dec  4 11:32:48 2023] Call Trace:

Found that in dmesg while a sync was running.
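To check whether these hung-task warnings line up with the failed transfers, the relevant lines can be pulled out of saved dmesg output. A minimal sketch, assuming the message format matches the excerpt above; the function name and regular expression are illustrative and not part of any ZFS or kernel tooling.

```python
import re

# Matches lines such as:
#   INFO: task txg_sync:1469 blocked for more than 120 seconds.
HUNG_TASK = re.compile(
    r"INFO: task (.+):(\d+) blocked for more than (\d+) seconds"
)


def find_hung_tasks(dmesg_text):
    """Return (task_name, pid, seconds) for each hung-task warning."""
    hits = []
    for line in dmesg_text.splitlines():
        match = HUNG_TASK.search(line)
        if match:
            hits.append(
                (match.group(1), int(match.group(2)), int(match.group(3)))
            )
    return hits


sample = (
    "[Mon Dec  4 11:32:48 2023] INFO: task txg_sync:1469 "
    "blocked for more than 120 seconds."
)
print(find_hung_tasks(sample))  # [('txg_sync', 1469, 120)]
```

Comparing the timestamps on the matching lines with the times the receive failed would show whether the two are related.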

Googling that brings up some long threads elsewhere, but no obvious fixes that I have found so far.

It’s not a lack of RAM: the machine has swap, and none of it is in use…