Syncoid crashing ZFS on target

pgesting · September 2, 2024, 8:23am

I think the root issue is with ZFS, not syncoid, so I am putting this here.

When I run syncoid to sync sanoid snapshots from my main server to my backup, ZFS on the backup crashes, and I can no longer run any zfs or zpool commands. I get the following in dmesg:

[ 1813.431494] INFO: task zfs:3971 blocked for more than 483 seconds.
[ 1813.431555]       Tainted: P         C O      5.15.0-odroid-arm64 #1 Ubuntu 5.15.145-202401081659~jammy
[ 1813.431621] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1813.431675] task:zfs             state:D stack:    0 pid: 3971 ppid:  3968 flags:0x00000004

As can be seen, my target is an odroid device with a ZFS mirror. I ran a scrub, it says no errors. But every time I try to do a zfs send, zfs crashes on the target. The only thing I can do is reboot.

What other diagnostic info can I get that would help debug?

bladewdr · September 3, 2024, 8:18pm

I’d be curious to know if you have the same behavior if you do a replication with just plain old zfs commands.

mercenary_sysadmin · September 3, 2024, 11:53pm

Be kind of impossible not to, since syncoid literally just uses plain old zfs commands under the hood. Since we’re seeing the zfs task itself hanging, there’s really nothing syncoid can do to make that better or worse.

Here’s the thing that leaped out at me:

5.15.0-odroid-arm64

Those builds get less in the way of testing than x86 builds do, for one thing. For another, bittyboxes are a lot more prone to performance issues, in terms of both CPU and bus transport (read: storage controller).

The first thing is figuring out whether it’s a load problem. The easiest way to figure that out is with extremely small replication attempts: create a new dataset on the source, leave it empty, take a snapshot of it, and replicate that snapshot. Did it arrive properly, or did you get the hang you’re seeing above?
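
For reference, the empty-snapshot test could look something like the sketch below. The dataset names (`tank/repltest`, `backuppool/repltest`) and the host `root@backup` are placeholders I made up, not anything from this thread, so swap in your own. The script only prints the commands unless you set `RUN=1`.

```shell
#!/bin/sh
# Hypothetical names: adjust to your own source dataset, backup host and target dataset.
SRC=${SRC:-tank/repltest}
DST_HOST=${DST_HOST:-root@backup}
DST=${DST:-backuppool/repltest}

# Print each command by default; set RUN=1 to actually execute it.
run() {
    if [ "${RUN:-0}" = 1 ]; then
        sh -c "$1"
    else
        printf '%s\n' "$1"
    fi
}

empty_snapshot_test() {
    # New, empty dataset on the source, plus one snapshot of it.
    run "zfs create $SRC" || return 1
    run "zfs snapshot $SRC@empty" || return 1
    # Full send of that snapshot; the target dataset must not exist yet.
    run "zfs send $SRC@empty | ssh $DST_HOST zfs receive $DST" || return 1
    # Confirm the snapshot arrived on the target.
    run "ssh $DST_HOST zfs list -t snapshot -r $DST"
}

empty_snapshot_test
```

If the last command lists `backuppool/repltest@empty` and the target still answers zfs commands afterwards, the smallest possible replication works.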

If the empty snap arrives properly, experiment with more snapshots and larger amounts of data until you get a feel for when you trigger the problem. Once you’ve got a handle on that, you will probably have a pretty good idea whether you’re looking at a simple load issue, or an actual bug, and can proceed accordingly.
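
One way to ramp up the load step by step is sketched below. It assumes the test dataset from an empty-snapshot test already exists on both sides with a snapshot named `@empty`, that the source dataset is mounted at `/tank/repltest`, and that GNU `dd` is available; all names and sizes are placeholders. As written it only prints the commands; set `RUN=1` to execute them.

```shell
#!/bin/sh
# Hypothetical names: adjust to your own datasets, mountpoint and backup host.
SRC=${SRC:-tank/repltest}
SRC_MNT=${SRC_MNT:-/tank/repltest}
DST_HOST=${DST_HOST:-root@backup}
DST=${DST:-backuppool/repltest}

# Print each command by default; set RUN=1 to actually execute it.
run() {
    if [ "${RUN:-0}" = 1 ]; then
        sh -c "$1"
    else
        printf '%s\n' "$1"
    fi
}

# Usage: grow_and_replicate SIZE_IN_MIB...
# For each size: write that much random data, snapshot, send incrementally.
grow_and_replicate() {
    prev=empty   # assumes SRC@empty already exists on source and target
    for size in "$@"; do
        snap="size-${size}M"
        run "dd if=/dev/urandom of=$SRC_MNT/fill_${size}M bs=1M count=$size" || return 1
        run "zfs snapshot $SRC@$snap" || return 1
        # -F rolls the target back to its latest snapshot before receiving.
        run "zfs send -i $SRC@$prev $SRC@$snap | ssh $DST_HOST zfs receive -F $DST" || return 1
        prev=$snap
    done
}

grow_and_replicate 1 10 100 1000
```

With `RUN=1`, the step at which the target stops responding gives a rough idea of how much data it takes to trigger the hang.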

If you come to the conclusion that you’ve got a bug, the next thing is to try reverting to older kernel and ZFS versions and seeing if the behavior goes away. Once you feel confident “the bug appears in version foo.baz but not in foo.bar,” you have successfully bisected to find when the bug or regression was introduced, and you’re in EXCELLENT shape to file a bug report with the project.
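
While you’re swapping versions, it helps to write down what you tested. A minimal sketch, assuming a Linux box where the loaded ZFS module exposes its version at `/sys/module/zfs/version`; the log file name and the `ok`/`hang` labels are just my own convention:

```shell
#!/bin/sh
# Hypothetical log location; keep it on a filesystem outside the affected pool.
LOG=${LOG:-bisect-log.tsv}

# Version of the currently loaded zfs kernel module, or "unknown" if not loaded.
zfs_module_version() {
    if [ -r /sys/module/zfs/version ]; then
        cat /sys/module/zfs/version
    else
        echo unknown
    fi
}

# Usage: record_result ok|hang
# Appends one tab-separated line: kernel version, ZFS module version, result.
record_result() {
    printf '%s\t%s\t%s\n' "$(uname -r)" "$(zfs_module_version)" "$1" >> "$LOG"
}
```

After each replication attempt on a given kernel and ZFS combination, run `record_result ok` or `record_result hang` on the target. The resulting file is the “works on this version, broken on that one” evidence a bug report needs.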

You can, of course, try filing a bug earlier than that and with less information. But the closer you can get to “I bisected and this worked on foo.bar but is broken by foo.baz,” the more likely you’ll get the problem resolved.