ZFS receive zombie processes if syncoid "crashes"

I can’t think of a better title…

I have syncoid replicating datasets from my main Proxmox server to a backup server. I originally ran it from a cron job, but it would crash my server after about two weeks. I think that is a separate, somewhat related problem, so I won’t ask about it just yet.

When I run syncoid and something interrupts it (say I log into my server, start a syncoid command, and then my connection gets cut, and assume that for some reason I am not running tmux or screen on the server), my job gets killed.

On the backup server side, there is still a zfs receive process sitting there. I cannot kill it. The only thing I can do is restart the backup server.

I use the --no-privilege-elevation flag, so it is just my normal user on the backup server, but I still cannot kill the process. That means I cannot start a new syncoid job, and eventually the IO delay on my backup server goes through the roof.
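For reference, here is how I check whether the stuck process is blocked in the kernel (a sketch using standard procps ps options; a STAT of "D" would explain why no signal works):

```shell
# List any lingering zfs processes with their state on the backup server.
# STAT "D" means uninterruptible sleep: the process is blocked inside the
# kernel and ignores every signal, including SIGKILL, until the blocking
# call returns. The fallback echo just keeps this exiting cleanly when
# no zfs process is running.
ps -o pid,stat,wchan:20,args -C zfs || echo "no matching zfs processes"
```

If it shows D, no amount of kill -9 helps from userland, which matches what I'm seeing: a reboot is the only thing that clears it.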

Is there some way to set a timeout or something for the zfs receive command?

In theory, it should already time out after a certain amount of time with no data.
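On the sending side, the best I've come up with so far is detaching the job from the SSH session and bounding its runtime with coreutils timeout (a sketch; the dataset names, remote host, log path, and the 4-hour cap are all placeholders):

```shell
# Detach the syncoid run from the SSH session with nohup so a dropped
# connection does not kill it mid-stream, and bound the runtime with
# coreutils `timeout`, which sends SIGTERM after the limit and exits
# with status 124 if the limit was hit.
# Dataset names, remote host, log path, and the 4h cap are placeholders.
nohup timeout --signal=TERM 4h \
  syncoid --no-privilege-elevation rpool/data user@backup:backup/data \
  >>/var/log/syncoid-run.log 2>&1 &
```

Of course, this only bounds the sender; if the zfs receive on the backup side is already wedged in the kernel, terminating the sender won't clear it.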

I’ve seen issues like what you’re describing, but they’re definitely less common, and I would call that an OpenZFS bug. Typically, most receive-related processes are killable, and the system-controlled ones that aren’t will seppuku themselves honorably once the user-killable ones are gone.

I’d recommend filing this as an OpenZFS issue. Be CERTAIN to give all the information requested about your distribution (obviously Proxmox in this case) and the ZFS version installed.