zfs receive zombie processes if syncoid "crashes"

I can’t think of a better title…

I have syncoid replicating datasets from my main Proxmox server to a backup server. I originally had it set up as a cron job, but it would crash my server after about two weeks. I think that's a separate (though probably related) problem, so I won't ask about it just yet.
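For reference, a rough sketch of that kind of cron entry (the dataset, user, and host names below are placeholders, not my real ones). The flock wrapper is just one way to keep a new run from piling up behind a hung one, not something syncoid does on its own:

```
# /etc/cron.d/syncoid-backup -- placeholder names throughout; cron needs the command on one line
# flock -n exits immediately if a previous run still holds the lock, so runs don't stack up
0 */4 * * * myuser flock -n /tmp/syncoid-backup.lock syncoid --recursive --no-sync-snap --no-privilege-elevation rpool/data myuser@backup-host:backuppool/proxmox
```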

Sometimes I run syncoid and then something happens. Let's say I log into my server, start a syncoid command, and then my connection gets cut (and assume that, for whatever reason, I'm not running tmux or screen or anything on the server), so my job gets killed.

On the backup server side, there is still a zfs receive process sitting there. I cannot kill it; the only thing I can do is restart the backup server.

I use --no-privilege-elevation, so the receive runs as my normal user on the backup server, but I still cannot kill the job, which means I can't start a new syncoid job either. Eventually the IO delay on my backup server goes through the roof.
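In case it helps to see what I'm looking at, this is roughly how I poke at it on the backup side (PIDs are examples). A process in D state (uninterruptible sleep) ignores even SIGKILL, which I assume is why nothing short of a reboot clears it:

```
# Find any zfs receive processes and check their state
# (D = uninterruptible sleep, R = running, S = sleeping)
ps -eo pid,stat,wchan:32,args | grep '[z]fs receive'

# Even as root, SIGKILL has no visible effect on a D-state process;
# the signal isn't delivered until the process leaves that state
kill -9 <pid>

# What the process is actually blocked on in the kernel (needs root)
cat /proc/<pid>/stack
```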

Is there some way to set a timeout or something for the zfs receive command?

In theory, it should already time out after a certain amount of time with no data.
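The closest thing I can think of for a manual (non-syncoid) run is wrapping the receive side in coreutils timeout, something like the sketch below (dataset and host names are made up). Although I suspect that wouldn't help once the process is already stuck in uninterruptible sleep, since timeout only sends signals:

```
# Manual incremental replication with a hard cap on the receive side (names are examples).
# timeout sends SIGTERM after 2 hours, then SIGKILL 60s later if needed --
# but neither signal reaches a process stuck in D state.
zfs send -I tank/data@old tank/data@new | \
  ssh myuser@backup-host 'timeout -k 60 2h zfs receive -s backuppool/data'
```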

I’ve seen issues like what you’re describing, but they’re definitely less common, and I would call that an OpenZFS bug. Typically, most receive-related processes are killable, and the system-controlled ones that aren’t will seppuku themselves honorably once the user-killable ones are gone.

I’d recommend filing this as an OpenZFS issue. Be CERTAIN to give all the information requested about distribution (obviously, Proxmox in this case) and ZFS version installed.
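Something along these lines covers most of what the issue template asks for (run on the box where the receive hangs; pveversion is the Proxmox-specific one):

```
# ZFS userland and kernel module versions
zfs version
cat /sys/module/zfs/version

# Kernel and distribution
uname -a
cat /etc/os-release

# Proxmox package versions
pveversion -v
```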

Okay, I am getting this again. It is happening right now, but this time it's within the same system. I have two pools on my server: a ZFS mirror for my main things, and then a ZFS raid 3 for data. I use syncoid to send snapshots of some of my stuff from the mirror to the larger pool. I now have a zfs receive that is hung. It is using 100% of the CPU, and my IO delay is now at 60%. I cannot kill the zfs receive, or the zfs PID that is in "uninterruptible sleep", which looks to be a sanoid autosync from about 4 hours ago.
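In case it's useful for the bug report, this is a sketch of how I'm trying to capture what the stuck processes are doing (PIDs are examples; the sysrq dump needs root and lands in the kernel log):

```
# Kernel stacks of the hung zfs receive and the zfs process in D state (run as root)
cat /proc/<pid_of_zfs_receive>/stack
cat /proc/<pid_of_zfs>/stack

# Dump all tasks in uninterruptible sleep to the kernel log,
# then pull the traces out of dmesg for the report
echo w > /proc/sysrq-trigger
dmesg | tail -n 200
```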

This is still something you need to address upstream with the OpenZFS devs. Syncoid itself literally can't cause this to happen; it's unwanted behavior in zfs receive, not in syncoid.