Can multiple syncoid jobs run in parallel pulling snapshots from the same remote?

wholteza · August 20, 2023, 7:38am

I’m in the process of setting up my first backup machine (machine 2 from now on) for my main NAS (machine 1 from now on).
The setup consists of two machines, both running ZFS, and I have successfully pulled snapshots for one of my datasets by running this command manually from the CLI:

/usr/sbin/syncoid --no-privilege-elevation --recursive --no-sync-snap --create-bookmark syncoid@<machine1>:tank/users/x backup/tank-replication/users/x

I have a couple of datasets on machine 1, and I only want to pull some of them since I don’t regard all datasets as backup-worthy. Let’s say the structure of the datasets that I want to pull looks something like this:

tank/users/x -> backup/tank-replication/users/x
tank/users/y -> backup/tank-replication/users/y
tank/users/z -> backup/tank-replication/users/z
tank/groups/x -> backup/tank-replication/groups/z
services/x -> backup/services-replication/x
services/y -> backup/services-replication/y

I also use Ansible to set up my machines, so I created a loop that generates one cron job per dataset I want to pull. This resulted in 5 cron jobs that all run at the same time each day, each using the command listed above for its own dataset.

When the cron jobs run, all of them crash.
I can see that ZFS datasets have been created on machine 2, each holding only a small amount of data, maybe 500MB.
This is the journal for machine 2 during that time for one of the jobs:

Aug 19 15:31:01 machine2 CRON[46847]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Aug 19 15:31:01 machine2 CRON[46854]: (root) CMD (/usr/sbin/syncoid --no-privilege-elevation --recursive --no-sync-snap syncoid@<machine1>:tank/users/x backup/tank-replication/users/x)
Aug 19 15:31:21 machine2 sSMTP[47160]: Creating SSL connection to host
Aug 19 15:31:21 machine2 rsyslogd[7602]: action 'action-5-builtin:omfile' resumed (module 'builtin:omfile') [v8.2112.0 try https://www.rsyslog.com/e/2359 ]
Aug 19 15:31:21 machine2 rsyslogd[7602]: action 'action-5-builtin:omfile' suspended (module 'builtin:omfile'), retry 0. There should be messages before this one giving the reason for suspension. 
Aug 19 15:31:21 machine2 sSMTP[47160]: SSL connection using ECDHE_RSA_AES_256_GCM_SHA384
Aug 19 15:31:21 machine2 cron[47160]: sendmail: 550 5.7.1 [M12] User [hosting@<mydomain>] not authorized to send on behalf of <root@<mydomain>> (72b7581b-3ea5-11ee-8622-55333ba73462)
Aug 19 15:31:21 machine2 sSMTP[47160]: 550 5.7.1 [M12] User [hosting@<mydomain>] not authorized to send on behalf of <root@<mydomain>> (72b7581b-3ea5-11ee-8622-55333ba73462)
Aug 19 15:31:21 machine2 CRON[46853]: (root) MAIL (mailed 284 bytes of output but got status 0x0001 from MTA
Aug 19 15:31:21 machine2 CRON[46853]: pam_unix(cron:session): session closed for user root

Journal from machine 1

Aug 19 15:31:03 machine1 systemd[1]: Started Session 13102 of User syncoid.
Aug 19 15:31:03 machine1 sshd[3296152]: Received disconnect from 192.168.2.180 port 48386:11: disconnected by user
Aug 19 15:31:03 machine1 sshd[3296152]: Disconnected from user syncoid 192.168.2.180 port 48386
Aug 19 15:31:03 machine1 sshd[3295806]: pam_unix(sshd:session): session closed for user syncoid
Aug 19 15:31:03 machine1 systemd[1]: session-13103.scope: Deactivated successfully.
Aug 19 15:31:03 machine1 systemd-logind[1256]: Session 13103 logged out. Waiting for processes to exit.
Aug 19 15:31:03 machine1 systemd-logind[1256]: Removed session 13103.

I don’t see anything obvious here, except that an email isn’t being sent.

On top of that, I tried commenting out all but one of the jobs, and it ran to completion without issues.
So the command, connection and permissions work; it synced 2.66TB of data during the night.
But somehow, when multiple jobs run at the same time, they fail.

Is it even possible to pull multiple snapshots from the same pool in parallel using syncoid?
If so, can you give me an example of your setup?

Thanks in advance

Edit: For now, to get anything pulled at all, I created a bash script that has one command per dataset, chained with " && " so that everything runs sequentially instead. My original question still remains.
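
For readers following along, here is a minimal sketch of that sequential workaround written as a loop instead of a " && " chain. One difference worth noting: a " && " chain stops at the first failing command, while this loop keeps going and reports how many pulls failed. The function name pull_sequential, the SYNCOID variable and the "source=destination" pair format are illustrative assumptions, not part of syncoid; the syncoid flags are the ones from the command above.

```shell
#!/usr/bin/env bash
# Sketch only: pull several datasets one at a time, continuing past failures.
# SYNCOID is a hypothetical override so the binary path can be changed.
SYNCOID="${SYNCOID:-/usr/sbin/syncoid}"

# pull_sequential SOURCE_HOST "src_dataset=dst_dataset"...
# Runs one syncoid at a time; the return status is the number of failed pulls.
pull_sequential() {
    local host="$1"; shift
    local pair src dst failed=0
    for pair in "$@"; do
        src="${pair%%=*}"   # text before the first '='
        dst="${pair#*=}"    # text after the first '='
        if ! "$SYNCOID" --no-privilege-elevation --recursive --no-sync-snap \
                --create-bookmark "${host}:${src}" "$dst"; then
            echo "pull failed: ${src} -> ${dst}" >&2
            failed=$((failed + 1))
        fi
    done
    return "$failed"
}
```

It could be called as, for example, pull_sequential syncoid@machine1 tank/users/x=backup/tank-replication/users/x tank/users/y=backup/tank-replication/users/y from a single cron job.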

mercenary_sysadmin · August 20, 2023, 7:34pm

It should work fine with multiple syncoid processes pulling from the same source, but when I investigate now I’m seeing issues, at least when called extremely quickly:

for i in {1..4} ; do 
    syncoid --no-sync-snap rpool/demo/0 rpool/demo/$i & sleep 1 ; 
done

This results in what appears to be a crash in the ZFS send or receive process itself: syncoid terminates, but the send and receive processes themselves are still hanging around (and not actually moving data between each other).

I’ll have to investigate further when I get a chance; this was a very unexpected result since I’ve got plenty of machines in trio sets, where C and B both pull from A, and have not seen any issues there. So I’m not sure what’s different now, whether this is a regression since older versions, or whether this is an edge case that only crops up with VERY rapid invocations.

edit: it looks like this is a bug that crops up when not using --quiet; when I specify --quiet I get the expected result:

root@elden:/# zfs list -r rpool/demo
NAME           USED  AVAIL     REFER  MOUNTPOINT
rpool/demo    1.00G  1.11T       96K  /demo
rpool/demo/0  1.00G  1.11T     1.00G  /demo/0

We have a dataset with 1GiB of data in it, which we will replicate using four parallel syncoid processes to four separate targets.

root@elden:/# for i in {1..4} ; do 
                  syncoid --no-sync-snap --quiet rpool/demo/0 rpool/demo/$i &
              done
[15] 897559
[16] 897560
[17] 897561
[18] 897562
[11]   Done                    syncoid --no-sync-snap --quiet rpool/demo/0 rpool/demo/$i
[12]   Done                    syncoid --no-sync-snap --quiet rpool/demo/0 rpool/demo/$i
[13]   Done                    syncoid --no-sync-snap --quiet rpool/demo/0 rpool/demo/$i
[14]   Done                    syncoid --no-sync-snap --quiet rpool/demo/0 rpool/demo/$i

We used a simple for loop to do our four parallel syncoid runs… but crucially, this time we used --quiet in our argument list.

root@elden:/# zfs list -r rpool/demo
NAME           USED  AVAIL     REFER  MOUNTPOINT
rpool/demo    5.00G  1.10T       96K  /demo
rpool/demo/0  1.00G  1.10T     1.00G  /demo/0
rpool/demo/1  1.00G  1.10T     1.00G  /demo/1
rpool/demo/2  1.00G  1.10T     1.00G  /demo/2
rpool/demo/3  1.00G  1.10T     1.00G  /demo/3
rpool/demo/4  1.00G  1.10T     1.00G  /demo/4

And we can see that our four parallel processes worked properly. I don’t know what bug we’re encountering during the print output phase, but in the meantime, this should help you work around it. If you can confirm that this works for you, would you mind opening up a bug report outlining what you’re encountering over at GitHub? Thanks!
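
The loop above backgrounds each run but never checks whether it succeeded. A small sketch of the same test that also waits for every process and counts failures might look like the following. The function name replicate_parallel and the SYNCOID variable are made up for illustration; the syncoid flags are the ones used in the loop above.

```shell
#!/usr/bin/env bash
# Sketch only: start one syncoid per target in the background, then wait
# for each one and count how many exited with a non-zero status.
SYNCOID="${SYNCOID:-syncoid}"

# replicate_parallel SOURCE TARGET...
replicate_parallel() {
    local src="$1"; shift
    local target pid failed=0
    local pids=()
    for target in "$@"; do
        "$SYNCOID" --no-sync-snap --quiet "$src" "$target" &
        pids+=("$!")            # remember the PID of each background run
    done
    for pid in "${pids[@]}"; do
        wait "$pid" || failed=$((failed + 1))
    done
    echo "$failed of ${#pids[@]} runs failed"
    [ "$failed" -eq 0 ]
}
```

For the demo above this would be invoked as replicate_parallel rpool/demo/0 rpool/demo/1 rpool/demo/2 rpool/demo/3 rpool/demo/4. Note that it only reports exit statuses; it would not detect send and receive processes that are left hanging after syncoid itself has terminated.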

wholteza · August 21, 2023, 8:35pm

@mercenary_sysadmin I tried your suggestion of adding --quiet, but unfortunately the cron jobs still crash.
This is what’s in my root crontab now.

#Ansible: Syncoid pull snapshots services/docker
26 20 * * * /usr/sbin/syncoid --quiet --no-privilege-elevation --recursive --no-sync-snap --create-bookmark syncoid@machine1:services/docker backup/tank-replication/services/docker
#Ansible: Syncoid pull snapshots services/kvm
26 20 * * * /usr/sbin/syncoid --quiet --no-privilege-elevation --recursive --no-sync-snap --create-bookmark syncoid@machine1:services/kvm backup/tank-replication/services/kvm
#Ansible: Syncoid pull snapshots tank/groups/a
26 20 * * * /usr/sbin/syncoid --quiet --no-privilege-elevation --recursive --no-sync-snap --create-bookmark syncoid@machine1:tank/groups/a backup/tank-replication/tank/groups/a
#Ansible: Syncoid pull snapshots tank/users/a
26 20 * * * /usr/sbin/syncoid --quiet --no-privilege-elevation --recursive --no-sync-snap --create-bookmark syncoid@machine1:tank/users/a backup/tank-replication/tank/users/a
#Ansible: Syncoid pull snapshots tank/users/b
26 20 * * * /usr/sbin/syncoid --quiet --no-privilege-elevation --recursive --no-sync-snap --create-bookmark syncoid@machine1:tank/users/b backup/tank-replication/tank/users/b
#Ansible: Syncoid pull snapshots tank/users/c
26 20 * * * /usr/sbin/syncoid --quiet --no-privilege-elevation --recursive --no-sync-snap --create-bookmark syncoid@machine1:tank/users/c backup/tank-replication/tank/users/c
#Ansible: Syncoid pull snapshots tank/users/d
26 20 * * * /usr/sbin/syncoid --quiet --no-privilege-elevation --recursive --no-sync-snap --create-bookmark syncoid@machine1:tank/users/d backup/tank-replication/tank/users/d

Is there anything I can do on my end to get more insight into what is making it crash?
Verbose logging?
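
One possible way to get that insight, sketched here under assumptions: the journal earlier in the thread shows cron trying to mail 284 bytes of job output and the MTA rejecting it, so whatever syncoid printed was lost. A small wrapper that appends each job's output and exit status to its own log file would keep it. The function name run_logged and any log path are hypothetical. Replacing --quiet with syncoid's --debug flag for the diagnostic run should produce more detail.

```shell
#!/usr/bin/env bash
# Sketch only: run a command, appending its stdout, stderr and exit status
# to a log file, and pass the exit status back to the caller.

# run_logged LOGFILE COMMAND [ARGS...]
run_logged() {
    local log="$1"; shift
    local rc=0
    {
        echo "=== $(date '+%F %T') start: $*"
        "$@" || rc=$?
        echo "=== $(date '+%F %T') exit status: $rc"
    } >>"$log" 2>&1
    return "$rc"
}
```

Each cron entry would then call a script that runs something like run_logged /var/log/syncoid-tank-users-a.log /usr/sbin/syncoid --debug --no-privilege-elevation --recursive --no-sync-snap --create-bookmark syncoid@machine1:tank/users/a backup/tank-replication/tank/users/a, with one log file per dataset so that parallel jobs do not interleave their output.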