Best way for a continuous backup to remote servers, with DR in mind?

Hi all!

I’m new to Sanoid/Syncoid, so please bear with me… :slight_smile:

Firstly, thanks and congratulations for such a nice tool!

Secondly, I’m in the process of setting up a PoC with the following:

  1. server-A (where the main application runs)
  2. server-B (backup server)
  3. server-C (where the main application will run if server-A dies)

The main application generates data and logs continuously, so in case of DR, server-C must have access to the latest information and logs; the sooner, the better.

To achieve this, I’ve written a Bash script which runs syncoid to push snapshots to both servers B and C:

#!/bin/bash

pool=PRODPOOL
remote_server1=server-B
remote_server2=server-C
remote_user=syncoid
remote_pool1=BACKUPS    # server-B
remote_pool2=PRODPOOL   # server-C
wait_time=60
syncoid_opts="--no-privilege-elevation --no-sync-snap --create-bookmark --recursive"
[ "$1" = "-d" ] && DEBUG="--debug"   # quote $1 so the test doesn't break when no argument is given

# main
while :
do
        clear
        date
        echo -e "\n===== SEND_TO: $remote_server1"
        syncoid $DEBUG $syncoid_opts $pool $remote_user@$remote_server1:$remote_pool1
        echo -e "\n===== SEND_TO: $remote_server2"
        syncoid $DEBUG  $syncoid_opts $pool $remote_user@$remote_server2:$remote_pool2
        echo -e "\n===== WAITING $wait_time s..."
        sleep $wait_time
done

Although the application on server-A steadily produces data, syncoid shows this message:

===== SEND_TO: server-B
WARN: --no-sync-snap is set, and getnewestsnapshot() could not find any snapshots on source for current dataset. Continuing.
CRITICAL: no snapshots exist on source PRODPOOL, and you asked for --no-sync-snap.
NEWEST SNAPSHOT: autosnap_2023-08-08_14:00:42_hourly
INFO: no snapshots on source newer than autosnap_2023-08-08_14:00:42_hourly on target. Nothing to do, not syncing.
NEWEST SNAPSHOT: autosnap_2023-08-08_14:00:42_hourly
INFO: no snapshots on source newer than autosnap_2023-08-08_14:00:42_hourly on target. Nothing to do, not syncing.

===== SEND_TO: server-C
WARN: --no-sync-snap is set, and getnewestsnapshot() could not find any snapshots on source for current dataset. Continuing.
CRITICAL: no snapshots exist on source PRODPOOL, and you asked for --no-sync-snap.
NEWEST SNAPSHOT: autosnap_2023-08-08_14:00:42_hourly
INFO: no snapshots on source newer than autosnap_2023-08-08_14:00:42_hourly on target. Nothing to do, not syncing.
NEWEST SNAPSHOT: autosnap_2023-08-08_14:00:42_hourly
INFO: no snapshots on source newer than autosnap_2023-08-08_14:00:42_hourly on target. Nothing to do, not syncing.

…which should not be the case as data and logs are produced at all times…

Questions to you, Sanoid/Syncoid gurus out there:

  1. Am I doing this right? If not, which approach would be better?
  2. Are syncoid's snapshots enough?

TIA!

Best

WARN: --no-sync-snap is set, and getnewestsnapshot() could not find any snapshots on source for current dataset. Continuing.

Means exactly what it says on the tin–replication is snapshot-based, you’re not taking any snapshots, and you’ve disabled automatic sync snapshots. The best practice here is to run Sanoid on both source and target, using appropriate templates (production on source, backup or hotspare on target), to take snapshots on the source and prune stale ones on both source and target.

Again, this means Sanoid running locally on both the source and the targets. :slight_smile:
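
In config terms that might look something like this (a minimal sketch for the source side; the pool name comes from your script, the retention numbers are just placeholders to adjust):

# /etc/sanoid/sanoid.conf on server-A (source) -- sketch
[PRODPOOL]
        use_template = production
        recursive = yes

[template_production]
        hourly = 36
        daily = 30
        monthly = 3
        autosnap = yes     # the source takes the snapshots...
        autoprune = yes    # ...and prunes its own stale ones

Each target then gets its own sanoid.conf pointing at the received datasets, using a backup or hotspare template (autosnap = no, autoprune = yes), so it only prunes and never snapshots.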

If I remove the --no-sync-snap flag, I get this instead:

===== SEND_TO: server-B

CRITICAL ERROR: Target BACKUPS exists but has no snapshots matching with PRODPOOL!
Replication to target would require destroying existing
target. Cowardly refusing to destroy your existing target.

      NOTE: Target BACKUPS dataset is < 64MB used - did you mistakenly run
            `zfs create syncoid@server-B:BACKUPS` on the target? ZFS initial
            replication must be to a NON EXISTENT DATASET, which will
            then be CREATED BY the initial replication process.

Sending incremental PRODPOOL/dataset@autosnap_2023-08-08_16:00:43_hourly … syncoid_server-A_2023-08-08:16:30:33-GMT00:00 (~ 19 KB):
157KiB 0:00:00 [1.27MiB/s] [===============================================================================================================================] 823%
Sending incremental PRODPOOL/dataset/logs@autosnap_2023-08-08_16:00:43_hourly … syncoid_server-A_2023-08-08:16:30:34-GMT00:00 (~ 15 KB):
129KiB 0:00:00 [1000KiB/s] [===============================================================================================================================] 857%

===== SEND_TO: server-C

CRITICAL ERROR: Target PRODPOOL exists but has no snapshots matching with PRODPOOL!
Replication to target would require destroying existing
target. Cowardly refusing to destroy your existing target.

      NOTE: Target PRODPOOL dataset is < 64MB used - did you mistakenly run
            `zfs create syncoid@server-C:PRODPOOL` on the target? ZFS initial
            replication must be to a NON EXISTENT DATASET, which will
            then be CREATED BY the initial replication process.

Sending incremental PRODPOOL/dataset@autosnap_2023-08-08_16:00:43_hourly … syncoid_server-A_2023-08-08:16:30:36-GMT00:00 (~ 19 KB):
157KiB 0:00:00 [1.21MiB/s] [===============================================================================================================================] 800%
Sending incremental PRODPOOL/dataset/logs@autosnap_2023-08-08_16:00:43_hourly … syncoid_server-A_2023-08-08:16:30:36-GMT00:00 (~ 15 KB):
130KiB 0:00:00 [ 961KiB/s] [===============================================================================================================================] 828%

===== WAITING 60 s…

which looks even worse… :slight_smile:

Run Sanoid on both source and target servers? I have questions then…
a) doesn’t Sanoid create local snapshots only?
b) how do I interconnect both Sanoid instances?
c) I tried to take snapshots every minute with Sanoid by using this template:

[template_production]
        frequently = 0
        hourly = 5
        daily = 5
        monthly = 1
        yearly = 0
        autosnap = yes
        autoprune = yes
        frequent_period = 1 <-- this here should do the trick, or?

but that didn't work either, which is why I tried my luck with Syncoid… (note that my script runs syncoid every 60 s)

I really need to take and transfer snapshots from the primary node to both the backup and failover node every 60 seconds.

Is that possible at all? If so, could you please provide me/us with some working configs?

TIA!

CRITICAL ERROR: Target BACKUPS exists but has no snapshots matching with PRODPOOL!

Again, means exactly what it says on the tin. Replication is snapshot-based. You cannot replicate from one dataset to another if they do not have any common snapshots.

This works:

root@box:~# zfs create POOLA/mystuff ; zfs snapshot POOLA/mystuff@1
root@box:~# zfs list POOLB/mystuff
cannot open 'POOLB/mystuff': dataset does not exist
root@box:~# syncoid POOLA/mystuff POOLB/mystuff
INFO: Sending oldest full snapshot POOLA/mystuff@1 (~ 42 KB) to new target filesystem:
45.8KiB 0:00:00 [11.2MiB/s] [=================================] 107%            
INFO: Updating new target filesystem with incremental POOLA/mystuff@1 ... syncoid_elden_2023-08-08:12:56:40-GMT-04:00 (~ 102 KB):
52.3KiB 0:00:00 [ 604KiB/s] [================>                 ] 50%  

That works because POOLB/mystuff did not exist yet: the initial full replication created it (with a snapshot matching the oldest snapshot on POOLA/mystuff), and then an incremental replication caught it up to the current state (as captured by the syncoid sync snapshot, since I didn't use --no-sync-snap). All of that happened within the single syncoid command you saw there.

Trying to create the target dataset first does NOT work:

root@box:~# zfs destroy POOLB/mystuff ; zfs create POOLB/mystuff
root@box:~# syncoid POOLA/mystuff POOLB/mystuff

CRITICAL ERROR: Target POOLB/mystuff exists but has no snapshots matching with POOLA/mystuff!
                Replication to target would require destroying existing
                target. Cowardly refusing to destroy your existing target.

          NOTE: Target POOLB/mystuff dataset is < 64MB used - did you mistakenly run
                `zfs create POOLB/mystuff` on the target? ZFS initial
                replication must be to a NON EXISTENT DATASET, which will
                then be CREATED BY the initial replication process.

Note that the error message tells you what you need to know here, just as it did when you received it. :slight_smile:

a) doesn’t Sanoid create local snapshots only?
b) how do I interconnect both Sanoid instances?
c) I tried to take snapshots every minute with Sanoid by using this template:

a) yes
b) they don’t “interconnect”, you just define your policies on both ends as appropriate. For example, you might want both ends to keep 30 hourlies, 30 dailies, and 3 monthlies… or you might want that on the source, but want 60 hourlies, 90 dailies, and 12 monthlies on a much larger backup system. It’ll do what you want, as you tell it to, on both sides. They don’t have to be identical, they just have to maintain at least one matching snapshot (so as not to break the replication chain).
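
As a sketch of that second case, using the numbers from the paragraph above (section and template names are assumptions, adjust to your layout):

# sanoid.conf on the larger backup box -- it keeps more history than the source
[BACKUPS/dataset]
        use_template = backup
        recursive = yes

[template_backup]
        hourly = 60
        daily = 90
        monthly = 12
        autosnap = no      # the backup box never takes its own snapshots...
        autoprune = yes    # ...it only prunes what replication brings in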

    frequently = 0

This tells Sanoid to keep zero “frequently” snapshots. If you want them taken every minute, you probably want something more along the lines of frequently=60… or at the very least frequently=10, because you don’t want to lose the most recent common snapshot on either end.
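
For once-a-minute snapshots, a template fragment along these lines should be closer to what you're after (a sketch; check sanoid.defaults.conf for the frequent_period values your version accepts):

[template_production]
        frequently = 60       # keep 60 "frequently" snapshots...
        frequent_period = 1   # ...taken every 1 minute instead of the default 15
        hourly = 36
        daily = 30
        autosnap = yes
        autoprune = yes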

I really need to take and transfer snapshots from the primary node to both the backup and failover node every 60 seconds.

Is that possible at all? If so, could you please provide me/us with some working configs?

Depends on your system load and network throughput, but it’s probably doable. You want frequently=10 (minimum) on both sides, and you want to schedule a syncoid run (with --no-sync-snap, since you’re taking all these frequentlies) once per minute.
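
Scheduling-wise, that can be as simple as a few cron entries on server-A (a sketch; the paths, lock files, and the syncoid user are assumptions based on your earlier posts, and flock just keeps overlapping runs from piling up):

# /etc/cron.d/zfs-replication on server-A -- sketch
# take the snapshots every minute (skip this line if sanoid already runs from its systemd timer)
* * * * * root /usr/sbin/sanoid --cron
# push to both targets every minute, without sync snapshots
* * * * * syncoid flock -n /tmp/syncoid-B.lock /usr/sbin/syncoid --no-sync-snap --no-privilege-elevation --recursive PRODPOOL syncoid@server-B:BACKUPS
* * * * * syncoid flock -n /tmp/syncoid-C.lock /usr/sbin/syncoid --no-sync-snap --no-privilege-elevation --recursive PRODPOOL syncoid@server-C:PRODPOOL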

If the system gets particularly overloaded, you may not be able to replicate twice in two minutes every now and then, but as long as it's not crushed under overload (and neither is your network), it's not a big deal, because an attempt at replicating again while a prior one is underway will simply fail out safely. So the net effect is not "replication is broken", just "we only replicated eight times in the last ten minutes instead of ten."

If you need a hard guarantee of data being synced in real time between systems, without any possibility of a missed sync, replication won’t cut it, you’ll need proper HA. That can still be on top of ZFS, but you’ll be looking for something like DRBD to sit on top of it, with DRBD handling real-time mirroring and locking as necessary.

Thanks for all your answers. I will try to implement your suggestions.
Perhaps the already informative warning/error messages could also suggest a (potential) solution to the problem they point out.

The NOTE assumes something that didn't happen (a mistaken zfs create)… so perhaps it could suggest a solution as well :wink:

My next idea is to create another PoC using GlusterFS or Ceph for real-time data replication instead of the ZFS-related tools, because there's always a gap between the servers; still, I want to get a better understanding of how ZFS does all this under the hood :slight_smile: Of course, combining GlusterFS and ZFS is also an option, for an extra layer of protection.

Thanks!

Hi @mercenary_sysadmin

I’m a bit confused now…
I created 3 new VMs:

  • zfs-node1, as active node
  • zfs-backup, as backup server
  • zfs-node2, as failover node

…and tried to mimic your example from above:

First, I granted the syncoid user on the backup server all the needed permissions:

root@zfs-backup:~/zfs_test# zfs allow -u syncoid compression,mountpoint,create,mount,receive,send,rollback,destroy BACKUPS

From zfs-node1 I executed this to create a snapshot:

syncoid@zfs-node1:~$ zfs snapshot PRODPOOL/dataset@test1
syncoid@zfs-node1:~$ zfs list -t snapshot
NAME                     USED  AVAIL     REFER  MOUNTPOINT
PRODPOOL/dataset@test1     0B      -       90K  -
syncoid@zfs-node1:~$

Right after that, I executed syncoid as follows to transfer the snapshot to a remote pool on the backup server:

syncoid@zfs-node1:~$ syncoid --debug --no-privilege-elevation PRODPOOL/dataset syncoid@zfs-backup:BACKUPS/dataset
DEBUG: SSHCMD: ssh
DEBUG: checking availability of lzop on source...
DEBUG: checking availability of lzop on target...
DEBUG: checking availability of lzop on local machine...
DEBUG: checking availability of mbuffer on source...
DEBUG: checking availability of mbuffer on target...
DEBUG: checking availability of pv on local machine...
DEBUG: checking availability of zfs resume feature on source...
DEBUG: checking availability of zfs resume feature on target...
DEBUG: syncing source PRODPOOL/dataset to target BACKUPS/dataset.
DEBUG: getting current value of syncoid:sync on PRODPOOL/dataset...
  zfs get -H syncoid:sync 'PRODPOOL/dataset'
DEBUG: checking to see if BACKUPS/dataset on ssh      -S /tmp/syncoid-syncoid@zfs-backup-1691758999-4439 syncoid@zfs-backup is already in zfs receive using ssh      -S /tmp/syncoid-syncoid@zfs-backup-1691758999-4439 syncoid@zfs-backup ps -Ao args= ...
DEBUG: checking to see if target filesystem exists using "ssh      -S /tmp/syncoid-syncoid@zfs-backup-1691758999-4439 syncoid@zfs-backup  zfs get -H name ''"'"'BACKUPS/dataset'"'"'' 2>&1 |"...
DEBUG: getting list of snapshots on PRODPOOL/dataset using   zfs get -Hpd 1 -t snapshot guid,creation 'PRODPOOL/dataset' |...
DEBUG: creating sync snapshot using "  zfs snapshot 'PRODPOOL/dataset'@syncoid_zfs-node1_2023-08-11:13:03:20-GMT00:00
"...
DEBUG: target BACKUPS/dataset does not exist.  Finding oldest available snapshot on source PRODPOOL/dataset ...
DEBUG: getting estimated transfer size from source  using "  zfs send  -nvP 'PRODPOOL/dataset@test1' 2>&1 |"...
DEBUG: sendsize = 46704
INFO: Sending oldest full snapshot PRODPOOL/dataset@test1 (~ 45 KB) to new target filesystem:
DEBUG:  zfs send  'PRODPOOL/dataset'@'test1' | pv -p -t -e -r -b -s 46704 | lzop  | mbuffer  -q -s 128k -m 16M | ssh      -S /tmp/syncoid-syncoid@zfs-backup-1691758999-4439 syncoid@zfs-backup ' mbuffer  -q -s 128k -m 16M | lzop -dfc |  zfs receive  -s -F '"'"'BACKUPS/dataset'"'"''
DEBUG: checking to see if BACKUPS/dataset on ssh      -S /tmp/syncoid-syncoid@zfs-backup-1691758999-4439 syncoid@zfs-backup is already in zfs receive using ssh      -S /tmp/syncoid-syncoid@zfs-backup-1691758999-4439 syncoid@zfs-backup ps -Ao args= ...
 881KiB 0:00:00 [29.6MiB/s] [=============================================================================================================================] 1932%
cannot mount '/BACKUPS/dataset': failed to create mountpoint: Permission denied
CRITICAL ERROR:  zfs send  'PRODPOOL/dataset'@'test1' | pv -p -t -e -r -b -s 46704 | lzop  | mbuffer  -q -s 128k -m 16M | ssh      -S /tmp/syncoid-syncoid@zfs-backup-1691758999-4439 syncoid@zfs-backup ' mbuffer  -q -s 128k -m 16M | lzop -dfc |  zfs receive  -s -F '"'"'BACKUPS/dataset'"'"'' failed: 256 at /usr/sbin/syncoid line 549.
syncoid@zfs-node1:~$

Despite the CRITICAL ERROR shown above, I can see both the snapshot and the contents on the backup server:

syncoid@zfs-backup:~$ zfs list -t snapshot
NAME                    USED  AVAIL     REFER  MOUNTPOINT
BACKUPS/dataset@test1    15K      -       89K  -
syncoid@zfs-backup:~$

root@zfs-backup:~/zfs_test# zfs list
NAME              USED  AVAIL     REFER  MOUNTPOINT
BACKUPS          7.23M  9.20G       24K  /BACKUPS
BACKUPS/dataset   105K  9.20G       89K  /BACKUPS/dataset
root@zfs-backup:~/zfs_test#

root@zfs-backup:~/zfs_test# ls -lrt /BACKUPS/dataset/ | tail -n5
-rw-r--r-- 1 root root 2097152 Aug 11 12:58 fakedata-2023-08-11UTC12:58:57-wt56.log
-rw-r--r-- 1 root root 2097152 Aug 11 12:59 fakedata-2023-08-11UTC12:59:53-wt60.log
-rw-r--r-- 1 root root 2097152 Aug 11 13:00 fakedata-2023-08-11UTC13:00:53-wt16.log
-rw-r--r-- 1 root root 2097152 Aug 11 13:01 fakedata-2023-08-11UTC13:01:09-wt12.log
-rw-r--r-- 1 root root 2097152 Aug 11 13:01 fakedata-2023-08-11UTC13:01:21-wt31.log
root@zfs-backup:~/zfs_test#

After that, I tried to pull the snapshot from the backup server onto zfs-node2.

First, I made sure that syncoid had the right permissions:

root@zfs-node2:~# zfs allow -u syncoid compression,mountpoint,create,mount,receive,snapshot,rollback,destroy PRODPOOL
root@zfs-node2:~#

and then pulled the snapshot:

syncoid@zfs-node2:~$ syncoid --no-privilege-elevation syncoid@zfs-backup:BACKUPS/dataset PRODPOOL/dataset
INFO: Sending oldest full snapshot BACKUPS/dataset@test1 (~ 45 KB) to new target filesystem:
 881KiB 0:00:00 [21.2MiB/s] [===========================================================================] 1953%
cannot mount '/PRODPOOL/dataset': failed to create mountpoint: Permission denied
CRITICAL ERROR: ssh      -S /tmp/syncoid-syncoid@zfs-backup-1691760148-6578 syncoid@zfs-backup ' zfs send  '"'"'BACKUPS/dataset'"'"'@'"'"'test1'"'"' | lzop  | mbuffer  -q -s 128k -m 16M' | mbuffer  -q -s 128k -m 16M | lzop -dfc | pv -p -t -e -r -b -s 46192 |  zfs receive  -s -F 'PRODPOOL/dataset' failed: 256 at /usr/sbin/syncoid line 549.
syncoid@zfs-node2:~$

A quick check of the contents:

root@zfs-node2:~# ls -lrt /PRODPOOL/dataset/ | tail -n5
-rw-r--r-- 1 root root 2097152 Aug 11 12:58 fakedata-2023-08-11UTC12:58:57-wt56.log
-rw-r--r-- 1 root root 2097152 Aug 11 12:59 fakedata-2023-08-11UTC12:59:53-wt60.log
-rw-r--r-- 1 root root 2097152 Aug 11 13:00 fakedata-2023-08-11UTC13:00:53-wt16.log
-rw-r--r-- 1 root root 2097152 Aug 11 13:01 fakedata-2023-08-11UTC13:01:09-wt12.log
-rw-r--r-- 1 root root 2097152 Aug 11 13:01 fakedata-2023-08-11UTC13:01:21-wt31.log
root@zfs-node2:~#

So it seems like only the “mountpoint” permission was not honored on the receiving side…
Other than that, it worked, right?

TIA and have a good weekend!

Correct, everything worked but the mountpoint stuff… and the reason that didn't work is that you need both ZFS permissions and standard system permissions in order to set mountpoints and mount things. So although you successfully delegated the ZFS side of it, you can't do the regular system part of it without sudo.

This is a papercut that you see a lot with delegated replication; the easiest way around it is generally to set canmount=noauto on the targets, so ZFS won’t try to automatically mount them (using the syncoid UID’s permissions, which are insufficient without sudo) during the receive process.
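
A sketch of that workaround, using the dataset names from this thread. Note that canmount is not an inherited property, so it needs to be set on each received dataset rather than only on the pool root; and for the very first replication (where the target dataset doesn't exist yet), the property would have to be applied at receive time instead, e.g. zfs receive -o canmount=noauto, which newer syncoid builds can pass along via --recvoptions (check syncoid --help for your version).

# on zfs-backup, for target datasets that already exist from a previous receive:
zfs set canmount=noauto BACKUPS/dataset
zfs get -r canmount BACKUPS      # verify it's set per dataset, not just on the pool root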

Unfortunately, using canmount=noauto didn’t solve the problem…

syncoid@zfs-node1:~$ zfs snapshot PRODPOOL/dataset@test1
syncoid@zfs-node1:~$ syncoid --no-privilege-elevation PRODPOOL/dataset syncoid@zfs-backup:BACKUPS/dataset
INFO: Sending oldest full snapshot PRODPOOL/dataset@test1 (~ 36 KB) to new target filesystem:
582KiB 0:00:00 [21.0MiB/s] [====================================================================================================================================================] 1613%
cannot mount '/BACKUPS/dataset': failed to create mountpoint: Permission denied
CRITICAL ERROR: zfs send 'PRODPOOL/dataset'@'test1' | pv -p -t -e -r -b -s 36976 | lzop | mbuffer -q -s 128k -m 16M | ssh -S /tmp/syncoid-syncoid@zfs-backup-1692286332-5539 syncoid@zfs-backup ' mbuffer -q -s 128k -m 16M | lzop -dfc | zfs receive -s -F '"'"'BACKUPS/dataset'"'"'' failed: 256 at /usr/sbin/syncoid line 549.
syncoid@zfs-node1:~$

These are the permissions and properties on the pool:

root@zfs-backup:~/zfs_test# zfs allow BACKUPS
---- Permissions on BACKUPS ------------------------------------------
Local+Descendent permissions:
        user syncoid bookmark,compression,create,destroy,mount,mountpoint,receive,rollback,send,snapshot
root@zfs-backup:~/zfs_test# zfs get canmount BACKUPS
NAME     PROPERTY  VALUE     SOURCE
BACKUPS  canmount  noauto    local
root@zfs-backup:~/zfs_test#

Still, the snapshot dataset@test1 makes its way from zfs-node1 to zfs-backup and, eventually, to zfs-node2, as I already showed in my previous post.

Now, if I invert the polarity, that is: zfs-node2 --push-> zfs-backup <--pull-- zfs-node1, things don't work as expected, despite the ZFS permissions being set as described…

First, I create a new snapshot on zfs-node2:

syncoid@zfs-node2:~$ zfs snapshot PRODPOOL/dataset@test2

so I end up with these 2 guys:

syncoid@zfs-node2:~$ zfs list -t snapshot
NAME                                                              USED  AVAIL     REFER  MOUNTPOINT
PRODPOOL/dataset@test1                                              0B      -       65K  -
PRODPOOL/dataset@test2                                              0B      -       65K  -

so far, so good…

Now, when I want to send the snapshots, this happens:

syncoid@zfs-node2:~$ syncoid --no-privilege-elevation PRODPOOL/dataset syncoid@zfs-backup:BACKUPS/dataset
Sending incremental PRODPOOL/dataset@test1 … syncoid_zfs-node2_2023-08-17:15:39:44-GMT00:00 (~ 4 KB):
cannot hold: permission denied
cannot send 'PRODPOOL/dataset': permission denied
624 B 0:00:00 [34.7KiB/s] [============> ] 15%
cannot receive: failed to read from stream
CRITICAL ERROR: zfs send -I 'PRODPOOL/dataset'@'test1' 'PRODPOOL/dataset'@'syncoid_zfs-node2_2023-08-17:15:39:44-GMT00:00' | pv -p -t -e -r -b -s 4096 | lzop | mbuffer -q -s 128k -m 16M | ssh -S /tmp/syncoid-syncoid@zfs-backup-1692286783-4102 syncoid@zfs-backup ' mbuffer -q -s 128k -m 16M | lzop -dfc | zfs receive -s -F '"'"'BACKUPS/dataset'"'"' 2>&1' failed: 256 at /usr/sbin/syncoid line 889.
syncoid@zfs-node2:~$

The ZFS permissions should allow the “send” part, right?

syncoid@zfs-node2:~$ zfs allow PRODPOOL
---- Permissions on PRODPOOL -----------------------------------------
Local+Descendent permissions:
        user syncoid bookmark,compression,create,destroy,mount,mountpoint,receive,rollback,send,snapshot
syncoid@zfs-node2:~$

On the zfs-backup server, I get this instead of dataset@test2:

syncoid@zfs-backup:~$ zfs list -t snapshot
NAME                                                             USED  AVAIL     REFER  MOUNTPOINT
BACKUPS/dataset@test1                                              0B      -       65K  -
BACKUPS/dataset@syncoid_zfs-node2_2023-08-17:15:32:39-GMT00:00     0B      -       65K  -

I’m about to bang my head against the wall…

TIA!