New snapshots created on backup host despite autosnap being turned off

Hey all,

Just noticed that my backup host at home has been failing to run incrementals for the past few weeks. Unfortunately, I now have no choice but to blow the whole thing away and start again from scratch: the incrementals stopped replicating on May 25th, and I only keep 2 weeks of snapshots on my production host. :frowning:

Somehow the SSH host keys got regenerated for the production system, so the SSH connections were failing.

At this point, I'm just curious: why is Sanoid taking hourlies when I have autosnap turned off?

Here's my sanoid.conf on the backup host:

[backup/media]
        use_template = backup
        recursive = yes
        # process_children_only = yes

#############################
# templates below this line #
#############################

[template_backup]

        frequently = 0
        hourly = 72
        daily = 21
        weekly = 3
        monthly = 0
        yearly = 0
        autosnap = no
        autoprune = yes
        hourly_warn = 0s
        hourly_crit = 0s

I only have it set to replicate once a day.

Here's the production host's sanoid.conf:

[data/media]
    use_template = production
    recursive = yes

#############################
# templates below this line #
#############################

[template_production]
       frequently = 0
       hourly = 48
       daily = 14
       weekly = 2
       monthly = 0
       yearly = 0
       autosnap = yes
       autoprune = yes

And here's the syncoid command I'm using to pull:

0 14 * * * /usr/sbin/syncoid --recursive --no-sync-snap --no-privilege-elevation syncoid@10.13.37.101:data/media backup/media

I don't understand why I'm seeing autosnaps from the current date if the replication was failing. This is why my monitoring didn't catch it.

backup/media@autosnap_2024-06-13_13:00:01_hourly     0B      -     8.88T  -
backup/media@autosnap_2024-06-13_14:00:01_hourly     0B      -     8.88T  -
backup/media@autosnap_2024-06-13_15:00:01_hourly     0B      -     8.88T  -
backup/media@autosnap_2024-06-13_16:00:29_hourly     0B      -     8.88T  -
backup/media@autosnap_2024-06-13_17:00:07_hourly     0B      -     8.88T  -
backup/media@autosnap_2024-06-13_18:00:01_hourly     0B      -     8.88T  -
backup/media@autosnap_2024-06-13_19:00:01_hourly     0B      -     8.88T  -
backup/media@autosnap_2024-06-13_20:00:01_hourly     0B      -     8.88T  -
backup/media@autosnap_2024-06-13_21:00:01_hourly     0B      -     8.88T  -
backup/media@autosnap_2024-06-13_22:00:01_hourly     0B      -     8.88T  -
backup/media@autosnap_2024-06-13_23:00:29_hourly     0B      -     8.88T  -
backup/media@autosnap_2024-06-14_00:00:00_daily      0B      -     8.88T  -
backup/media@autosnap_2024-06-14_00:00:00_hourly     0B      -     8.88T  -
backup/media@autosnap_2024-06-14_01:00:29_hourly     0B      -     8.88T  -
backup/media@autosnap_2024-06-14_02:00:29_hourly     0B      -     8.88T  -
backup/media@autosnap_2024-06-14_03:00:01_hourly     0B      -     8.88T  -
backup/media@autosnap_2024-06-14_04:00:01_hourly     0B      -     8.88T  -
backup/media@autosnap_2024-06-14_05:00:01_hourly     0B      -     8.88T  -
backup/media@autosnap_2024-06-14_06:00:01_hourly     0B      -     8.88T  -

Hopefully it's some silly mistake on my part - thankfully this is just media!

Are you certain those are being taken locally on the target, and haven't been replicated in from the source?

If you've got snapshots with the exact same name on the source, try zfs get guid poolname/media@snapshot on both source and target for one of the matching names. If the GUIDs match, those snapshots were taken on the source and replicated to the target. If they don't match, they were taken locally on the target.
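
For example, something like this on each side (the snapshot name here is just a placeholder for one of your matching names):

zfs get -H -o value guid data/media@autosnap_2024-06-14_12:00:01_hourly      # on the source
zfs get -H -o value guid backup/media@autosnap_2024-06-14_12:00:01_hourly    # on the target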

Seems they're totally different.

(The dataset name on the backup is different now because I used zfs rename to move the old dataset aside while I run a full replication. 24 hours to go!)

backup

backup/media2@autosnap_2024-06-14_20:00:01_hourly  guid      1281541376264058739

production

data/media@autosnap_2024-06-14_12:00:01_hourly  guid      15485634101693316372 

The snapshot names also don't all match up, which should be impossible if it were actually replicating.

backup/media2@autosnap_2024-06-14_20:00:01_hourly
data/media@autosnap_2024-06-14_12:00:01_hourly

Those are obviously not matching snapshots; there's no point in comparing GUIDs because they were taken at different times.

You need to compare two snapshots that were created at the same time.

Yeah, that's my bad, misread the snapshot name entirely there.

There's nothing here to compare:

This is from zfs list -t snap backup/media2:

backup/media2@autosnap_2024-06-14_00:00:00_hourly     0B      -     8.88T  -
backup/media2@autosnap_2024-06-14_01:00:29_hourly     0B      -     8.88T  -
backup/media2@autosnap_2024-06-14_02:00:29_hourly     0B      -     8.88T  -
backup/media2@autosnap_2024-06-14_03:00:01_hourly     0B      -     8.88T  -
backup/media2@autosnap_2024-06-14_04:00:01_hourly     0B      -     8.88T  -
backup/media2@autosnap_2024-06-14_05:00:01_hourly     0B      -     8.88T  -
backup/media2@autosnap_2024-06-14_06:00:01_hourly     0B      -     8.88T  -
backup/media2@autosnap_2024-06-14_07:00:29_hourly     0B      -     8.88T  -
backup/media2@autosnap_2024-06-14_08:00:07_hourly     0B      -     8.88T  -
backup/media2@autosnap_2024-06-14_09:00:29_hourly     0B      -     8.88T  -
backup/media2@autosnap_2024-06-14_10:00:21_hourly     0B      -     8.88T  -
backup/media2@autosnap_2024-06-14_11:00:01_hourly     0B      -     8.88T  -
backup/media2@autosnap_2024-06-14_12:00:01_hourly     0B      -     8.88T  -
backup/media2@autosnap_2024-06-14_13:00:01_hourly     0B      -     8.88T  -
backup/media2@autosnap_2024-06-14_14:00:01_hourly     0B      -     8.88T  -
backup/media2@autosnap_2024-06-14_15:00:01_hourly     0B      -     8.88T  -
backup/media2@autosnap_2024-06-14_16:00:29_hourly     0B      -     8.88T  -
backup/media2@autosnap_2024-06-14_17:00:21_hourly     0B      -     8.88T  -
backup/media2@autosnap_2024-06-14_18:00:01_hourly     0B      -     8.88T  -
backup/media2@autosnap_2024-06-14_19:00:29_hourly     0B      -     8.88T  -
backup/media2@autosnap_2024-06-14_20:00:01_hourly     0B      -     8.88T  -

This is from zfs list -t snap data/media:

data/media@autosnap_2024-06-14_00:00:01_daily      0B      -     8.89T  -
data/media@autosnap_2024-06-14_00:00:01_hourly     0B      -     8.89T  -
data/media@autosnap_2024-06-14_01:00:01_hourly     0B      -     8.89T  -
data/media@autosnap_2024-06-14_02:00:01_hourly     0B      -     8.89T  -
data/media@autosnap_2024-06-14_03:00:01_hourly     0B      -     8.89T  -
data/media@autosnap_2024-06-14_04:00:01_hourly     0B      -     8.89T  -
data/media@autosnap_2024-06-14_05:00:01_hourly     0B      -     8.89T  -
data/media@autosnap_2024-06-14_06:00:01_hourly     0B      -     8.89T  -
data/media@autosnap_2024-06-14_07:00:01_hourly     0B      -     8.89T  -
data/media@autosnap_2024-06-14_08:00:01_hourly     0B      -     8.89T  -
data/media@autosnap_2024-06-14_09:00:01_hourly     0B      -     8.89T  -
data/media@autosnap_2024-06-14_10:00:01_hourly     0B      -     8.89T  -
data/media@autosnap_2024-06-14_11:00:01_hourly     0B      -     8.89T  -
data/media@autosnap_2024-06-14_12:00:01_hourly     0B      -     8.89T  -
data/media@autosnap_2024-06-14_13:00:01_hourly     0B      -     8.89T  -
data/media@autosnap_2024-06-14_14:00:02_hourly     0B      -     8.89T  -
data/media@autosnap_2024-06-14_15:00:02_hourly     0B      -     8.89T  -
data/media@autosnap_2024-06-14_16:00:01_hourly     0B      -     8.89T  -
data/media@autosnap_2024-06-14_17:00:01_hourly     0B      -     8.89T  -
data/media@autosnap_2024-06-14_18:00:01_hourly     0B      -     8.89T  -
data/media@autosnap_2024-06-14_19:00:01_hourly     0B      -     8.89T  -
data/media@autosnap_2024-06-14_20:00:01_hourly     0B      -     8.89T  -
data/media@autosnap_2024-06-14_21:00:01_hourly     0B      -     8.89T  -
data/media@autosnap_2024-06-14_22:00:01_hourly     0B      -     8.89T  -

This is all EST, so there also shouldn't be any snapshots at all on the backup host after 1400 hours, since that's when the replication runs every day.

Well, those two match, so it's not like there's nothing to compare.

The fact that your source manages to take its snapshots at :00:01 after the hour every single time, while your target is frequently 28 seconds slower, strongly implies that the ones on the target are being taken locally, though. Even more so if your source is SSD and your backup is rust, or if your backup system or pool is noticeably slower than prod in any other way.
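
A quick way to eyeball that skew is to pull the seconds field out of the snapshot names on each host and tally them; rough sketch, assuming the autosnap naming shown above:

zfs list -H -t snapshot -o name backup/media2 | grep hourly | awk -F'[_:]' '{print $(NF-1)}' | sort | uniq -c

On a source that's keeping up, the counts should pile up on 01 or 02; if the target is taking its own snapshots under load, you'll see the spread you're showing above.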

The next step is going over systemd timers and crontabs to make sure you aren't invoking sanoid with an alternate sanoid.conf, or something like that. Because no, if you're actually using that backup template as shown, and it's the only way the dataset is referenced from sanoid.conf, you should not be seeing snapshots taken locally on the backup system.

You also need to look for any possible extra references to media2 in sanoid.conf: for example, do you define a policy recursively for backup/ as well as a specific policy for backup/media2?
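
Something along these lines should cover it; the paths assume the stock /etc/sanoid location and Debian-style cron directories, adjust as needed:

systemctl cat sanoid.service sanoid.timer                        # see exactly what ExecStart runs
systemctl list-timers --all | grep -i sanoid                     # any other timers firing sanoid?
sudo grep -rni sanoid /etc/cron* /var/spool/cron* 2>/dev/null    # stray cron entries
grep -n '^\[' /etc/sanoid/sanoid.conf                            # every stanza the config defines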

I compared those two snapshots just for a sanity check and they also have totally different GUIDs.

backup/media2@autosnap_2024-06-14_03:00:01_hourly  guid      8939584796767240006
data/media@autosnap_2024-06-14_03:00:01_hourly  guid      14303267950014548961

The name is probably just a coincidence.

What I posted earlier for sanoid.conf was the entire file; there's nothing underneath it.

I don't have any other pools or datasets on this machine; it's purely a backup for my media archive.

This is the only systemd timer I have for sanoid:

āÆ sudo systemctl list-units | grep sanoid
  sanoid.timer
 loaded active waiting   Run Sanoid Every 15 Minutes

I checked all the various users I have on this machine, including root, and didn't find any other crontabs running Sanoid.

I also haven't edited the systemd unit file; it's the default one, enabled with

sudo systemctl enable --now sanoid.timer

Not sure what else I can check. I may just remove --no-sync-snap so that at least if this happens again I'll have a common snapshot and not have to start over from scratch.
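
For reference, that would just be the same cron entry minus that flag:

0 14 * * * /usr/sbin/syncoid --recursive --no-privilege-elevation syncoid@10.13.37.101:data/media backup/media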

I also apparently need to add in some sort of check to make sure that my SSH host keys haven't changed… time to write some bash, I suppose.
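
Probably something along these lines, run on the backup host right before the pull. It's an untested sketch; the ed25519 key type and the known_hosts path are assumptions I'd need to adjust for the user that actually runs syncoid:

#!/usr/bin/env bash
# Untested sketch: bail out (and alert) if the prod host's SSH key no longer matches
# what's pinned in known_hosts, instead of letting the pull silently fail for weeks.
set -u

HOST=10.13.37.101                       # prod host from the syncoid cron line
KNOWN_HOSTS="$HOME/.ssh/known_hosts"    # placeholder; adjust for the pulling user

# Current key offered by the host vs. the key pinned in known_hosts (base64 blobs).
current=$(ssh-keyscan -t ed25519 "$HOST" 2>/dev/null | awk '{print $3}')
pinned=$(ssh-keygen -F "$HOST" -f "$KNOWN_HOSTS" 2>/dev/null | awk '!/^#/ && $2 == "ssh-ed25519" {print $3}')

if [ -z "$current" ] || [ "$current" != "$pinned" ]; then
    echo "SSH host key for $HOST changed (or host unreachable) - not replicating" >&2   # hook email/monitoring here
    exit 1
fi

# ...then run the usual syncoid pull here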

I just reproduced the same issue using 2 Ubuntu VMs and pools of sparse files.

apool = production host with the following sanoid.conf:

[apool/testdata]
        use_template = production
        recursive = yes


#############################
# templates below this line #
#############################

[template_production]
        frequently = 10
        hourly = 36
        daily = 30
        monthly = 3
        yearly = 0
        autosnap = yes
        autoprune = yes
        frequent_period = 5

bpool = backup host with this sanoid.conf:

[bpool/testdata]
        use_template = backups
        recursive = yes

###############
# Templates
###############

[template_backups]
        frequently = 20
        hourly = 36
        daily = 30
        monthly = 3
        yearly = 0
        autosnap = no
        autoprune = yes

Running syncoid every 5 minutes with this command:

*/5 * * * * /usr/sbin/syncoid --recursive --no-privilege-elevation --no-sync-snap syncoid@172.16.10.173:apool/testdata bpool/testdata

I let it run for 30 minutes or so to get some valid snapshots in there, which I verified by checking the GUIDs:

root@sanoid:/home/scott# zfs get guid apool/testdata@autosnap_2024-06-14_19:54:56_monthly
NAME                                                 PROPERTY  VALUE  SOURCE
apool/testdata@autosnap_2024-06-14_19:54:56_monthly  guid      6177939636300960487  -
syncoid@syncoid:/tmp$ zfs get guid bpool/testdata@autosnap_2024-06-14_19:54:56_monthly
NAME                                                 PROPERTY  VALUE  SOURCE
bpool/testdata@autosnap_2024-06-14_19:54:56_monthly  guid      6177939636300960487  -

I then manually regenerated the SSH host keys and waited a while longer, and as you can see there are now snapshots on the bpool host that do not match the source.

root@sanoid:/home/scott# zfs get guid apool/testdata@autosnap_2024-06-15_00:45:01_frequently
NAME                                                    PROPERTY  VALUE  SOURCE
apool/testdata@autosnap_2024-06-15_00:45:01_frequently  guid      656759613778615855  -
syncoid@syncoid:/tmp$ zfs get guid bpool/testdata@autosnap_2024-06-15_00:45:01_frequently
NAME                                                    PROPERTY  VALUE  SOURCE
bpool/testdata@autosnap_2024-06-15_00:45:01_frequently  guid      401481948375157999  -
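
(For anyone wanting to reproduce that part, regenerating the host keys on the prod VM is roughly the following; exact paths and the service name vary a bit by distro.)

sudo rm /etc/ssh/ssh_host_*
sudo ssh-keygen -A              # recreate the default host keys
sudo systemctl restart ssh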

It seems to be specifically the sanoid --cron run that generates the "ghost" snapshots. Weird.

Ooooh. I may have an idea of what's happening here.

The sanoid.conf had a trailing space on the value of autosnap, and I don't think it was being parsed correctly.

'backup/media' => {
                              'autoprune' => 1,
                              'autosnap' => 'no ',
                              'capacity_crit' => '95',
                              'capacity_warn' => '80',
                              'daily' => '21',

When I remove the trailing space, it's parsed correctly:

 'backup/media' => {
                              'autoprune' => 1,
                              'autosnap' => 0,
                              'capacity_crit' => '95',
                              'capacity_warn' => '80',
                              'daily' => '21',
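
In case anyone else hits this: stray trailing whitespace in the config is easy to spot with something like this (path assumes the stock /etc/sanoid location):

grep -nE '[[:blank:]]+$' /etc/sanoid/sanoid.conf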

That would certainly do it!

Yeah, nothing made sense until I ran sanoid manually with --debug.
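
For reference, that was just something along the lines of the command below; --readonly should keep it from actually taking or pruning anything while it dumps the parsed config.

sudo sanoid --debug --readonly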

And I managed to recreate the problem on the VMs because I copy-pasted the config files. :person_facepalming:

Well, that's a lesson learned, I guess.