Sanoid --monitor-snapshots "No valid lockfile found"

SirGeorge · May 3, 2024, 1:55am

Via /etc/cron.d I’m running the following script:

PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin

# Run every hour.
0 */1 * * * root /root/monitor-snapshots.sh

/root/monitor-snapshots.sh:

#!/bin/bash

SANOID=/usr/sbin/sanoid

if ! MONRESULT=$($SANOID --monitor-snapshots); then
        echo $MONRESULT | mutt -s "Error:  Snapshot policy problem." -- postmaster@mymachine.local
fi
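
Editor's sketch, not part of the original post: `$(...)` captures only stdout, so anything sanoid writes to stderr never reaches `MONRESULT` and shows up in cron's own mail instead. The unquoted `echo $MONRESULT` also collapses multi-line output onto one line. A minimal variant, assuming the same `SANOID` path and mail address as above, might look like the following. The helper names `run_and_capture`, `notify`, and `main`, and the `monitor-snapshots` syslog tag, are hypothetical.

```shell
#!/bin/bash
# Sketch only: a variant of monitor-snapshots.sh that also captures stderr.
# SANOID and MAILTO mirror the original script; the function names are
# hypothetical helpers, not anything shipped with sanoid.

SANOID=${SANOID:-/usr/sbin/sanoid}
MAILTO=${MAILTO:-postmaster@mymachine.local}

# Run a command, capture stdout and stderr together, keep its exit status.
run_and_capture() {
    local output status
    output=$("$@" 2>&1)
    status=$?
    printf '%s\n' "$output"
    return "$status"
}

# Mail the report; if the mailer fails, log to syslog so the alert is not lost.
notify() {
    local subject=$1 body=$2
    if ! printf '%s\n' "$body" | mutt -s "$subject" -- "$MAILTO"; then
        logger -t monitor-snapshots "mail delivery failed: $body"
    fi
}

main() {
    local result
    if ! result=$(run_and_capture "$SANOID" --monitor-snapshots); then
        notify "Error:  Snapshot policy problem." "$result"
    fi
}

# Run main only when executed directly, not when sourced.
if [[ "${BASH_SOURCE[0]}" == "$0" ]]; then
    main
fi
```

With this shape, an error that sanoid prints on stderr would land in the mailed body, and a failed SMTP hand-off would still leave a trace in syslog.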

On the physical box which runs this script (a Proxmox host) I’ve installed sSMTP with the following config:

root=postmaster
mailhub=mailrise.local:8025
hostname=storagebox.local
UseTLS=No
UseSTARTTLS=No
FromLineOverride=YES

And mailrise.local is a separate VM running mailrise which acts as an SMTP gateway for Discord.

Randomly, I get the following errors:

Cron <root@pve> /root/monitor-snapshots.sh  (root)
ERROR: No valid lockfile found - Did a rogue process or user update or delete it?
sendmail: 450 failed to send notification
Error sending message, child exited 1 ().
Could not send the message.

After getting such an alert, if I ssh into the machine and run sanoid --monitor-snapshots manually, there is no error.

My /etc/sanoid/sanoid.conf has multiple datasets all configured like this:

[tank/some-dataset]
        use_template = shortTemplate

[other/some-dataset]
        use_template = shortTemplate
        autosnap = no
        hourly_warn = 1500m
        hourly_crit = 1500m
        daily_warn = 36h
        daily_crit = 48h

[template_shortTemplate]
        frequently = 0
        hourly = 1
        daily = 3
        weekly = 3
        monthly = 3
        yearly = 0
        autosnap = yes
        autoprune = yes

tank is a 2-disk mirror and other is a 2-disk mirror. syncoid is used to send snapshots from tank → other, and I have sanoid --monitor-snapshots running to ensure other gets those snapshots properly.

I’ll admit I’m probably doing something wrong here, and this setup (with sSMTP and mailrise in the mix) isn’t terribly straightforward. I use this mailrise VM to funnel a lot of other notifications into Discord, though, and it works fine in those cases.

Appreciate any pointers in the right direction on how to go about tracing this down, if it is a problem with how I’ve set up sanoid.

mercenary_sysadmin · May 3, 2024, 1:10pm

I suspect it means exactly what it says: something is messing with your lock file. Most likely, Proxmox periodically deletes it for some reason.

SirGeorge · May 3, 2024, 2:01pm

Thanks! Sanoid is writing its locks at $run_dir/$lockname.lock where $run_dir = /var/run/sanoid, correct?

EDIT: To add a little more testing as I try to figure out where/how a .lock file is being deleted.

In Terminal #1, I’ll run:

inotifywait -m /var/run/sanoid/

And then in Terminal #2 I’ll try any of the following:

sanoid --monitor-snapshots
/root/monitor-snapshots.sh

Neither generates any output in Terminal #1. However, touch /var/run/sanoid/testfile does.

root:~# inotifywait -m /var/run/sanoid/
Setting up watches.
Watches established.
/var/run/sanoid/ CREATE testfile
/var/run/sanoid/ OPEN testfile
/var/run/sanoid/ ATTRIB testfile
/var/run/sanoid/ CLOSE_WRITE,CLOSE testfile
/var/run/sanoid/ DELETE testfile

Does sanoid --monitor-snapshots create a lock file? Am I looking in the wrong place with /var/run/sanoid?

mercenary_sysadmin · May 3, 2024, 7:26pm

/var/run/sanoid is normally where lockfiles live, but it’s possible that Proxmox is messing with its contents, or that sanoid (however you’ve got it installed, and under whatever user context it’s running) does not have privileges to that folder.

More importantly, sanoid --monitor-snapshots won’t mess with the lockfile in any way, because it’s not an operation that either blocks other operations or is blocked by other operations, so it can just run regardless.

The lockfile only gets invoked when sanoid is creating or destroying snapshots, or (more importantly) when regenerating its internal cache. You can always trigger a cache update on demand, so instead try sanoid --force-update:

root@elden:/# screen -S inotify
root@elden:/# inotifywait -m /var/run/sanoid
Setting up watches.
Watches established.
[now I press ctrl-A, then ctrl-D to leave the screen session]

root@elden:/# sanoid --force-update --verbose
INFO: cache forcibly expired - updating from zfs list.

[now I re-enter the screen session to see what inotifywait has to say]
root@elden:/# screen -dr
/var/run/sanoid/ CREATE sanoid_cacheupdate.lock
/var/run/sanoid/ OPEN sanoid_cacheupdate.lock
/var/run/sanoid/ MODIFY sanoid_cacheupdate.lock
/var/run/sanoid/ CLOSE_WRITE,CLOSE sanoid_cacheupdate.lock
/var/run/sanoid/ OPEN sanoid_cacheupdate.lock
/var/run/sanoid/ ACCESS sanoid_cacheupdate.lock
/var/run/sanoid/ CLOSE_NOWRITE,CLOSE sanoid_cacheupdate.lock
/var/run/sanoid/ DELETE sanoid_cacheupdate.lock

And there you go.
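
Editor's sketch, not part of the original reply: because the failure is intermittent, it may help to leave a watcher running and record only the lockfile events with timestamps, so they can be lined up against the time of the next cron error. The filter below assumes inotifywait's default output format (`<dir> <EVENTS> <file>`, as in the transcript above). The function name `log_lock_events` and the log path in the usage comment are hypothetical.

```shell
#!/bin/bash
# Sketch only: read inotifywait's default output on stdin and print a
# timestamped line for each create/delete/move event on a *.lock file.
# log_lock_events is a hypothetical helper name.
log_lock_events() {
    local dir events file
    while read -r dir events file; do
        # Ignore anything that is not a lockfile (including banner lines).
        [[ $file == *.lock ]] || continue
        case ",$events," in
            *,CREATE,*|*,DELETE,*|*,MOVED_FROM,*|*,MOVED_TO,*)
                printf '%s %s %s%s\n' "$(date '+%F %T')" "$events" "$dir" "$file"
                ;;
        esac
    done
}

# Possible usage (log path is an assumption):
#   inotifywait -m /var/run/sanoid/ | log_lock_events >> /root/sanoid-lock-events.log
```

This only shows when a lockfile appeared or vanished, not which process did it; it narrows the time window rather than naming a culprit.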

SirGeorge · May 3, 2024, 8:26pm

I confirmed that /var/run/sanoid is where the lockfiles are being stored using that --force-update trick.

If I craft a scenario where sanoid --monitor-snapshots should report an error, then I get a valid alert. For example, setting hourly_warn and hourly_crit to 1m:

[other/some-dataset]
        use_template = shortTemplate
        autosnap = no
        hourly_warn = 1m
        hourly_crit = 1m
        daily_warn = 36h
        daily_crit = 48h

Correctly fires off this error:

Error:  Snapshot policy on mymachine. (root <root@mymachine.lan>)
CRIT: other/some-dataset newest hourly snapshot is 10h 8m 52s old (should be < 1m 0s)

And as you said, sanoid --monitor-snapshots is not touching lockfiles.

Think I’ll need to deep-dive more into my notification setup to try and solve why my monitor-snapshots.sh script, which simply runs sanoid --monitor-snapshots inside it, is occasionally tossing out these errors. I have plenty of sendmail 450 errors in /var/log/mail.log to explore…

Cron <root@pve> /root/monitor-snapshots.sh  (root)
ERROR: No valid lockfile found - Did a rogue process or user update or delete it?
sendmail: 450 failed to send notification
Error sending message, child exited 1 ().
Could not send the message.

Appreciate the help here regarding sanoid, it’s an excellent utility!

mercenary_sysadmin · May 3, 2024, 8:33pm

At this point, the best I can give you is confirmation that the lockfile being complained about is Sanoid’s, and not something belonging to mutt:

jrs@elden:~$ grep -i "no valid lockfile" /usr/local/bin/sanoid
		die "ERROR: No valid lockfile found - Did a rogue process or user update or delete it?\n";

But we’re still left with “either something is destroying Sanoid’s lockfiles, or possibly you’re running Sanoid as a non-root user with insufficient privileges in /var/run/sanoid.”
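
Editor's sketch, not part of the original reply: the privileges half of that either/or can be checked directly by trying to create and remove a file in the lock directory as the same user that cron runs the script under. The function name `check_run_dir` and the `.probe.XXXXXX` file name are hypothetical; the default directory is the one discussed in this thread.

```shell
#!/bin/bash
# Sketch only: report whether the current user can create and remove files
# in the lock directory. check_run_dir is a hypothetical helper name.
check_run_dir() {
    local dir=${1:-/var/run/sanoid} probe
    if [[ ! -d $dir ]]; then
        echo "MISSING: $dir does not exist"
        return 1
    fi
    # Try to create a throwaway file; failure means no write access.
    if ! probe=$(mktemp "$dir/.probe.XXXXXX" 2>/dev/null); then
        echo "DENIED: $(id -un) cannot create files in $dir"
        return 2
    fi
    rm -f -- "$probe"
    echo "OK: $(id -un) can create and remove files in $dir"
}

# Possible usage: check_run_dir /var/run/sanoid
```

An `OK` result here would point back toward the other possibility, that something else is removing the lockfile between its creation and the check.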