Is that user’s cron environment messed up? Bad shell, maybe?
Place the username (normally root but in your case different …) between the scheduling time stars and /usr/local/bin/syncoid …
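A sketch of what that looks like, assuming this lives in the system crontab (/etc/crontab or a file under /etc/cron.d/, which is the only place the user field exists; a per-user crontab from crontab -e has no user column). The schedule and username here are placeholders:

    # m  h  dom mon dow  user        command
    30   0  *   *   *    backupuser  /usr/local/bin/syncoid --no-privilege-elevation -r serveruser@my-server:my-pool/vm backups/vm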
This may be a typo, but I just noticed you’re missing a forward slash in your syncoid command:
serveruser@my-server:my-pool/vm backups/vm
should probably be
serveruser@my-server:/my-pool/vm backups/vm
Technically, no, it was not a typo. That is how it is in my crontab.
However, it got me rereading and rereading and rereading every detail of both my command and my crontab. This led me to try writing out and running the command exactly as written in the crontab, but via the CLI.
The first thing that came up was an error that the --verbose argument was not recognized, so I removed that from the command and it ran fine from the CLI. Interestingly enough, I am unable to find anything in the GitHub wiki mentioning --verbose as a valid argument.
Next, I added in the full path to syncoid, like it’s written in the crontab. I did this separately, rather than combining it with the change above, to eliminate one potential problem at a time.
This next step resulted in /usr/local/bin/syncoid: command not found. I found this a bit strange, as I had successfully run syncoid --no-privilege-elevation -r serveruser@my-server:my-pool/vm backups/vm. So, I used the handy little whereis syncoid and discovered it is at /usr/local/sbin/syncoid. Changing the path in the CLI, things ran again as expected.
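For anyone else chasing a path mismatch like this, a few standard ways to confirm where a binary actually lives:

    whereis syncoid       # searches the usual bin/sbin locations
    command -v syncoid    # what the shell would actually execute from PATH
    type -a syncoid       # every match on PATH (bash)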
From here, I edited my testing cronjob, which was set to run once a minute, with everything I discovered above plus @bladewdr’s suggestion to include the forward slash. I had to step away from my computer for a few minutes to take care of something. When I came back and checked my logs, they were rapidly filling with errors from my testing cronjob! Now, I was getting somewhere.
The log showed me a warning (which I’ll get to in a second) and an error. The error was ERROR: cannot open 'my-pool/vmmy-pool/vm': no such file or directory. I went back and removed the recommended forward slash and waited a couple of minutes. Checking the logs again, I see Sending incremental my-pool/vm@syncoid_backup_blah_blah_blah! Success!!! The cronjob is now running.
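For anyone wanting to reproduce the test setup, a once-a-minute entry looks roughly like this (per-user crontab shown, so no user field; the log path is just an example, you could equally watch syslog):

    # run every minute while testing; append stdout and stderr to a log
    * * * * * /usr/local/sbin/syncoid --no-privilege-elevation -r serveruser@my-server:my-pool/vm backups/vm >> "$HOME/syncoid-test.log" 2>&1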
The warning was:
WARN: ZFS resume feature not available on source machine - sync will continue without resume support.
Despite the warning, things do run fine. Is this something to be concerned about? I checked the docs and didn’t find much information about that.
Checking zfs list -rt all backups/vm now reveals that all of the incremental backups from source are found on target!
The thought had crossed my mind. However, this was a brand new install of Ubuntu 24.04 about 6 weeks ago, and this was the first cronjob I’ve set up on this box. I also tested things out in both bash and fish and didn’t have any problems in either.
On the one hand, I would mark @bladewdr’s response as the solution, because it is what pointed me in the right direction to track down the issues with the command. On the other hand, I don’t want to mark it as the solution because it actually introduced another error into the command.
I’ve now deleted my test cronjob and edited my main one to run at 00:30, and I’ll verify tomorrow that it ran correctly.
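For reference, the production schedule is just the first two fields changed:

    # 00:30 every day
    30 0 * * * /usr/local/sbin/syncoid --no-privilege-elevation -r serveruser@my-server:my-pool/vm backups/vm >> "$HOME/syncoid.log" 2>&1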
Thanks everyone for the help!
Edited:
So, I had initially written a frustrated reply because things weren’t working. After taking a break and doing my actual job, I came back to it and discovered I’d made a stupid mistake. I misspelled my own pool name. So stupid.
I’m still seeing this:
WARN: ZFS resume feature not available on source machine - sync will continue without resume support.
Is this of any concern?
Means exactly what it says on the tin. The resume feature isn’t supported on the source machine, so if you interrupt a sync in the middle, you’ll lose some progress when you pick it back up again, whereas if resume was supported, it would just pick right back up from where it left off.
You may still enjoy SOME degree of “resumability” even without it; for example if you’re replicating a stream with 24 hourlies in it and you successfully replicate 20 of the 24 hourlies prior to the stream getting interrupted, when you run the sync again you will not have to resync the 20 you already completed–but if you were 99% of the way through replicating the 21st snapshot, you’ll have to start over at 0% of that one, whereas if you had resume you’d pick right up at 20 done and 99% of the 21st done.
The reason resume isn’t supported is simply that the ZFS on that side is too old to offer that feature. Upgrade the ZFS packages on that system, if you want resume support.
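If you want to see what each side is actually running before touching anything, something along these lines works on both boxes (my-pool is a placeholder):

    zfs version                              # userland and kernel module versions
    zpool get all my-pool | grep feature@    # which pool feature flags are enabled/active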
Interesting. The source system is actually newer than the destination. That’s a brand new server that I just built a couple of weeks ago with Ubuntu 24.04, and I imported my pool from my old server. I’ll check packages again, but I thought they were current.
@mercenary_sysadmin Just as a follow-up: I checked both source and destination boxes and both have version 2.2.2-0ubuntu9 installed. I realize that package is from November 2023 and, per GitHub, current is 2.2.4 as of May. I cannot find anything newer in the Ubuntu repos - unless I’m missing something.
Was this pool originally on another system with an older version of zfs by chance? You may need to upgrade the pool itself.
I believe it’s zpool upgrade.
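Roughly, checking and upgrading looks like this (pool name is a placeholder; the usual caveat is that once newer feature flags are enabled, older ZFS releases may no longer be able to import the pool):

    zpool status my-pool    # notes when newer features are available but not enabled
    zpool upgrade           # lists pools that can have features enabled
    zpool upgrade my-pool   # enables every feature flag supported by the installed ZFS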
Thanks for the quick response. Yes, you are absolutely correct. The pool was originally in a system running Ubuntu 18.04 that I stopped using due to memory limitations, and I imported it into this new server.
I’ll see what happens.
Edit: Okay, I checked the source and there were 5 features that were not enabled there but are enabled on the destination machine. I’ve gone ahead and upgraded, so I’ll see whether the warning still shows up.
Thanks for the tip @bladewdr, that helped. I was unaware of that.
I just wanted to offer a quick follow-up. Since I upgraded my source machine per @bladewdr’s suggestion, the errors are gone, and after I spelled the name of my source pool correctly in the pull command on the destination, syncoid has been quietly and effectively pulling backups. Not to mention, at @mercenary_sysadmin’s suggestion, I also have a handy little sync log.
Thanks for all the recommendations, everyone. This is brilliant.
Okay, so today I guess I will be putting my newly learnt ZFS skills to the test: pulling my qcow2 image out of a snapshot and restoring it.
Earlier today, I upgraded one of my VMs on the server discussed in this topic from Ubuntu 22.04 to Ubuntu 24.04. Unbeknownst to me, this ended up borking my Nextcloud installation.
If I understand what was shared above, all that is needed for me to access a snapshot to do said extraction is:
Pool name I know, and snapname is easy enough to figure out from the list. What is dsname?
the name of the dataset in question… but just to be clear, you had to take the snapshot before you broke the VM. If you don’t already HAVE a snapshot you took prior to breaking it, creating one now AFTER you broke it won’t do you any good.
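For the extract-a-file route specifically (as opposed to rolling the whole dataset back), every snapshot is browsable read-only under the dataset’s hidden .zfs/snapshot directory. With placeholder dataset, snapshot, and file names, it’s roughly:

    zfs list -rt snap my-pool/vm          # pick a snapshot taken before the breakage
    ls /my-pool/vm/.zfs/snapshot/         # snapshots appear here, read-only (hidden, but reachable by path)
    cp /my-pool/vm/.zfs/snapshot/SNAPNAME/nextcloud.qcow2 /my-pool/vm/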
Hmmmm, well this could be interesting.
What about the automatic hourlies run by sanoid? Or do those not count? Or the daily from sanoid taken last night?
Edit: I thought that was one of the reasons for setting up sanoid?
Those absolutely count! In which case you’re stopping your VM, doing a zfs list -rt snap mypool/mydataset and identifying a sanoid snapshot taken in the time frame you want to roll back to, then you just zfs rollback -R mypool/mydataset@mysnapshot and restart the VM.
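Strung together, with placeholder VM/dataset/snapshot names and assuming libvirt (swap virsh for whatever actually manages the VM):

    virsh shutdown my-vm                            # stop the VM first
    zfs list -rt snap mypool/mydataset              # pick a sanoid snapshot from before the breakage
    zfs rollback -R mypool/mydataset@mysnapshot     # destroys everything newer than that snapshot
    virsh start my-vm                               # bring the VM back up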
The only thing to warn you about is a rollback is a DESTRUCTIVE operation; you WILL permanently lose all snapshots and data newer than the one you roll back to. So if you’re in doubt, you might want to do something sneaky like an rsync --inplace from a clone of the snapshot onto the current filesystem (again, after shutting down the VM), which will still effectively “roll back” the VM to the condition it was in in the snapshot, but won’t destroy any data snapshotted since then.
The big drawback to rsync’ing from a clone (or directly from a read-only mounted snapshot) instead of just rolling back is that it takes a significant amount of time–can easily be hours, for moderately large images. By contrast, a rollback is <1sec even when it means changing TiB of data on a slow pool.
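A sketch of that non-destructive route, with the VM shut down and the same placeholder names (the clone is temporary and nearly free, and can be destroyed afterwards):

    zfs clone mypool/mydataset@mysnapshot mypool/restore-clone      # writable clone of the snapshot
    rsync -a --inplace /mypool/restore-clone/ /mypool/mydataset/    # overwrite current contents with the snapshot's
    zfs destroy mypool/restore-clone                                # clean up once the VM checks out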
Brilliant! Thank you.
While I saw you replying, I did confirm that I do have access to the Nextcloud data directory. So if all else failed, I could just spin up a new install and transfer the data over. Similar to our previous discussion. Just a base OS and I don’t care about it. Only the data directory.
So, to make sure I’m clear: you’re saying that process rolls back ALL VMs and not just the one in question?
The rsync would definitely be time intensive. This is over 500 GB in Nextcloud.
That depends on whether you placed ALL VMs in the same dataset or not.
Let’s say I’ve got three VMs, stored in pool/images/vm0, pool/images/vm1, and pool/images/vm2. I’ve also got Sanoid running, so I’ve got three monthlies, thirty dailies, and thirty hourlies of each individual dataset.
Now, let’s say vm0 craps the bed. I determine that it crapped the bed on the second of the month, so I decide to roll back to the most recent monthly. Doing so does not affect vm1 or vm2 at all–but I lose all of the 30 hourlies on vm0, as well as all of the dailies that were taken since the monthly that I’m rolling back to.
On the other hand, if I only have one dataset with all three VMs in it (even if they’re in separate directories within that one dataset), I can’t roll them back independently.
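For what it’s worth, splitting them out is just a matter of creating the child datasets and letting sanoid handle them recursively. A rough sketch using the placeholder names above, with retention matching the example (sanoid.conf syntax as documented in the sanoid README):

    zfs create pool/images/vm0
    zfs create pool/images/vm1
    zfs create pool/images/vm2

    # /etc/sanoid/sanoid.conf
    [pool/images]
            use_template = production
            recursive = yes

    [template_production]
            hourly = 30
            daily = 30
            monthly = 3
            autosnap = yes
            autoprune = yes

You’d still need to move each VM’s disk image into its own dataset, which is a one-time copy with the VMs shut down.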
Reading your response brought on one of those moments of loathing one’s own ignorance and stupid questions.
Of course what you say makes sense. And I should have realized this before even asking. I guess I’ll be creating a few new datasets and updating sanoid.
If I understand another topic on here correctly, running syncoid just once should pull all the snapshots, once I implement the changes?
Yes, a default syncoid run will replicate every snapshot present (if a full replication, meaning beginning with an empty target) and every snapshot newer than the most recent common snapshot (in the case of an incremental, meaning the target already exists).
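So after reorganising into per-VM datasets, a single recursive pull from the new parent (placeholder names from the example above) should bring everything across, full the first time and incremental on every run after that:

    syncoid --no-privilege-elevation -r serveruser@my-server:pool/images backups/images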
Thanks @mercenary_sysadmin, the rollback took less than 1 second and everything is restored. Much faster than standing up a new instance and copying over.