Thanks in advance for any advice!
I have an external ZFS backup pool connected via USB that I use to store Clonezilla images of entire drives (the source drives aren’t ZFS, they’re ext4).
My source drive is 1TB, and my destination pool is 2TB, so storage capacity isn’t an issue. I’d like to optimize for space by doing incremental backups, and initially thought deduplication would be perfect, since I’d be making similar images of the same drive with periodic updates (about once a month). The idea was to keep image files named by their backup date, and rely on deduplication to save space due to the similarity between backups.
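Roughly, the setup I had in mind looks like this (dataset and file names are just placeholders for my actual layout):

```
# enable dedup on the dataset holding the images; compression is already on pool-wide
zfs set dedup=on backup/images

# each run writes a new, date-stamped image into the same dataset, e.g.
#   /backup/images/sda-2025-01.img
#   /backup/images/sda-2025-02.img
# and dedup collapses the blocks that didn't change between runs
```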
I tested this, and it worked quite well.
Now I’m wondering if deduplication is even necessary if I use snapshots. For example, could I take a snapshot before each overwrite, keeping a single image filename and letting ZFS snapshots preserve historical versions automatically? The Clonezilla options I’m using create images that are non-compressed and non-encrypted. I don’t need encryption, and the pool already has compression enabled.
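The snapshot version would be something like this (again, names are placeholders):

```
# keep a single image filename, and snapshot the dataset right before each run
zfs snapshot backup/images@2025-02-before-backup
# ...then let Clonezilla overwrite /backup/images/sda.img as usual...

# older versions of the image remain reachable through the snapshots
zfs list -t snapshot backup/images
```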
Would using snapshots alone be more efficient, or is there still a benefit to deduplication in this workflow? Maybe I should use both together? I’ve got plenty of memory, so dedup’s RAM overhead isn’t a concern. I’d appreciate any advice!
Thanks!
It depends on whether Clonezilla modifies its backup files in place or creates entirely new files with each run. If it patches in place, then yes, you could keep a single image filename and let snapshots preserve the old versions for you; if it writes new files each time, snapshots won’t do you any good.
I don’t know how Clonezilla works internally, but I suspect that, at least by default, it creates entirely new files, for the same reason that rsync (whose internal workings I do understand quite well) defaults to creating new files: patching a file in place is likely to leave you with an inconsistent, effectively unusable file if the patching process is interrupted or crashes mid-run.
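For comparison, rsync will only patch a destination file in place if you explicitly ask it to:

```
# default: rsync builds a temporary copy and renames it over the target,
# so an interrupted run never leaves a half-patched file behind
rsync -a big.img /backup/

# --inplace patches the existing file directly, which is the behavior
# snapshot-based versioning needs in order to be space-efficient
rsync -a --inplace big.img /backup/
```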
If that’s the case, and Clonezilla creates new files rather than patching in place, then dedup might work for you where simple snapshots would not.
Hello Jim, thanks for that. FYI, I cross-posted this question over on the ZFS subreddit, and one response was interesting. I didn’t know about NOP writes, and assuming the author is correct, it makes sense how they would affect snapshotting Clonezilla images made with dd.
The bottom line, of course, seems to be that dedup is the way to go for my use case.
You’re still overwriting the entire file. Even if it’s the same contents, ZFS by default treats it as “new data”, so snapshots wouldn’t help without changing some settings.
There is, however, a possible optimization: if you manually set the checksum type to a strong hash (like SHA256) and have compression enabled, ZFS can perform a “nop-write”. If the checksum of the data being overwritten matches the new incoming data, it skips writing it out, which means the overwrite won’t take any extra space.
It’s basically like dedup, but with one massive limitation: it has no context beyond the current write. So if things move around, or are simply shifted (offset) by a record or two, it won’t work. It has to be the exact same record at the exact same position in the file.
That’s less of an issue with raw (dd-style) disk images, where a 1TB backup is just a 1:1, block-for-block, 1TB data dump, unallocated blocks included. But IIRC Clonezilla performs some optimizations so that it doesn’t actually include every block in the image, even when compression is off, so things could easily shift around and break nop-write.
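For anyone else reading, the settings they’re describing would look something like this on the image dataset (name is a placeholder):

```
# nop-write needs a strong (nopwrite-capable) checksum plus compression
# on the dataset being overwritten
zfs set checksum=sha256 backup/images
zfs set compression=lz4 backup/images   # or whatever compression is already in use
```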