Currently I have a daily backup of a MySQL database (nextcloud) stored on a dataset that looks like this:
-rw-r--r-- 1 foo foo 646076573 Feb 14 23:30 nextcloud-sqlbkp_20250214-2330.bak
-rw-r--r-- 1 foo foo 646076563 Feb 15 23:30 nextcloud-sqlbkp_20250215-2330.bak
-rw-r--r-- 1 foo foo 646076572 Feb 16 23:30 nextcloud-sqlbkp_20250216-2330.bak
For the most part the files are about 99% similar (text dumps with not too many changes on a daily basis, produced with /usr/bin/mariadb-dump --single-transaction --default-character-set=utf8mb4 -u foo --password='password' nextcloud > /var/lib/mysql-backup/backups/nextcloud-sqlbkp_$(date +"%Y%m%d-%H%M").bak), so I was wondering whether enabling dedup would improve the storage usage. Currently, with lz4 compression, it looks like:
# du --apparent-size -h .
183G .
# du -hs .
74G .
But it would be nice to squeeze those numbers down if possible.
I’m really scared of enabling dedup, so I thought I’d ask first.
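For a ZFS-side view of the same figures, the dataset’s own accounting also reports the effective lz4 ratio; the dataset name below is assumed from later in the thread:
# zfs get compressratio,used,logicalused tank/nextcloud-db-backup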
The “old” dedup [*] deduplicates on a block-by-block basis and has a reputation for needing gobs of RAM. For this reason I’ve never tried it. In your case, “not too many changes” means the files are not duplicates unless they are large enough that the duplicate contents end up stored on the same block boundaries. [**] You could try it out if you have sufficient resources (like an extra host that could be used for testing).
Text files should compress well and you are using compression so that’s good.
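Since dedup only matches whole records, one quick thing to check is the record size on the backup dataset (the ZFS default is 128K); dataset name assumed as above:
# zfs get recordsize,compression tank/nextcloud-db-backup
Two dumps only dedup where identical bytes land at identical record-aligned offsets, so a single inserted row early in the dump shifts everything after it onto different boundaries.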
[*] There is a newer dedup strategy on the horizon that may work better, but I haven’t looked into it and don’t know if it is in a release yet. Even if available, I’d let others give it a try before I employed it.
[**] Probably less likely when compression is employed.
Simple. Dedup is a per-dataset feature, NOT a per-pool feature, so just enable it on a new, empty dataset, copy your files into it, and see what the result looks like.
If you’re happy with it, keep the new dataset with its dedup setting, and get rid of the old one. If you’re not happy with it, zfs destroy the deduplicated dataset entirely, and that’s that for deduplication, no lingering impact.
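A minimal sketch of that test, assuming the pool is named tank, is mounted under /tank, and the dumps live in the path from the first post:
# zfs create -o dedup=on -o compression=lz4 tank/dbdedup
# rsync -a /var/lib/mysql-backup/backups/ /tank/dbdedup/
# zfs list -o name,used,logicalused tank/dbdedup
If the result isn’t worth it, zfs destroy tank/dbdedup gets rid of the copy and its dedup table entries along with it.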
I enabled it and copied the data, but I’m not 100% sure how to check whether it was effective. Every site I’ve read says the dedup ratio is shown per pool, not per dataset.
# zfs list -o name,used,logicalused tank/nextcloud-db-backup
NAME                       USED  LUSED
tank/nextcloud-db-backup   146G   365G
# zfs list -o name,used,logicalused tank/dbdedup
NAME           USED  LUSED
tank/dbdedup  73.7G   184G
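For the pool-level view those sites describe, the overall ratio and the dedup-table histogram are available from the pool itself (pool name taken from the listings above):
# zpool get dedupratio tank
# zpool status -D tank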
dedupratio is per pool, but if you’ve only enabled it on one dataset… well, you know how much data you have in the pool, and you know how much data you have in that one dataset. It’s a pretty simple pair of operations to correct from there.
dedupratio == logical number of blocks needed to store data / physical blocks used to store data
So you take that pool-wide ratio, divide by the number of blocks logically needed to store all the data in the pool, then multiply by the number of blocks logically needed to store all the data in the dataset you deduplicated. Presto, a dedupratio for the dataset itself.
Alternatively, since you had to copy all that data into the deduplicated dataset FROM a non-deduplicated dataset in the first place… just compare the USED for each dataset!
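A sketch of pulling the raw inputs for that correction in exact bytes (dataset names from the listings above; the pool’s root dataset figure includes all descendants):
# zfs get -Hp -o name,value logicalused tank tank/dbdedup
# zfs get -Hp -o name,value used tank/nextcloud-db-backup tank/dbdedup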
# dedupratio == logical number of blocks needed to store data / physical blocks used to store data
3.44 == logical number of blocks needed to store data / physical blocks used to store data