Currently I have a daily backup of a MySQL database (nextcloud) stored on a dataset that looks like this:
-rw-r--r-- 1 foo foo 646076573 Feb 14 23:30 nextcloud-sqlbkp_20250214-2330.bak
-rw-r--r-- 1 foo foo 646076563 Feb 15 23:30 nextcloud-sqlbkp_20250215-2330.bak
-rw-r--r-- 1 foo foo 646076572 Feb 16 23:30 nextcloud-sqlbkp_20250216-2330.bak
For the most part the files are about 99% similar (text dumps with not too many changes on a daily basis, produced with /usr/bin/mariadb-dump --single-transaction --default-character-set=utf8mb4 -u foo --password='password' nextcloud > /var/lib/mysql-backup/backups/nextcloud-sqlbkp_$(date +"%Y%m%d-%H%M").bak), so I was wondering whether enabling dedup would improve the storage usage. Currently, with lz4 compression, it looks like:
# du --apparent-size -h .
183G .
# du -hs .
74G .
But it would be nice to squeeze those numbers down if possible.
I’m really scared of enabling dedup, so I thought I’d ask first.
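For a ZFS-side view of the same figures, the dataset’s own accounting also reports the effective lz4 ratio; the dataset name below is assumed from later in the thread:
# zfs get compressratio,used,logicalused tank/nextcloud-db-backup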
The “old” dedup [*] deduplicates on a block-by-block basis and has a reputation for needing gobs of RAM. For this reason I’ve never tried it. In your case, “not too many changes” means the files are not duplicates unless they are large enough that the duplicate contents end up stored on the same block boundaries. [**] You could try it out if you have sufficient resources (like an extra host that could be used for testing).
Text files should compress well and you are using compression so that’s good.
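Since dedup only matches whole records, one quick thing to check is the record size on the backup dataset (the ZFS default is 128K); dataset name assumed as above:
# zfs get recordsize,compression tank/nextcloud-db-backup
Two dumps only dedup where identical bytes land at identical record-aligned offsets, so a single inserted row early in the dump shifts everything after it onto different boundaries.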
[*] There is a newer dedup strategy on the horizon that may work better, but I haven’t looked into it and don’t know if it is in a release yet. Even if available, I’d let others give it a try before I employed it.
[**] Probably less likely when compression is employed.
Simple. Dedup is a per-dataset feature, NOT a per-pool feature, so just enable it on a new, empty dataset, copy your files into it, and see what the result looks like.
If you’re happy with it, keep the new dataset with its dedup setting, and get rid of the old one. If you’re not happy with it, zfs destroy the deduplicated dataset entirely, and that’s that for deduplication, no lingering impact.
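A minimal sketch of that test, assuming the pool is named tank, is mounted under /tank, and the dumps live in the path from the first post:
# zfs create -o dedup=on -o compression=lz4 tank/dbdedup
# rsync -a /var/lib/mysql-backup/backups/ /tank/dbdedup/
# zfs list -o name,used,logicalused tank/dbdedup
If the result isn’t worth it, zfs destroy tank/dbdedup gets rid of the copy and its dedup table entries along with it.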
I enabled it and copied the data, but I’m not 100% sure how to check whether it was effective. Every site I’ve read says the dedup ratio is shown per pool, not per dataset.
# zfs list -o name,used,logicalused tank/nextcloud-db-backup
NAME                       USED  LUSED
tank/nextcloud-db-backup   146G   365G
# zfs list -o name,used,logicalused tank/dbdedup
NAME           USED  LUSED
tank/dbdedup  73.7G   184G
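For the pool-level view those sites describe, the overall ratio and the dedup-table histogram are available from the pool itself (pool name taken from the listings above):
# zpool get dedupratio tank
# zpool status -D tank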
dedupratio is per pool, but if you’ve only enabled it on one dataset… well, you know how much data you have in the pool, and you know how much data you have in that one dataset. It’s a pretty simple pair of operations to correct from there.
dedupratio == logical number of blocks needed to store data / physical blocks used to store data
So you take that pool-wide ratio, divide by the number of blocks logically needed to store all the data in the pool, then multiply by the number of blocks logically needed to store all the data in the dataset you deduplicated. Presto, a dedupratio for the dataset itself.
Alternatively, since you had to copy all that data into the deduplicated dataset FROM a non-deduplicated dataset in the first place… just compare the USED for each dataset!
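A sketch of pulling the raw inputs for that correction in exact bytes (dataset names from the listings above; the pool’s root dataset figure includes all descendants):
# zfs get -Hp -o name,value logicalused tank tank/dbdedup
# zfs get -Hp -o name,value used tank/nextcloud-db-backup tank/dbdedup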
# dedupratio == logical number of blocks needed to store data / physical blocks used to store data
3.44 == logical number of blocks needed to store data / physical blocks used to store data