I have a directory in a dataset that has nearly 1TB of files nested within it. I happen to know that there is over 100GB of duplicate files in it. I started reading up on rmlint to hardlink the duplicates together. But then I thought I’d see what fast dedup could do (it would certainly be easier and less error-prone!).
So, I created a new dataset, set dedup=on, and ran cp -a old-dir new-dataset.
Here’s my question: How can I find the deduplication statistics for my new dataset? I can only find info for the zpool as a whole using zpool list or zdb -DD. The zpool-level info isn’t very useful to me. If we can configure dedup on a dataset level, shouldn’t we be able to view dedup info on a dataset level too?
Am I missing something?
[quote=“hayays, post:1, topic:4468”]
How can I find the deduplication statistics for my new dataset?[/quote]
You can’t, because that’s not how dedup works. Dedup is pool-wide, not per dataset. If you download a copy of Huckleberry Finn and place it in pool/ds1, then cp it to pool/ds2, it will be deduplicated (if dedup is on in both datasets, since only blocks written with dedup enabled get an entry in the dedup table).
But the deduplicated blocks belong to multiple datasets. Neither pool/ds1 nor pool/ds2 has multiple copies of the same block, so neither dataset would show any “deduplication” if you tried to look at it on a per-dataset basis–but there’s only one copy of each block, and each block is used in both datasets.
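If it helps to see that in miniature, here is a toy model in Python. It is only a sketch of the idea, not how ZFS is actually implemented: `ToyPool`, its methods, and the three-block “book” are all made up for illustration, and it assumes dedup is on for both datasets.

```python
import hashlib
from collections import Counter


class ToyPool:
    """Toy model (hypothetical): one dedup table shared by every dataset."""

    def __init__(self):
        self.ddt = Counter()   # pool-wide: block checksum -> reference count
        self.datasets = {}     # dataset name -> checksums it references

    def write(self, dataset, block):
        key = hashlib.sha256(block).hexdigest()
        self.ddt[key] += 1     # a repeated block only bumps the refcount
        self.datasets.setdefault(dataset, []).append(key)

    def pool_dedup_ratio(self):
        # blocks referenced / blocks actually stored, across the whole pool
        return sum(self.ddt.values()) / len(self.ddt)

    def duplicates_within(self, dataset):
        # how many repeats exist if you look at one dataset in isolation
        keys = self.datasets[dataset]
        return len(keys) - len(set(keys))


pool = ToyPool()
book = [b"chapter 1", b"chapter 2", b"chapter 3"]
for ds in ("pool/ds1", "pool/ds2"):
    for block in book:
        pool.write(ds, block)

print(pool.pool_dedup_ratio())               # 2.0
print(pool.duplicates_within("pool/ds1"))    # 0
print(pool.duplicates_within("pool/ds2"))    # 0
```

The pool sees a 2x ratio, but each dataset viewed on its own contains no duplicates at all, so there is nothing meaningful to report per dataset.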
Now imagine how you’d try to track dedup on a per-dataset level anyway, despite the fact that ten datasets might all share a single block, with 0, 2, or 100 additional copies of that block in each dataset it lands in.
That might be doable if you very specifically designed the feature that way from before you ever wrote a single line of code. But as a bolt-on later, it seems unlikely. You’d have to keep and update per-dataset counters, as far as I can tell at least, and every extra write is an extra IOP spent that you don’t have available for something more important.
Your answer makes sense and is what is implied by the docs, but it’s unintuitive to me that dedup is managed on the dataset level, but acts on the pool level. It’s confusing to my little brain, but I’ll accept that bigger brains than mine have designed dedup logically.
I get that! But here’s why: at the moment you save a file, the pool knows exactly which dataset it belongs in. So if dedup is off for that dataset, it’s easy to just not make an entry in the dedup table (or look for pre-existing entries in the dedup table) before writing that block.
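A rough sketch of that write-time decision, again as a toy model rather than real ZFS code. The names `write_block`, `dedup_enabled`, `ddt`, and `allocated` are hypothetical, and the two datasets deliberately write different data so each case can be seen separately.

```python
import hashlib

ddt = {}          # pool-wide table: checksum -> refcount
allocated = []    # every block physically written to the toy pool
dedup_enabled = {"pool/ds1": False, "pool/ds2": True}  # per-dataset setting


def write_block(dataset, data):
    """Return True if the write was satisfied by an existing table entry."""
    if not dedup_enabled[dataset]:
        allocated.append(data)   # plain write: the table is never touched
        return False
    key = hashlib.sha256(data).hexdigest()
    if key in ddt:
        ddt[key] += 1            # duplicate: bump the refcount, write nothing
        return True
    ddt[key] = 1                 # first sighting: write it and record it
    allocated.append(data)
    return False


# Dedup off: both copies land on disk and no table entry is made.
write_block("pool/ds1", b"A")
write_block("pool/ds1", b"A")

# Dedup on: the second copy is just a refcount bump.
write_block("pool/ds2", b"B")
second_was_deduped = write_block("pool/ds2", b"B")

print(len(allocated))         # 3
print(list(ddt.values()))     # [2]
print(second_was_deduped)     # True
```

The dataset setting is checked once, at write time, and costs nothing extra. Nothing in the table records which dataset a reference came from.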
But once the block has already been written, in order to have per-dataset stats you would have to have per-dataset tracking, which would get kinda gnarly. You’d have to update an on-disk DB with every dedup’d TXG.
And you still wouldn’t really be able to have “per-dataset” statistics, because what happens when a block is shared between two datasets, but you query each dataset individually for its dedup stats? There’s only one “extra” copy of the block–but do you credit that to the first dataset that stored that block, or to the second, when it comes to “%dedup” stats?
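To put numbers on the attribution problem, here is a small sketch with one block shared by two datasets. `BLOCK_SIZE`, `writers`, and `credit_to` are made-up names, and the two “policies” are just the two options from the question above, not anything ZFS implements.

```python
BLOCK_SIZE = 128 * 1024  # bytes; arbitrary value for the toy example

# The order in which datasets referenced one identical block
writers = ["pool/ds1", "pool/ds2"]


def credit_to(writers, index):
    """Give all of the space saved on this block to one chosen dataset."""
    saved = {ds: 0 for ds in writers}
    saved[writers[index]] = BLOCK_SIZE * (len(writers) - 1)
    return saved


first = credit_to(writers, 0)    # credit whoever stored the block first
second = credit_to(writers, 1)   # credit whoever stored it second

print(first)    # {'pool/ds1': 131072, 'pool/ds2': 0}
print(second)   # {'pool/ds1': 0, 'pool/ds2': 131072}
print(sum(first.values()) == sum(second.values()))   # True
```

Both policies agree on the pool-wide saving, but they give opposite answers per dataset, and neither one is obviously the “right” one.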
I agree with you that it’s unintuitive at first glance, but once you start thinking about the extra IOPS necessary to separately track this stuff, IMO it starts to make a lot more sense why the DDT is pool-wide even though you can turn dedup on or off on individual datasets. It’s all about what would require extra operations to track separately, vs what doesn’t.