Playing around with dedup

I have a directory in a dataset that has nearly 1TB of files nested within it. I happen to know that there are over 100GB of duplicate files in it. I started reading up on rmlint to hardlink the duplicates together. But then I thought I’d see what fast dedup could do (it would certainly be easier and less error prone!).
So, I created a new dataset, set dedup=on, and ran cp -a old-dir new-dataset.
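For reference, the steps were roughly the following (the pool and dataset names are placeholders, not my actual layout):

```sh
# create a new dataset with dedup enabled (names are illustrative)
zfs create tank/new-dataset
zfs set dedup=on tank/new-dataset

# copy the existing tree into it, preserving ownership, permissions, and timestamps
cp -a /mnt/tank/old-dataset/old-dir /mnt/tank/new-dataset/
```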

Here’s my question: How can I find the deduplication statistics for my new dataset? I can only find info for the zpool as a whole using zpool list or zdb -DD. The zpool-level info isn’t very useful to me. If we can configure dedup at a dataset level, shouldn’t we be able to view dedup info at a dataset level too?
Am I missing something?
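For context, the pool-level numbers I can see come from commands along these lines (the pool name is a placeholder):

```sh
# pool-wide dedup ratio only; nothing broken down per dataset
zpool list -o name,size,alloc,free,dedupratio tank

# dedup table (DDT) statistics and histogram, again for the whole pool
zdb -DD tank
```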

[quote="hayays, post:1, topic:4468"]
How can I find the deduplication statistics for my new dataset?
[/quote]

You can’t, because that’s not how dedup works. Dedup is pool-wide, not per-dataset. If you download a copy of Huckleberry Finn and place it in pool/ds1, then cp it to pool/ds2, the copy will be deduplicated (provided dedup was enabled on each dataset when the data was written, so the blocks are in the pool’s dedup table).

But the deduplicated blocks belong to multiple datasets. Neither pool/ds1 nor pool/ds2 has multiple copies of the same block, so neither dataset would show any “deduplication” if you tried to look at it on a per-dataset basis; yet there’s only one copy of each block, and that copy is used by both datasets.
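To make that concrete, here’s a minimal sketch (pool, dataset, and file names are made up) of where the numbers do and don’t show up:

```sh
# illustrative only: two datasets on the same pool, both with dedup on
zfs create -o dedup=on tank/ds1
zfs create -o dedup=on tank/ds2

cp huckleberry-finn.txt /mnt/tank/ds1/
cp /mnt/tank/ds1/huckleberry-finn.txt /mnt/tank/ds2/

# the pool-wide dedup ratio moves, but each dataset is still charged
# for its full logical copy, so there is no per-dataset "savings" figure
zpool list -o name,alloc,dedupratio tank
zfs list -o name,used,refer tank/ds1 tank/ds2
```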

Now imagine trying to track dedup at a per-dataset level anyway, when ten datasets might all share a single block, with anywhere from zero to a hundred additional references to that block in each dataset it lands in.

That might be doable if you specifically designed the feature that way before writing a single line of code, but as a bolt-on later it seems unlikely. As far as I can tell, you’d have to keep and update per-dataset counters, and every extra write is an IOP spent that you no longer have available for something more important.