Does anyone know how to get a count of how many records on a volume are 1M vs 128K vs 64K etc? It would be nice to know to help calculate dedup efficiency.
Try zdb -DDD zpool
This is likely to take a long time depending on the size of your pool.
You can start with a single D and then go all the way to (I think) 5 Ds.
I too had this problem.
I received a data stream with a 1M record size, but because of the delegated permissions of the receiving user (I cannot use root over ssh) the receive was unable to set the record size to 1M and fell back to the default. My guess was that the data had transferred at the 1M size and only the recordsize property of the new dataset had failed to be set because of the permissions issue. I went off to research how to find the current record size, but with limited time it felt too difficult.
zdb
— display ZFS storage pool debugging and consistency information
And the D flags
-DD
Display a histogram of deduplication statistics, showing the allocated (physically present on disk) and referenced (logically referenced in the pool) block counts and sizes by reference count.
-DDD
Display the statistics independently for each deduplication table.
-DDDD
Dump the contents of the deduplication tables describing duplicate blocks.
-DDDDD
Also dump the contents of the deduplication tables describing unique blocks.
The zdb -DD <pool> command doesn't have the info I'm after:
> zdb -DD poolb
DDT-sha256-zap-duplicate: 477751 entries, size 449 on disk, 145 in core
DDT-sha256-zap-unique: 1106910 entries, size 461 on disk, 148 in core
DDT histogram (aggregated over all DDTs):
bucket allocated referenced
______ ______________________________ ______________________________
refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE
------ ------ ----- ----- ----- ------ ----- ----- -----
1 1.06M 987G 932G 932G 1.06M 987G 932G 932G
2 355K 316G 276G 276G 776K 689G 599G 599G
4 95.8K 74.4G 66.0G 66.0G 468K 355G 313G 313G
8 13.0K 9.33G 8.00G 8.01G 126K 88.9G 76.4G 76.5G
16 2.35K 611M 472M 474M 45.5K 11.7G 8.91G 8.95G
32 161 50.3M 36.4M 36.5M 6.62K 2.17G 1.55G 1.55G
64 25 5.36M 2.73M 2.77M 1.97K 369M 191M 194M
128 4 988K 178K 188K 725 145M 26.4M 28.3M
256 6 2.02M 2.02M 2.03M 2.15K 663M 663M 666M
512 1 512B 512B 4K 526 263K 263K 2.05M
1K 2 1K 1K 8K 2.10K 1.05M 1.05M 8.38M
2K 1 512B 512B 4K 2.89K 1.44M 1.44M 11.6M
Total 1.51M 1.36T 1.25T 1.25T 2.45M 2.09T 1.89T 1.89T
dedup = 1.51, compress = 1.11, copies = 1.00, dedup * compress / copies = 1.67
It’s good info on how many blocks are dedup’d, but doesn’t tell me anything about the size of the blocks.
Ok, I wrote a little script to decode the lines that 4 D's or 5 D's output, lines like this:
index 102a4a57d9721 refcnt 2 single DVA[0]=<0:74ade79000:100000> [L0 deduplicated block] sha256 uncompressed unencrypted LE contiguous dedup single size=100000L/100000P birth=26733L/26733P fill=1 cksum=2a4a57d97213bec:fa87b2e2d221d020:e76669454a546032:8b9662771d79f00d
In this line I believe the size is represented as size=<before compression in hex>L/<after compression in hex>P, this example block being 0x100000 bytes (aka 1MiB) in size both before and after, as it's uncompressed.
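A quick way to double-check that hex arithmetic in Python:

# Sanity check: the "size=100000L/100000P" field is hex bytes
logical = int("100000", 16)   # logical (L) size before compression
physical = int("100000", 16)  # physical (P) size after compression
print(logical, physical)      # 1048576 1048576 -> both exactly 1 MiB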
My understanding is that with 4 D's there is one line for each duplicated block in the dedup table, and with 5 D's one line for every block (unique ones included), but I'm not actually 100% sure of this, so don't trust my guess. In any case, the output for my mostly deduped zpool with 5 D's is (trimmed for length):
b'a8200L' 8
b'df600L' 8
b'bb000L' 8
b'fda00L' 9
b'dbe00L' 9
b'a0800L' 9
...
b'1000L' 2609
b'e00L' 2719
b'c00L' 3543
b'800L' 4114
b'a00L' 4470
b'600L' 5158
b'400L' 6046
b'200L' 11254
b'100000L' 1404995
As you can see, 0x100000 (1MiB) blocks are by far the most common, with 1.4 million of them, which is what I would expect as this pool is mostly pictures and videos. Next most common are all the small block sizes, like 0x200 (512 bytes), 0x400 (1024 bytes), etc.
Script is:
#!/usr/bin/env python3
import subprocess, re, collections

# Example of a zdb -DDDDD line this script parses:
# index 102ca703768b3 refcnt 3 single DVA[0]=<0:6002d00000:100000> [L0 deduplicated block] sha256 uncompressed unencrypted LE contiguous dedup single size=100000L/100000P birth=956L/956P fill=1 cksum=2ca703768b310af:bd95efb760577eb1:9fbd3ffae69d2b0e:d18504de45e66b20

# Stream the dedup table dump rather than buffering it all in memory.
p = subprocess.Popen(['zdb', '-DDDDD', 'poolb'], stdout=subprocess.PIPE)
c = collections.Counter()
for i, line in enumerate(p.stdout):
    # if i > 1000: break  # uncomment to test on a small sample first
    # Pull the logical (L) and physical (P) sizes out of "size=<L>/<P>".
    if m := re.match(rb'^index.*size=([^/]*)/(\S*).*', line):
        c[m[1]] += 1  # count blocks by their logical size

# Print the least common sizes first.
for k, v in sorted(c.items(), key=(lambda i: i[1])):
    print(f'{str(k):8}\t{v}')
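As an aside, if you'd rather see plain byte counts than the raw b'…L' keys, the final loop could (I think) be swapped for something like this untested sketch:

# Hypothetical tweak: decode a key like b'100000L' into a byte count
def decode_size(key: bytes) -> int:
    # e.g. b'100000L' -> 1048576 (1 MiB), b'200L' -> 512
    return int(key.rstrip(b'L'), 16)

for k, v in sorted(c.items(), key=(lambda i: i[1])):
    print(f'{decode_size(k):>10}\t{v}')   # e.g. 1048576  1404995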
Hmmm… my bad (I think) - I suspect I got "side-tracked" by your mentioning dedup and so jumped straight into the dedup tables and their dump (-DD…).
Try zdb -Lbbbs zpool
(I just ran it quite quickly against a pretty empty pool) and it does give you the block size histogram which is probably exactly what you need.
With a little scripting you should be able to derive what you need from the dump.
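For example, something along these lines (a rough, untested sketch, assuming the section is introduced by a "Block Size Histogram" heading and ends at the next blank line, with 'poolb' as a placeholder pool name) would pull just that section out:

#!/usr/bin/env python3
# Rough sketch: print only the "Block Size Histogram" section of zdb -Lbbbs output.
import subprocess

p = subprocess.Popen(['zdb', '-Lbbbs', 'poolb'], stdout=subprocess.PIPE, text=True)
in_histogram = False
for line in p.stdout:
    if 'Block Size Histogram' in line:
        in_histogram = True          # start of the section we want
    elif in_histogram and not line.strip():
        break                        # blank line ends the section
    if in_histogram:
        print(line, end='')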
Also, just as an aside, this will/should(?) also help you estimate/calculate the metadata usage in the pool in case you want to use a special vdev.
I hope this time it really helps you.
That has worked great, thanks!
The block size histogram has exactly the info I was after.
Block Size Histogram
block psize lsize asize
size Count Size Cum. Count Size Cum. Count Size Cum.
512: 44.2K 22.1M 22.1M 44.2K 22.1M 22.1M 0 0 0
1K: 34.2K 41.3M 63.3M 34.2K 41.3M 63.3M 0 0 0
2K: 34.1K 91.6M 155M 34.1K 91.6M 155M 0 0 0
4K: 418K 1.66G 1.81G 98.7K 451M 606M 132K 529M 529M
8K: 82.6K 840M 2.63G 46.0K 521M 1.10G 464K 4.01G 4.53G
16K: 54.8K 1.17G 3.80G 127K 2.29G 3.39G 66.9K 1.33G 5.86G
32K: 53.9K 2.40G 6.20G 42.0K 1.88G 5.27G 56.0K 2.43G 8.28G
64K: 64.0K 5.75G 12.0G 42.5K 3.82G 9.09G 65.7K 5.85G 14.1G
128K: 88.3K 16.2G 28.2G 302K 40.3G 49.4G 89.0K 16.3G 30.4G
256K: 165K 61.4G 89.5G 56.7K 21.1G 70.5G 165K 61.5G 91.9G
512K: 194K 133G 223G 49.1K 35.5G 106G 194K 133G 225G
1M: 1.69M 1.69T 1.91T 2.04M 2.04T 2.14T 1.69M 1.69T 1.91T
2M: 0 0 1.91T 0 0 2.14T 0 0 1.91T
4M: 0 0 1.91T 0 0 2.14T 0 0 1.91T
8M: 0 0 1.91T 0 0 2.14T 0 0 1.91T
16M: 0 0 1.91T 0 0 2.14T 0 0 1.91T
As expected, most of my blocks (2.04M before dedup, 1.69M after) are 1M in size, with then a scattering of other sizes.