Empirically determining optimal recordsize

I’m relatively new to ZFS; I’m familiar with the basics but haven’t used it for much more than my home lab. From what I’ve seen, recordsize is one of the more important tunables on a dataset, especially for workloads such as databases or VMs. The general wisdom seems to be to match the typical size of your IO operations, which makes sense.

The problem I’m facing is finding that typical size for an arbitrary workload. There are some resources on good recordsizes for various popular workloads, but there are plenty of workloads for which there is no information. Generally, it looks like one needs intimate knowledge of a workload’s implementation to determine the optimal recordsize, which isn’t always possible.

In my specific use case, I’m looking to store InfluxDB data on ZFS, but I cannot seem to find much information on what typical IO looks like for this workload. I could (and probably still will) talk to InfluxDB experts and devs and look at the source code to see if I can determine a typical write size. That’s a significant amount of effort, and I might still get it wrong, not to mention that some workloads are opaque and closed-source enough that investigating their write characteristics is not an option.

I’m wondering if there’s a more generic, empirical way to determine the typical IO size for a workload, given that you can already deploy it. Is there some program or filesystem utility (ZFS or otherwise) that logs and profiles the size of IO operations on a dataset/filesystem, so I can figure out the recordsize from a real usage scenario?

I can see the value in deploying some workloads on untuned ZFS, or on a non-ZFS storage solution, until I can determine the right recordsize, and then migrating them to ZFS with decent confidence that the recordsize is tuned correctly.


You could run “strace -p <influxdb-pid>” interactively or send its output to a file. Look for read or write at the beginning of each line; at the end of the line you see the I/O size, e.g. “read … = 65536” means a 64K file read request.
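To turn that into numbers, here is a minimal sketch of capturing and summarizing the request sizes; the “influxd” process name, the syscall list, and the log file name are assumptions, so adjust them for your setup:

```
# Attach to the running daemon (assuming the binary is called "influxd") and
# log file I/O syscalls; stop with Ctrl-C after a representative period of
# real traffic.
sudo strace -f -p "$(pgrep -o influxd)" \
    -e trace=read,write,pread64,pwrite64 -o io.log

# Tally how often each request size occurs: the number after "= " in the
# strace output is the byte count actually transferred.
awk -F'= ' '/= [0-9]+$/ { print $NF }' io.log | sort -n | uniq -c | sort -rn | head
```

The most common sizes in that tally are a reasonable starting point for recordsize, though keep in mind strace shows syscall sizes, not what actually hits the pool after caching and aggregation.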
A dataset with its recordsize tuned for a database is a good thing, but for a mixed file and application dataset, like a user home or project directory, the default recordsize is still best. If you have a directory with lots of subdirectories filled with files of different sizes and pick a smaller-than-default recordsize, e.g. 64K, you will get roughly 10% better metadata performance (find, du, ls -la) but 20% lower read throughput (cat, dd) on big files; conversely, if you choose a bigger recordsize like 1M, you will see roughly 10% better read throughput on big files but lose 20% of metadata performance. So in a mixed file environment the default recordsize is mostly the best compromise.
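If you do tune it, a dedicated dataset keeps the change contained; the pool and dataset names and the 64K value below are just placeholders:

```
# Create a separate dataset for the database data with a pinned recordsize.
zfs create -o recordsize=64K tank/influxdb

# Verify the setting; note that recordsize only affects newly written blocks,
# so existing files must be rewritten (copied) to pick up a new value.
zfs get recordsize tank/influxdb
```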
