I have one of those questions where the answer requires extensive benchmarking. I will do it. In the meantime, does anyone have an educated guess about ZFS (on Linux) and BitTorrent in terms of ARC? My workload is seeding torrent data with a random access pattern, and the data on disk far exceeds the RAM available for ARC. There is no L2ARC. I thought it might make sense to disable the ARC altogether and use ZFS’s direct I/O feature. However, this probably also depends on the BitTorrent client implementation. For example, Transmission has a prefetch flag that hints to the OS which data to read ahead. Would that make ZFS’s ARC beneficial for performance? Any thoughts? Thanks!
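For reference, these are the knobs I mean. The dataset name is just a placeholder, and the direct property needs OpenZFS 2.3 or newer as far as I can tell; the Transmission bit is the prefetch-enabled key in settings.json, if I’m reading its docs right:

```sh
# Stop caching this dataset's data (and metadata) in ARC entirely
# "tank/torrents" is only an example name; adjust to your layout
zfs set primarycache=none tank/torrents

# Or, on OpenZFS 2.3+: force direct I/O so reads/writes bypass the ARC
zfs set direct=always tank/torrents

# Transmission's prefetch hint is a client-side setting, in settings.json:
#   "prefetch-enabled": true
```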
For the most part, the ARC is only going to be useful to the extent that you have “hot spots” in your data, since you say the data is wildly in excess of the amount of RAM you have.
In theory, you might conclude that this means the ARC is useless to you. In practice, I suspect you will discover that you’ve got enough “hot spots” in your data to make the ARC very much worthwhile–although you may not be able to predict ahead of time which bits will wind up being the “hottest.”
It makes sense that one of the newly released Linux ISOs would be a “hot spot” because more people want it. However, would the ARC be smart enough to recognize the less-than-1% of the data that can usefully fit in RAM, or would additional tuning be necessary? The ARC is a mix of MRU (most recently used) and MFU (most frequently used) lists. Would biasing the ARC toward MFU improve performance without spending precious RAM on blocks that are randomly accessed and unlikely to be requested again soon?
Upon further review, it appears that ARC can adequately balance MFU and MRU based on past cache hit success ratios. Therefore, it seems that no additional tuning is necessary. Any thoughts? Could this be one more classic example of how tuning can cause more harm than good?
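In case anyone wants to check the same thing on their own box, this is roughly how I looked at the MRU/MFU split and hit ratios before deciding to leave it alone (paths are for ZFS on Linux; field names come from the kernel’s arcstats):

```sh
# Current ARC size, MRU/MFU list sizes, and hit/miss counters
grep -E '^(size|mru_size|mfu_size|mru_hits|mfu_hits|hits|misses) ' \
    /proc/spl/kstat/zfs/arcstats

# Or watch it live: prints ARC stats every 5 seconds
arcstat 5
```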
Yeah, I’d say you called it.
The ARC is about as bright as a simple caching algorithm can be; in my experience, if there is any pattern of hot spots large enough to benefit from caching, the ARC will get that data cached.
About the only place where you can really outperform the ARC is in situations where you’re certain you know which data should or should not be in cache based on much higher-order considerations–for example, if you know for a fact that you don’t have much in the way of hot spots in a massive dataset, you might set primarycache=metadata in order to make sure that’s all that gets cached.
Even in that last example, we would expect the ARC to naturally wind up with mostly metadata in its cache, since metadata is mostly what would get repeated reads. But forcing the issue with the primarycache setting gets it to that condition quicker and more reliably.
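If you ever do want to force the issue, it’s just a per-dataset property and it’s easy to undo (dataset name below is only an example):

```sh
# Keep only metadata in ARC for this dataset (example name)
zfs set primarycache=metadata tank/bigdata

# See what's currently in effect and where it's inherited from
zfs get primarycache tank/bigdata

# Revert to the default (cache both data and metadata)
zfs inherit primarycache tank/bigdata
```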
Thanks! Thanks! Thanks!