I'd like to know more about managing RAM with ZFS & Linux in general

I just read this post by Jim on the pros and cons of using swap, and had a couple of semi-related questions about managing RAM allocation in general:

  1. Has Jim (or anyone else) done a similarly detailed, yet easy to follow, summary on the pros and cons of tweaking ZFS RAM allocation options (ARC min/max, etc)?
  2. I know FreeBSD has jails, but what do people use to limit the amount of resources (RAM, etc) that a specific application can access on Linux? Is it just systemd containers, or are there other widely-used tools available?
  3. Are there any common mistakes that I should watch out for? Unknown unknowns that a non-professional like myself wouldn’t even know to ask about?

Thanks in advance.

Not really, AFAIK. But I can give you the short version: the default, which limits the ARC to half the system’s physical RAM, is a good guideline for the majority of systems. Unless you know much, much better, it’s usually best to leave that setting in place and add more RAM if your applications need more memory than they currently have.

The exception would be TRULY massively provisioned systems, e.g. 1TiB of RAM. Half the RAM might, repeat might, be overkill in some of those cases, although it frequently still isn’t, because massively provisioned boxes tend to have massive workloads as well.
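For reference, the knob in question on Linux is the zfs_arc_max module parameter; 0 means “use the default,” i.e. that half-of-RAM limit. If you’re curious what’s actually in effect on a given box, something like this will show you, assuming the stock OpenZFS-on-Linux paths:

cat /sys/module/zfs/parameters/zfs_arc_max   # 0 = the default (half of physical RAM)
awk '$1 == "c_max" {printf "%.1f GiB\n", $3 / 1024 / 1024 / 1024}' /proc/spl/kstat/zfs/arcstats   # the cap the ARC is actually honoring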

But in any case, the sign that you might be able to get away with dialing the ARC down some is, essentially, wildly impressive hit rates. Let’s take a look at one of my boxes real quick:

me@elden:~$ egrep "^(hits|misses)" /proc/spl/kstat/zfs/arcstats
hits                            4    734530985
misses                          4    16656794

That’s a 97.7% cache hit rate! Now, on the one hand, that’s awesome! On the other hand, that’s evidence I probably could afford to drop the ARC significantly on this box, and still have good results.
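If you’d rather not do that arithmetic in your head, a quick one-liner over those same two counters does it for you (just a sketch, season to taste):

me@elden:~$ awk '$1 == "hits" {h=$3} $1 == "misses" {m=$3} END {printf "%.2f%% hit rate\n", 100 * h / (h + m)}' /proc/spl/kstat/zfs/arcstats
97.78% hit rate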

me@elden:~$ free -m ; echo ; egrep "^size" /proc/spl/kstat/zfs/arcstats
               total        used        free      shared  buff/cache   available
Mem:           64075       45995       16451        1267        3592       18080
Swap:            499           0         499

size                            4    33634782712

This is a desktop system, and we can see at a glance that it’s using the default I referred to earlier: half the system RAM, which in this case is 32GiB of 64GiB. However, we can also see that the system has nearly 16GiB of entirely free RAM. That isn’t RAM that’s merely been repurposed for cache; that’s entirely unused RAM.
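That size counter is in bytes, by the way. If you want it in friendlier units, something along these lines will do it; note that it comes out a shade under the 32GiB cap, which is normal, since the ARC floats a bit below its maximum rather than sitting pinned to it:

me@elden:~$ awk '$1 == "size" {printf "%.1f GiB\n", $3 / 1024 / 1024 / 1024}' /proc/spl/kstat/zfs/arcstats
31.3 GiB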

In other words, dialing back my ARC on this box wouldn’t help anything, because I’ve already got more RAM than I’m using!

But that’s a humble desktop. Let’s take a look at a production VM host server:

me@[elided]-prod0:~$ egrep "^(size|hits|misses)" /proc/spl/kstat/zfs/arcstats ; echo ; free -m
hits                            4    203033592
misses                          4    926054
size                            4    33281609744

               total        used        free      shared  buff/cache   available
Mem:           64204       58457        4381          16        2094        5747
Swap:           8191           0        8191

We’ve still got an outstanding hit ratio–an even better one, actually. On this server, we’re seeing a gobsmacking 99.54% hit rate! That’s so absurdly high, surely it would be a good idea to free up some of that ARC that’s eating half my physical RAM, right?

Nope, because we’re still looking at a system that isn’t RAM constrained in the first place. Note that the box still has a bit more than 4GiB of free RAM. That’s, again, RAM literally sitting entirely idle: it’s not being used for cache, for applications, or for anything else. So making even more RAM “available” would hurt this box, not help it!

With that said, this is a production VM host, and it’s possible that I might want to add more VMs, which in turn need more virtual RAM allocated to them; that might be a good reason to cut back on arc_max. But there’s still a devil to be found in those details, because adding a VM necessarily means adding more read operations that need caching. Which in turn suggests that, if I can possibly avoid it some other way, I shouldn’t decrease the arc_max allocation on this box at all.


Cheers Jim, much appreciated. Clipping this out and adding it to my reference notes.

You might want to be aware of ARC not accounted as MemAvailable in /proc/meminfo · Issue #10255 · openzfs/zfs · GitHub.

In summary, if an app tries to allocate lots of memory all at once, the ARC can be too slow to free memory and the app can get killed by the OOM killer. I personally saw this when starting VMs with KVM. Proxmox ended up limiting the ARC by default to 10% of physical memory, capped at 16GB.
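If you want to apply a similar cap by hand, the arithmetic is simple enough; here’s a rough sketch of the same policy (not the actual Proxmox code, and it assumes the stock OpenZFS-on-Linux paths):

# cap the ARC at min(10% of physical RAM, 16 GiB)
total=$(awk '/^MemTotal/ {printf "%.0f", $2 * 1024}' /proc/meminfo)
cap=$(( total / 10 ))
limit=$(( 16 * 1024 * 1024 * 1024 ))
[ "$cap" -gt "$limit" ] && cap=$limit
echo "$cap" | sudo tee /sys/module/zfs/parameters/zfs_arc_max   # takes effect immediately
echo "options zfs zfs_arc_max=$cap"   # add this line to /etc/modprobe.d/zfs.conf to make it stick across reboots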

Of course, you’re not going to run into this if your ZFS hosts are dedicated to storage tasks.

Note that FreeBSD doesn’t have this issue, as far as I know. I expect that’s because they don’t have the licensing troubles that come with integrating memory subsystems with the CDDL-licensed ZFS code.

Yet another terrible decision by the Proxmox devs, sigh.

The FAR better answer here is simply not to allocate more than half your RAM to the VMs in the first place. Slashing the ARC size will allow you to run more VMs, sure… as long as you don’t mind them running poorly because your cache hit rate is in the weeds.

This is not unique to ZFS, either. I typically try to reserve about half the physical RAM for filesystem cache with ext4 or NTFS as well, and have for longer than ZFS has existed. The ARC is considerably more effective than an LRU, but that doesn’t mean it needs more RAM than LRU caches do; it simply means it gets more out of the same amount.

Isn’t this workload dependent? I wouldn’t expect VMs doing mostly network IO (say, remote databases or HTTP calls) and CPU-bound work to be as affected by a smaller local disk cache.

My two Proxmox servers are getting 97-99% hit rates on the ARC, even with a cap of 10% of system memory. I know the 97% one can go a bit higher if I raise the ARC to 20% of memory, but it doesn’t translate into a noticeable performance improvement for the apps being run.

I mean… If you don’t have a storage dependent workload, then yeah, you probably don’t care much about storage cache.

I typically try to reserve about half the physical RAM for filesystem cache with ext4 or NTFS, also, and have for longer than ZFS has existed.

Granted, I’m no I.T. pro, but I like to think I’ve read a reasonable amount, and listened to a considerable amount of material on setting up and managing computers; and yet I can’t remember hearing anybody say anything comparable to this, ever. It makes me wonder what other basic knowledge is out there that I don’t know about.

Most people aren’t willing to spend that much on RAM, and have difficulty believing that more RAM will do as much to make their computer faster as a DIRECTLY faster CPU or hard drive or the like. So they allocate more of their budget to the more obvious go-fast stuff, and tend not to pay very much attention to the much less sexy infrastructure that supports it.

I became an IT professional in the 1990s, when solid state drives weren’t even a gleam in the industry’s eye yet, and mechanical hard drives (or even, if you were especially cursed, FLOPPY type drives including Jaz drives and the like!) were by far the biggest bottleneck on a system.

Recognizing that the hard drive was in active use the majority of the time my computer was pissing me the hell off, I experimented with the fastest and most expensive drives I could find… Which made very, very little difference. But I noticed that opening an app the SECOND time was like wizardry. It was not hard to figure out that cache was the reason why.

After that, I began spending a lot more of my personal budget on RAM, and seeing the benefits of doing so. My PCs’ performance was far more consistent than that of any other machines I got the chance to work on.

When I got the chance to begin building and selling PCs to the business I worked for, I built them to the specs I would want for myself… And those folks, who were so non-technical they tended to call the monitor the “computer” and the computer either “the hard drive” or “the modem,” appreciated the way my PCs performed for them the same way I appreciated how they performed for me.

Back then, it was usually just “cram in as much RAM as is literally possible” and it still wasn’t really all that much RAM. These days, it’s not that hard to build systems with VASTLY more RAM than even I think you need… Which is precisely what has given me the chance to discover that “about as much RAM for cache as you give the rest of the OS and applications” is a really effective rule of thumb.

Any more than that is typically wasted. Much less than that typically leaves significant performance on the table, especially when it comes to task latency.

This is, of course, simply a rule of thumb. Different systems have different workloads. But if you aren’t willing to directly test, benchmark, revise, test again, and keep doing so until you thoroughly understand the relationship between your specific workload and your total amount of RAM, it’s a damn good rule of thumb!

Btw, this can be misleading: those VMs are also caching internally, unless you’ve gone to some pains to minimize that behavior. A block the VM has cached internally will never be requested from the ARC, because the guest’s own cache serves it before the request reaches what the VM thinks of as “the metal,” where the ARC would have a chance to deliver it.

Many admins argue that this is better, because it’s slightly less computationally expensive to serve a cache hit from inside the VM than to serve it from the host side.

I argue that it’s better to reduce cache behavior in the VMs and allow the host to manage it: the host has ARC rather than a simple LRU, so it gets much higher hit rates on the same amount of RAM; the host knows which VMs can benefit from more cache, and will dynamically split the cache between VMs according to their workloads rather than blindly allocating this many GiB to foo, that many to bar, and this many again to baz regardless of which one needs it more. And finally, the ARC on the host survives VM reboots… The LRUs inside the VMs do not.

If the Proxmox devs subscribe to that “caching is better done inside the VM” school, they may very well have set primarycache=metadata… Which ensures all caching of data is done inside the VMs, and makes 10% of RAM for ARC on the host look a lot more reasonable, so long as you subscribe to the same (badly mistaken, IMO) school of thought about where storage cache is best served from.
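Easy enough to check, by the way; the pool and dataset names here are just examples, so substitute one of your own:

zfs get primarycache rpool/data/vm-100-disk-0
# and if you decide you'd rather let the host's ARC cache data as well as metadata:
# zfs set primarycache=all rpool/data/vm-100-disk-0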