ZFS not using cache for reads?

Hi all, this is my first post here and I need some help…

I’ve created a ZFS pool (just for testing) in order to find out the performance characteristics of a certain setup. I created a pool with 1 HDD and later added an SSD for log and cache (ZIL and L2ARC, if I follow that correctly). My SSD is partitioned in two. The first partition is 4 GB and is used as the log device. The other partition is 64 GB and is used for cache.

I’m now testing it on a ZFS subpool with recordsize=8k and compression=lz4. I’m using fio to test 8k random reads and writes. I noticed that writes are much improved, as expected when using an SSD for log, with IOPS reaching ~150k writes/sec. However, my read IOPS are very low, ~200 reads/sec, which is more in line with the HDD itself. Now, since I have a 64 GB cache device on this pool (and I’m doing fio tests with a test file of 4 GB), I was expecting that reads would go through ARC (RAM) and L2ARC (the 64 GB cache device) and that I would also get much improved reads on this setup. But I’m not getting that.

Does anyone have an idea why ?

You haven’t said how much RAM you have, but I suspect it was using the RAM to the best of its ability. The L2ARC is another story entirely. And it’s a long, complicated story that boils down largely to “this doesn’t work the way you think it does, and isn’t half as cool as you think it is” unfortunately.

The short version is that the L2ARC does not hold everything that isn’t in ARC, by a long shot. Nor is it actually an ARC at all. The L2ARC is a dead-simple bone-stupid FIFO cache (not even LRU), which is fed at a heavily throttled maximum rate from blocks which might get evicted soon from ARC.
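To make the FIFO versus LRU distinction concrete, here is a minimal toy simulation. It is not ZFS code and does not model the L2ARC feed thread or the on-disk layout; the function name `simulate`, the access pattern, and the two-slot capacity are all made up for illustration. It only shows that a FIFO cache evicts by insertion order, so a block that keeps getting hit can still be thrown out, while an LRU cache keeps it.

```python
from collections import OrderedDict


def simulate(accesses, capacity, policy):
    """Count cache hits for a sequence of keys under 'fifo' or 'lru' eviction."""
    cache = OrderedDict()  # oldest entry first
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            if policy == "lru":
                # LRU refreshes an entry on every hit; FIFO never does.
                cache.move_to_end(key)
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the oldest entry
            cache[key] = True
    return hits


# Hypothetical access pattern: block "a" is hot, the others are one-offs.
pattern = ["a", "b", "a", "c", "a", "d", "a"]
print("fifo hits:", simulate(pattern, capacity=2, policy="fifo"))  # 2
print("lru hits: ", simulate(pattern, capacity=2, policy="lru"))   # 3
```

In the FIFO run the hot block "a" is evicted when "c" arrives even though it was just read, which is the behaviour described above.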

An fio test isn’t generally going to do a good job feeding the L2ARC. Typically, either the fio job size is small–in which case the ARC will retain the entire working set, and the L2ARC will have nothing to do–or the fio job size will be much larger than ARC, in which case it’ll explosively flow through it without hanging around for long enough to get much of it into the L2ARC. Either way, you’re looking at (very) low hitrate city.
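As a rough sense of scale for the throttled feed, the sketch below computes the minimum time needed to copy a working set into the L2ARC at a fixed feed rate. The 8 MiB per second figure is an assumption: it is the default I recall for the `l2arc_write_max` module parameter in older OpenZFS releases, and your system may differ (on Linux the live value can be read from /sys/module/zfs/parameters/l2arc_write_max). The function name is made up for this example.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def min_seconds_to_feed(bytes_to_cache, feed_bytes_per_sec=8 * MIB):
    """Lower bound on the time to write this much data into the L2ARC.

    feed_bytes_per_sec is an ASSUMED throttle (8 MiB/s); check your own
    l2arc_write_max before trusting the result.
    """
    return bytes_to_cache / feed_bytes_per_sec


# The 4 GiB fio test file from this thread.
print(min_seconds_to_feed(4 * GIB))  # 512.0
```

This is a best case. Only blocks that are still in ARC and close to eviction are eligible to be fed, so a working set that streams through ARC faster than the feed rate will mostly never reach the cache device at all.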

The L2ARC’s job is, essentially, to play cleanup behind the ARC. There is a non-zero chance that blocks recently evicted from ARC will be in the L2ARC, in which case you can maybe pick up another few percent of total cache hitrate by leveraging L2ARC. But you’re pretty much always going to be looking at ratios like >80% for ARC, <10% for L2ARC, at best.
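The effect of those ratios on read performance can be sketched with a simple weighted-average model. Everything numeric here is an assumption chosen for illustration, not a measurement: 1 µs for an ARC hit, 100 µs for an L2ARC hit on an SSD, and 5 ms for a random read from the HDD. The function names are hypothetical as well.

```python
def average_read_latency_us(arc_hit, l2arc_hit,
                            arc_us=1.0, l2arc_us=100.0, disk_us=5000.0):
    """Expected latency of one read when each tier serves a fixed share.

    arc_hit and l2arc_hit are fractions of all reads (0.0 to 1.0);
    whatever is left over goes to the disk. Latencies are ASSUMED values.
    """
    disk_share = 1.0 - arc_hit - l2arc_hit
    return arc_hit * arc_us + l2arc_hit * l2arc_us + disk_share * disk_us


def iops_at_queue_depth_one(latency_us):
    """Reads per second if exactly one read is in flight at a time."""
    return 1_000_000 / latency_us


for label, arc_hit, l2arc_hit in [
    ("everything misses", 0.0, 0.0),
    ("ARC 80%, no L2ARC", 0.8, 0.0),
    ("ARC 80%, L2ARC 10%", 0.8, 0.1),
]:
    latency = average_read_latency_us(arc_hit, l2arc_hit)
    iops = iops_at_queue_depth_one(latency)
    print(f"{label}: {latency:.0f} us per read, {iops:.0f} IOPS")
```

With these assumed latencies, the all-miss case works out to 200 IOPS, and the total is dominated by whatever share of reads still has to go to the HDD. That is why a few extra percent of L2ARC hits helps, but cannot rescue a workload that mostly misses both caches.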

I strongly recommend reading through the L2ARC explainer I wrote for Klara Systems, if you’ve still got questions: OpenZFS: All about the cache vdev or L2ARC | Klara Inc

hoping to get some benefit from reads

Reads are a bit of a complicated story under ZFS. Because it’s copy-on-write, ZFS generally greatly reduces effective fragmentation in most real-world workloads: odds are pretty good you’ll want to read things in roughly similar large groups to the way you wrote them, and since it’s copy-on-write, those groups get written mostly sequentially, which means fewer seeks when reading them back in a similar order.

But fio isn’t a real-world workload; it’s an entirely randomized (assuming you’re using randrw, randread, and randwrite, which you should) workload which will tend to minimize the beneficial impact of copy-on-write data ordering significantly as opposed to real-world workloads, because there is essentially no chance the working set will be read back from disk in anything faintly like similar patterns as the way it was originally written.

So, when you’re using fio, you’re not getting the typical real-world benefit out of either ARC or L2ARC, and also getting bitten much harder by seeks than you normally would. In the L2ARC’s case, that usually doesn’t matter much, because it’s such a niche utility vdev in real life also. In the ARC’s case, it can be pretty badly misleading, especially for non-database workloads. (But even databases will frequently tend to read rows in similar patterns as the way those rows were written, making even that heavily random-access workload benefit more from both copy-on-write and the ARC itself in the real world than in a fully-randomized synthetic test).

It gets even more complicated if you want to talk about reads specifically in the context of RAIDz vs conventional RAID, but this is probably enough to chew on for the moment.

Hi Jim! And thanks a lot for the time you’ve taken for this answer.
Reading through your answer and the link you’ve provided, it’s become clearer to me how ARC and L2ARC work, and how different they are in their implementation.

Yes it’s true I’ve been thinking about L2ARC as something that’s evicted from ARC, but in rather simplistic terms.

About fio and my tests: the tests were done with an 8K recordsize, because my roughest workload is postgres and I test for that. I expect to see a lot of random reads there, because… multi-tenant and all using the DB in parallel. So quite a specific workload actually, where I expect crazy random reads. Well, actually, it was never planned for that DB to run on anything but pure SSD - but I had the chance and I wanted to see how ZFS would cope in an “HDD with SSD accelerated” combination. And when I was going for a test I just used my previous fio command, the one I used for postgres tests. In reality, postgres will not run on such a setup, so maybe I should have tested for some generic VM workload instead (a much more likely use case for such an HDD + SSD setup).

On this test PC I have just 4 GB of RAM - that’s the total, so that’s for both the OS (Proxmox 8) and ZFS. From your post I learned that L2ARC eats RAM for its index, and I quickly calculated that with my 64 GB cache and 8K recordsize I could be spending 560 MB of RAM just on the L2ARC index. Knowing that ZFS would by default limit itself to using half of the available RAM, and deducting the index from that, I end up with some 1 GB+ for ARC tops. My test file is 4 GB, so that also explains a few things.
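The arithmetic behind that estimate can be written out as a short sketch. The 70 bytes per cached record is an assumption: it is the per-record header size that reproduces the 560 MB figure in this post, and the real overhead depends on the OpenZFS version. The function name is made up, and the model ignores compression and everything else competing for ARC space.

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3


def l2arc_header_bytes(l2arc_bytes, record_bytes, header_bytes=70):
    """RAM needed to index a completely full L2ARC.

    header_bytes is an ASSUMED per-record cost (70 bytes), not a value
    read from ZFS.
    """
    records = l2arc_bytes // record_bytes
    return records * header_bytes


index = l2arc_header_bytes(64 * GIB, 8 * KIB)
arc_limit = (4 * GIB) // 2          # half of 4 GiB RAM, as stated in the post
left_for_data = arc_limit - index   # what remains for cached data blocks

print(f"L2ARC index:      {index / MIB:.0f} MiB")
print(f"left for ARC:     {left_for_data / MIB:.0f} MiB")
print(f"share of 4 GiB test file that could fit: {left_for_data / (4 * GIB):.0%}")
```

Under these assumptions the index comes to 560 MiB, and well under half of the 4 GiB test file can sit in ARC even in the best case, so most uniformly random reads have to go somewhere slower.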

Some lessons I learned:

  • I finally understand the ever-present “give ZFS more RAM” mantra, and why it exists
  • generally forget L2ARC (unless really specific workload + I can prove with arc_stats that I could gain some benefit by trashing SSD storage to it)
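One way to check whether an L2ARC is earning its keep is to compare ARC and L2ARC hit ratios from the kernel's arcstats counters. The sketch below parses text in the layout of /proc/spl/kstat/zfs/arcstats on Linux and uses the `hits`, `misses`, `l2_hits` and `l2_misses` counters. The sample text and its numbers are invented for illustration, and the helper names are hypothetical.

```python
# Made-up sample in the arcstats layout: name, type, value.
SAMPLE_ARCSTATS = """\
13 1 0x01 123 33456 1000000 2000000
name                            type data
hits                            4    800
misses                          4    200
l2_hits                         4    20
l2_misses                       4    180
"""


def parse_arcstats(text):
    """Return {counter_name: int} for every 'name type value' line."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Header lines are skipped: wrong field count or non-numeric value.
        if len(fields) == 3 and fields[2].isdigit():
            stats[fields[0]] = int(fields[2])
    return stats


def ratio(hits, misses):
    """Hit ratio, or 0.0 when there has been no traffic at all."""
    total = hits + misses
    return hits / total if total else 0.0


def hit_ratios(stats):
    """ARC and L2ARC hit ratios from parsed counters."""
    return {
        "arc": ratio(stats.get("hits", 0), stats.get("misses", 0)),
        "l2arc": ratio(stats.get("l2_hits", 0), stats.get("l2_misses", 0)),
    }


ratios = hit_ratios(parse_arcstats(SAMPLE_ARCSTATS))
print(f"ARC hit ratio:   {ratios['arc']:.1%}")
print(f"L2ARC hit ratio: {ratios['l2arc']:.1%}")
```

On a real system, replace the sample with the contents of /proc/spl/kstat/zfs/arcstats. The counters are cumulative since module load, so comparing two snapshots taken before and after a workload gives a more honest picture than a single reading.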

Basically I went for L2ARC without any justification. I believe I was seeing it as something like “hey, this looks like a hybrid drive, only RAID-capable and implemented in software → let’s see how well that works”. Wrong expectations, I guess.

Thank you !
Got any more good links for me ? Need to learn more.

I don’t have an answer for you, but I learned a few things researching your question that perhaps could be helpful.

  1. ZFS caches writes in the ARC and L2ARC to accelerate future reads in addition to caching reads. This was actually surprising to me. I assumed blocks are only cached by the ARC and L2ARC when they are read.

  2. It’s possible to turn on persistent L2ARC, but by default it is not persistent because it increases boot times.

  3. arc_summary provides a lot of useful information about the state of the ARC, L2ARC, and ZIL, including how much data is currently cached.

  4. To clear the ARC without exporting and re-importing your pool, you first need to change a default setting: echo 0 | sudo tee /sys/module/zfs/parameters/zfs_arc_shrinker_limit (To restore the default setting later, do: echo 10000 | sudo tee /sys/module/zfs/parameters/zfs_arc_shrinker_limit.) With zfs_arc_shrinker_limit set to 0, you can then clear the ARC with echo 3 | sudo tee /proc/sys/vm/drop_caches. (Note that it will still contain a few MBs.)

  5. To ensure a file is in the ARC, you can just cat file > /dev/null. Getting a specific file into the L2ARC is not so straightforward. Blocks will only be written to the L2ARC when they are evicted from the ARC, which will not happen until it fills up.

I was able to show the effectiveness of the ARC in the following terminal session:

# Create a 2GB file of random data
# I'm not sure why, but if I use `/dev/zero`, clearing
# ARC does not result in any slow down.
➜ dd if=/dev/random of=rand bs=4k count=500000

➜ echo 0 | sudo tee /sys/module/zfs/parameters/zfs_arc_shrinker_limit    
[sudo] password: 

# With rand in the ARC, it takes 0.441 seconds
➜ time cat rand > /dev/null
cat rand > /dev/null  0.03s user 0.41s system 99% cpu 0.441 total

# Clear the ARC
➜ arc_summary | grep "^ARC size"
ARC size (current):                                    17.6 %    2.1 GiB

➜ echo 3 | sudo tee /proc/sys/vm/drop_caches                         

➜ arc_summary | grep "^ARC size"            
ARC size (current):                                     5.8 %  692.0 MiB

# Without the ARC, it takes 7.941 seconds
➜ time cat rand > /dev/null                 
cat rand > /dev/null  0.02s user 0.78s system 10% cpu 7.941 total

# rand is back in the ARC
➜ arc_summary | grep "^ARC size"
ARC size (current):                                    22.5 %    2.6 GiB

# Restore zfs_arc_shrinker_limit to the default value
➜ echo 10000 | sudo tee /sys/module/zfs/parameters/zfs_arc_shrinker_limit    
[sudo] password: 

➜ rm rand

This is a very bad idea. Latency is absolutely critical for the ZIL (which is on your LOG vdev), and you’re going to drown it in L2ARC activity by placing the CACHE on the same physical drive as the LOG.

An L2ARC that small is also vanishingly unlikely to be useful in the first place. Hit rates are extremely low on L2ARC, because 1. it isn’t actually “ARC” at all and 2. it’s only there to clean up what the proper ARC missed, which isn’t often very much. If you can’t implement a large L2ARC, there’s usually no reason to implement one at all.

Thank you for your answer.
I use fio to test: fio first creates a temp file (4 GB in size, in my case) and then begins a random read test on that file. Since I’m not doing anything else on that computer, the test file should already be in ARC. Or, if it’s too big for ARC, it should then be in L2ARC. So reads should happen either from RAM or from SSD. But I only get about 200 read IOPS. That doesn’t make any sense to me. It’s as if ZFS is not working as described.

Is it because I am doing sync reads maybe ?

Thanks for arc_summary I will check it out.

Please understand that I’m doing this for a test.
When I’m doing a read test with fio, there are no writes on the pool. So there is no latency thrashing on the SSD. This is a controlled, synthetic test. Also, my file is 4 GB and my cache is 64 GB. Why would that be too small?

All in all this doesn’t answer the question: why is this ZFS setup so slow in read tests? I was expecting ZFS to use RAM and SSD cache for reads as well. But it seems it doesn’t do it. I came here to ask if anyone knows why that is. There’s plenty of generic advice on the Internet - I’m not looking for that.

I am doing this test to find out what ZFS does and what it doesn’t do. Sync writes are excellent. I was hoping to get some benefit with reads as well. Then I would move to a more realistic test and perhaps I would find another bottleneck, perhaps also the one you are describing - but I’m not there yet. With 200 read IOPS and without understanding why that is happening, I have no plans to expand my testing to a more complicated workload, as that would only make things much more difficult to understand. So far, I still have no idea why my synthetic read tests are so bad.