ZFS Performance Tuning for BitTorrent

I would like to know what the best ZFS settings are for a large collection of Linux ISO torrents.

  • Set ashift=12
  • Set recordsize=16k
  • Set compression=lz4
  • Set atime=off
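
For reference, here’s roughly how I would apply these (pool, dataset, and device names below are just placeholders; ashift is a vdev property and has to be set at creation time):

    # ashift can't be changed later; set it when creating the pool/vdev
    # (use /dev/disk/by-id paths for real hardware)
    zpool create -o ashift=12 tank raidz2 sda sdb sdc sdd sde sdf sdg sdh

    # the rest are ordinary dataset properties:
    zfs create tank/torrents
    zfs set recordsize=16k tank/torrents
    zfs set compression=lz4 tank/torrents
    zfs set atime=off tank/torrents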

General questions I have about the setup:

  1. Does compression make sense for Linux ISOs and the Big Buck Bunny sample video (I love saving that one!)? I know that compression, at least lz4, does not hurt performance much, but should I leave it on or would it be better to turn it off?
  2. If I run BitTorrent in a Proxmox VM, do the settings stay the same? Is 16k still a good recommendation, both for the ZFS filesystem and for the VM’s virtual disk?

Is the same size always a good recommendation for both the VM’s virtual disk and the ZFS filesystem underneath it?

I am aware that one of the biggest influences on performance is the general VDEV layout. Since storage capacity is more important to me than maximum performance, I am going with an 8-wide RAIDZ2 layout here.

Any other recommendations or things I should be tweaking? My focus here is on decent performance and minimal disk wear, to keep my disks healthy for a long time. The workload is typical for a BitTorrent instance: a mix of reads and writes.

  1. You will see basically no difference between having lz4 on vs having it off. Leave it on, is my opinion. Video files won’t compress very much, but you might as well take whatever savings you can get.
  2. Personally I don’t download much in the way of torrents anymore, but I recommend downloading to a scratch disk first to avoid increased fragmentation.
  3. Your recordsize is too low, for something like “Linux ISOs” I would go with recordsize=1M. For VM images, 64k is a good default and can be tuned from there.

I also recommend avoiding overthinking this - the only one of these situations where you’re likely to notice any sort of difference is with the VM images - make sure that you have your recordsize set properly, and you’ll be fine.
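
Concretely, that split might look like this (dataset names are just examples; for zvol-backed VMs the analogous property is volblocksize, which is set at creation):

    zfs set recordsize=1M tank/isos    # big sequential files: ISOs, video
    zfs set recordsize=64k tank/vms    # file-based VM images; tune from there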


I follow this same approach. I have an old 2TB HDD in my Proxmox server which I use as a scratch drive for torrents. After the torrents finish seeding, they are moved to my Synology NAS. (My ZFS NAS is mirrored pairs, and therefore the effective cost per TB is too high for Linux ISOs.)

For downloading, I think it’s very clear that a scratch disk is ideal.

For seeding, are there significant benefits to a scratch disk vs a ZFS RAIDZ2 or Synology Hybrid RAID?

Maybe I should change my setup to move files to the array once downloading is complete and then seed from the array.


Seeding via the array would be fine; it’s mostly about avoiding blocks of the same file getting split up too much.

Once the data is written, it doesn’t matter. Plus, if the file is being read constantly, it would be reading it from cache anyway.


Large recordsize is your friend here. If you set recordsize=1M on the dataset you’re torrenting to, there’s no need to bother with a staging directory followed by a brute-force copy.

There is essentially no difference between “maximally fragmented data, but all fragments are 1MiB or larger” and “unfragmented data” in terms of storage performance or load, on a general-purpose filesystem (as opposed to something laser-focused like a tape drive).


Thanks for your response! Recordsize 1M seems like a good idea. Most of the files I download are at least 512MB or even in the multiple GB range.

What do you mean by a scratch disk? Do you mean that I should download first and then move the torrents to another disk? I also want to seed; is there an automatic/smart way to do this directly in qBittorrent? How much of a concern is fragmentation with a recordsize of 1M?

Now I understand what a scratch disk is. The problem here is that I don’t think a torrent can ever finish seeding. I want to keep it as long as I have space. Can I automatically move it between disks/datasets with qBittorrent?

Also, I disagree, there is no price too high for freedom.

That sounds like a solution I would like to implement as well. Please keep me posted!

I am not quite sure what your recommendation is here. Do you recommend a scratch disk if I want to continue seeding or not? Would it be better to download torrents sequentially so that BitTorrent blocks aren’t all over the disk?

Thank you very much! So you are saying that as long as the recordsize is large (1M), everything else is not very important. Would it be better to set the recordsize even higher? Like 4M/8M? Most of my files are at least half a GB but there are also the typical small… CHECKSUM files :grinning:

I also don’t have a metadata VDEV, which as far as I know should take care of smaller files.

I’ve seen scratch disks recommended in the past, but if you read my earlier comment you’ll see I specifically pointed out that where you seed from doesn’t matter.

Jim seems to be of the opinion that the scratch disk in and of itself is a waste of time, and you can download directly to the zpool, just make sure your recordsize is set correctly in advance.


Yes, that’s all you need to do. If you’re torrenting to a dataset with rs=1M, for all intents and purposes fragmentation stops being an issue, even on rust disks with no support vdevs at all.

If you’re running with mirror vdevs, I wouldn’t bother with anything higher than 1M. In my extensive experience benchmarking this stuff, I tend to have difficulty telling random 1MiB I/O from random 4MiB I/O, even using benchmarking utilities and looking for improvements.

If you’re running RAIDz, you might want to experiment a bit further. Remember, every block (record) you save gets saved to a single disk vdev or mirror vdev intact–but if written to a RAIDz vdev, it’s split into (n-p) pieces, where n is the number of disks in the vdev and p is the parity level.

Understanding that, a 6-wide RAIDz2 will split an incoming block into four pieces (plus two pieces of parity data), which strongly suggests that recordsize=1M on a 6-wide Z2 is closer to recordsize=256K on a single-disk or mirror vdev; extrapolating further, it seems that recordsize=4M would be ideal for that topology.
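
To put numbers on it, the arithmetic is just this (quick shell sketch):

    # per-disk data chunk = recordsize / (n - p)
    echo $(( 1024 * 1024 / (6 - 2) ))        # 1M on a 6-wide Z2: 262144 = 256K per disk
    echo $(( 4 * 1024 * 1024 / (6 - 2) ))    # 4M on a 6-wide Z2: 1048576 = 1M per disk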

With that said, I don’t have the breadth of testing with large recordsizes on RAIDz that I do with all sizes on mirrors or single disks, so you might want to test–set recordsize=1M and torrent a fast-to-complete Linux ISO, then set recordsize=4M and torrent the exact same fast-to-complete Linux ISO. Once you’re done, do a pv < linux.iso > /dev/null for each copy, and see if there’s a significant performance difference. If there is, that’s your indication that you’d be better off with the higher performing recordsize.
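
In other words, something like this (dataset and filenames are placeholders, and remember recordsize only affects data written after you change it):

    zfs set recordsize=1M tank/torrents
    # ...torrent linux.iso, set the copy aside as linux-1M.iso...
    zfs set recordsize=4M tank/torrents
    # ...torrent the same linux.iso again, keep it as linux-4M.iso...

    # read each copy back; export/import the pool (or reboot) between runs
    # if you want to be sure ARC isn't serving the second read from cache
    pv < /tank/torrents/linux-1M.iso > /dev/null
    pv < /tank/torrents/linux-4M.iso > /dev/null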


I see. If you seed forever then you are correct. Props for seeding forever.
I seed most torrents until a 5x ratio is hit. I figure that giving 5x what I take is a fair contribution.

Fair point. My wallet is not unlimited, and for me true freedom is derived from avoiding overspending. I foresee that the combination of falling disk prices and my personal career jumps will result in a big RAIDz2 array for my Linux ISOs. Especially with RAIDz expansion!

The only reason I use Synology is that SHR-2 is so flexible with 1) varying disk sizes and 2) adding new disks to an existing pool. It’s really convenient. The Synology units were too expensive for my needs, so I simply run DSM in a VM on Proxmox.

It’s very easy to implement in qBittorrent. I believe the setting on the options page is labeled something like “save incomplete downloads to:”. This is basically your scratch disk. The other path is where qBittorrent will move the files after downloading is finished. It’s automatic and will even place them in a subfolder based on the category used in qBittorrent.
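
If you’d rather set it in the config file directly, I believe the relevant entries in qBittorrent.conf look roughly like this (key names from memory, and the paths are just examples):

    [BitTorrent]
    Session\TempPathEnabled=true
    Session\TempPath=/scratch/incomplete/
    Session\DefaultSavePath=/tank/torrents/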


I will do that. Thanks for the tip. I have an 8-wide RAIDz2 layout. In this regard (to make sure I understand what you are trying to tell me here), my data is split equally across 6 disks, with the additional two disks for parity. So in my specific case, to match the recordsize of a single-VDEV/mirror system, I would need to be in the range of 6M?

Benchmarking is always a good tip, thanks!

I have a lot of free disk space and a very fast, uncapped Internet connection. So I will be distributing Linux ISOs forever.

I wish you the best with that!

I am at a storage capacity where, if I really needed more, I would just double the pool size with a second VDEV. But yes, the flexibility of ZFS is/was not as great as that of other systems.

Thanks for the tip!

Recordsize is in powers of two, so you can’t do 6M. Used to be, you couldn’t do higher than 1M without adjusting kernel tunables, but I believe that limit has been raised; I’m not sure what the maximum is now, but I’m pretty sure you can get at least 4M without additional tunables.
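
On Linux you can check the current cap via the OpenZFS module tunable, and I believe it can be raised at runtime (value is in bytes; the default depends on your OpenZFS version):

    cat /sys/module/zfs/parameters/zfs_max_recordsize              # current cap
    echo 16777216 > /sys/module/zfs/parameters/zfs_max_recordsize  # raise to 16M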

Your eight-wide Z2 isn’t an ideal width; you have to insert padding on every block because you can’t divide a block up into six pieces evenly. So it’s a bit less performant and efficient than it might be. This is less of a problem when you have highly compressible data, but I suspect these aren’t “compressible ISOs” that we’re talking about, so…

Anyway, especially given that we’re not working with ideal widths, the idea isn’t necessarily that you wind up with exactly 1MiB written per disk–the idea is more that you get “reasonably close to 1M written per disk, per write.”

I’m sure they’re the special “viewable Linux and FreeBSD ISOs” you coined a while back. :grin:


Thanks again for the info! So “reasonably close to 1M per disk” would be a 4M recordsize on an 8-wide RAIDz2? What would be a good layout? Something like recordsize=4M on a 6-wide RAIDz2?

recordsize=4M on a 6-wide Z2 would be an ideal topology, but I don’t think it’s worth tearing down what you already built. recordsize=4M on the 8-wide Z2 you already have will almost certainly suffice for the task; you really just need to outrun 1Gbps ethernet, very likely for just one stream at a time, and you’ll be able to manage that quite easily.


Good to know, thanks! So the point here is that recordsize is not the same as the chunk written to each disk (is that the same as ZFS blocks?); the per-disk chunk depends on how many data disks the block gets split across.

Also, ideal here refers to maximum capacity and performance, right? Because if you look at the whole system holistically, it might actually be far more ideal to suck up the padding but add another disk or two to the array. Is that a correct assumption/idea?

It is not a correct assumption. Ideal means ideal; you don’t have to use an ideal width at all, but you’ll get a higher percentage of the disk’s raw capacity available if you do go with an ideal width. For RAIDz2, that’s 4-wide, 6-wide, or 10-wide.

There’s also the issue that wide RAIDz vdevs are actually worse than narrower vdevs when it comes to small blocksizes. Let’s say you need to save a 4KiB file to a 10-wide Z2: you actually wind up getting only 33% storage efficiency (a single data sector + two parity sectors), where a 2-wide mirror vdev would have offered you 50% storage efficiency: one data sector on each drive in the vdev.

What if you save a 16KiB file to the same 10-wide Z2? Now you get that promised 80% storage efficiency, right…? Well, no, because now you’re saving four data sectors plus two parity sectors: so you got the same 67% SE as a six-wide Z2.

What if you save a 16KiB file to a nine-wide Z2 instead of a six-wide Z2? More efficient because more dis…oh, you know where I’m going with this already; you still have the same 67% SE as the 10-wide, AND now you ALSO have to deal with padding inserted on any blocks large enough to actually split amongst all nine drives.
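
If you want to poke at the arithmetic yourself, here’s a rough shell sketch (it ignores RAIDz allocation padding, which rounds each block’s total footprint up to a multiple of p+1 sectors):

    # sectors used by a 16KiB block (4 data sectors at ashift=12) on RAIDz2:
    data=4; p=2
    for n in 4 6 9 10; do
      stripes=$(( (data + n - p - 1) / (n - p) ))   # ceil(data / (n - p))
      echo "width $n: $data data + $(( stripes * p )) parity sectors"
    done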

Again, if you want to ignore all this and just go “fuck it, works well enough for me,” that’s completely fine. But if you want to get the most bang for your buck, you want to set your pool up with ideal-width vdevs if possible–so your next bump up from a 6-wide Z2 is either a 10-wide Z2 or two 6-wide Z2s, not a single twelve-wide Z2.

One last time: I am well aware that Matt Ahrens advocates for “just don’t worry about it and do what you want,” and I’m not saying he’s wrong, exactly. But I do think it’s worth actually planning your storage, if you’ve got enough storage that you’re actually thinking about six or more drives in a system.