Optimal network transport for zfs-send/zfs-receive?

Can anyone point me to guidelines or best practices for configuring ZFS send/receive to maximize speed over dedicated fiber between servers? The 10Gbps NICs are running near rated speed with 9k MTU (according to iperf3), but I suspect that SSH, at least in its default configuration, is not the quickest pipe.

  • Would something like socat be better?
  • Or, an ssh that supports the “none” cipher?
  • Enable/Disable compression in ssh and/or zfs-send?
  • Worth messing around with kernel or NIC-driver tuning options?
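
For concreteness, the raw-pipe option I have in mind would look something like this. Sketch only: the host and dataset names are made up, and nc provides no encryption or authentication, so it only makes sense on a private wire.

```shell
#!/bin/sh
# Dry run: echo the two halves of an unencrypted zfs-send pipe
# instead of executing them. The receiver must be started first.
PORT=9000
echo "on receiver: nc -l $PORT | zfs receive -F zfast/test"
echo "on sender:   zfs send zroot/test@snap | nc receiver-host $PORT"
```

socat would slot into the same spot, with more knobs for buffer sizes.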

Before I run a bunch of experiments I thought I’d ask – doubtless others have figured this out. :thinking:

Other details: all systems are currently FreeBSD 15 RELEASE. NICs are an assortment of Intel and Broadcom. Plenty of RAM, but CPUs of varying power.

1 Like

CPU tends to be the bottleneck when you want to saturate 10 Gbps or more. An awful lot of network protocols, and SSH is no exception, run single-threaded, meaning you're limited to your CPU's maximum single-threaded performance when trying to push a single job down that 10 Gbps pipe.

What you really need to do is figure out how to parallelize your processes so that you can get more than one CPU thread at a time involved in the send.

Indeed: some rough tests with iperf3 suggest 4-6 threads are needed to fill the pipe. Getting rid of the crypto might help; or, I suppose, getting faster computers :wink: None of my CPUs are exactly top of the line.
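
For reference, the multi-stream iperf3 comparison looks like this (hostname is made up; -P sets the parallel stream count, -t the test duration):

```shell
#!/bin/sh
# Dry run: print the iperf3 invocations for 1 vs 6 parallel streams.
HOST="europium10g"
for p in 1 6; do
    echo "iperf3 -c $HOST -P $p -t 30"
done
```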

One can imagine a simple streaming split-merge algorithm to do the job, which I could probably code up myself, but would rather not, since surely it has been done…?

GitHub's streamsplit is close to a proof of concept (though also, the author admits, an excuse to learn Rust). Any shop that needs to move big data around would seem to benefit from this, ZFS or not. Please tell me somebody has built a maintained FOSS utility for this job. Maybe OpenZFS could integrate such a tool.

2 Likes

It's an old article/paper, but it got me thinking about Murder, the BitTorrent-based system the Twitter engineers (back when it was Twitter) came up with for a similar-ish scenario.

Perhaps more appropriately to this use case, I wonder whether bbcp might fit the bill?

There are some docs on the repo, and a science laboratory that uses it provides some more examples.
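
For a plain file copy, bbcp usage is roughly this (path and host are hypothetical; -s sets the number of parallel TCP streams, -P a progress interval in seconds):

```shell
#!/bin/sh
# Dry run: print a basic multi-stream bbcp file copy.
STREAMS=5
echo "bbcp -P 2 -s $STREAMS /tank/bigfile root@europium10g:/zfast/"
```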

Good luck!

Depending on how many filesystems your source has, you could send multiple in the background. Your send script would need a plan to accommodate middle-branch filesystems that have to be transported non-recursively.
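
A minimal sketch of that idea, with made-up dataset and host names. The echo makes it a dry run; drop it to actually transfer.

```shell
#!/bin/sh
# One background ssh pipe per child filesystem, then wait for all,
# so several CPU threads share the encryption work.
SNAP="bbtest"
REMOTE="root@europium10g"
for fs in zroot/data/a zroot/data/b zroot/data/c; do
    # dry run: echo the pipeline instead of executing it
    echo "zfs send ${fs}@${SNAP} | ssh ${REMOTE} zfs receive zfast/${fs#zroot/}" &
done
wait   # don't exit until every background send has finished
```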

1 Like

Very interesting idea :dizzy:

The code hasn't been touched in a decade, but with minor changes it compiles and runs on FreeBSD 15 (and Linux), a sign of thoughtful coding.

In preliminary tests, a large file that took 26 seconds with a simple scp copies in 4.5 seconds using bbcp with 5 threads (in my test case that seems to be the optimal count; larger or smaller thread counts yield poorer performance). Not full capacity, but a very useful speedup!

I have not yet figured out how to trick it into accepting a pipe/stream input. There are a bunch of stat() and seek() calls in the source, and it is checking some file sizes, which might mean it won’t work with streams of unknown size…will continue to study.

3 Likes

I am very interested in your continuing study.

1 Like

Actually, it turns out the crafty authors of bbcp thought of this use case and provided command-line options to support it. The documentation (bbcp -h) is, uh, terse but accurate.

I was able to dd random numbers through it at a rate that nearly filled the link. ZFS send/receive also works fine, and I see a significant speedup: a send that took 260 seconds conventionally finishes in about 45 seconds with bbcp. However, this is well below the capacity of the private wire, and it does not scale with more channels. Tentatively concluding that something else (maybe my slower SATA SSD on one end :roll_eyes: ) is the limiting factor. YMMV. Perhaps somebody with modern hardware could properly benchmark.

#!/bin/sh

SRCPROC="zfs send zroot/test@bbtest"
DSTPROC="root@europium10g:zfs receive zfast/test"

# compression:  0 seems best, ymmv
CP=0

# network channel count. Optimal number varies.
MP=6

echo "bbcp send with $MP streams and compression level $CP"
echo "   from:  $SRCPROC"
echo "     to:  $DSTPROC"

#  -N i  :   input is a program       -N o : output is a program

time /usr/local/bin/bbcp -v -c $CP -s $MP -4 \
    -T "ssh -4 root@europium10g /usr/local/bin/bbcp" \
    -N i  -N o  "$SRCPROC"  "$DSTPROC"

Note: ssh compression is disabled in .ssh/config.

I suspect there is more speed that someone wilier could wring out of this. Also conjecture: people who move big data around day and night are using even better tools that I just don’t know about, yet.

2 Likes

What processor do you have? Generally speaking you probably shouldn’t expect to get better results than you get from [core count+1].

Note that’s CORE count, not thread count. Hyperthreading won’t help with this task, and you don’t want to overload the CPU and force unnecessary context switching if you can avoid it.

Incidentally, this doesn’t look like a ready to go solution for your use case, but you might still find it of relevant interest: https://dl.acm.org/doi/fullHtml/10.1145/3569951.3597582

Hi

I have read about a tool called mbuffer(1) that supports multiple threads and configurable block sizes and can be used over the network. I haven't tested it, since SSH is enough for me, but you can look for more information about it.

Syncoid actually uses mbuffer to accelerate connections by default, but it runs mbuffer beneath SSH for obvious security reasons. So mbuffer accelerates by way of literal buffering: if you've got some periods where the storage outruns the network and others where it's the other way around, you're not slowed unnecessarily. It can't do anything about parallelism, though.
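
For what it's worth, mbuffer also has a direct TCP mode with no ssh in the data path. A sketch with invented names (-s is the block size, -m the buffer memory, -I/-O the listen/connect endpoints):

```shell
#!/bin/sh
# Dry run: print both halves of an mbuffer-to-mbuffer transfer.
PORT=9090
echo "receiver: mbuffer -I $PORT -s 128k -m 1G | zfs receive -F zfast/test"
echo "sender:   zfs send zroot/test@snap | mbuffer -O desthost:$PORT -s 128k -m 1G"
```

Like the nc approach, this puts the stream on the wire in cleartext, so it's only reasonable on a trusted private link.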

Hi @mercenary_sysadmin

I wasn't familiar with the Syncoid tool, but to replace SSH you could run the traffic over a dedicated VPN, so the link stays protected without SSH in the data path, and still take advantage of mbuffer.

Edit: I’ve seen that the community has a category for the application mentioned, I’ll read about it.

This is an esoteric reply … TCP is extraordinarily forgiving by nature, and only in certain situations can a single TCP stream fill a pipe. Parallel downloading (or, in some circles, multi-connection transfer) convinces a group of connections to negotiate as much bandwidth as possible.

The biggest slowdown I've seen with the zfs send process is the receiving side pausing to sync the snapshot metadata to permanent storage. Options like --no-stream are a useful way to avoid pausing thousands of times during a stream send.
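
If I follow, that maps onto the difference between zfs send -I (replay every intermediate snapshot, one receive pause each) and -i (a single incremental straight to the newest snapshot). A sketch with hypothetical snapshot names:

```shell
#!/bin/sh
# Dry run: -I sends all intermediate snapshots, -i sends one delta.
OLD="zroot/test@old"
NEW="zroot/test@new"
echo "many pauses: zfs send -I $OLD $NEW | <transport>"
echo "one pause:   zfs send -i $OLD $NEW | <transport>"
```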

Interesting. There seem to be a number of options I've not experimented with yet. I'm in the process of getting better test machines; if I see any interesting results, I will post them on this forum. The SATA-3 link (6 Gbps, or whatever Supermicro's second-best motherboard's SATA controller can muster) is likely one of the anchors I'm dragging along in the current test jig. Thanks to all who responded for your suggestions!