Really slow incremental receive on one dataset

Joghurt · August 13, 2024, 12:33pm

I have a few datasets that I copy to my server/backupbox using syncoid, and it worked well for a year. I now have a dataset where the incremental receive is really slow. A full receive is around 120 MB/s, but an incremental receive on that dataset is around 100 KB/s (that’s not a typo: KB, not MB!). It’s only this one dataset, all others work fine.

It’s definitely the receive; I can manually run zfs send and copy the file over with the expected performance, and receiving that file then has the same problem. The disks are also constantly writing, but they are not completely busy, I can still run a fio in parallel while the receive is running and get 75MB/s random write.

Some googling suggested setting zfs_per_txg_dirty_frees_percent to 0 might help, unfortunately, it doesn’t in my case.

Any idea what I can do to debug this further?

Some more context about this dataset, as it might be important: I recently changed the hierarchy of this set, previously, it was “zroot/home”, but recently changed it to “zroot/home-old”, created a new “zroot/home”, and then renamed the old to “zroot/home/my_username”, so I changed the hierachy where it is located. I don’t think it should make a difference, but that’s the only thing in this dataset that sets it apart from the other ones.

Drives are CMR, and I have an optane SSD as cache and log device. Pool is around 70% filled. I have dedup on one small dataset on the backup box, but only that one, and the received datasets have no dedup at all.

tenkara_vt · August 13, 2024, 5:14pm

I’m relatively new to ZFS myself and still learning, so take this with a pinch of salt.

However, the following recent topic shares a few similarities with your situation and I wasn’t sure if the solution there would work for you.

Joghurt · August 13, 2024, 5:23pm

Hi, thanks for the reply, unfortunately, their problem is different from mine. They couldn’t run an incremental, I can, it’s just really slow.

Meanwhile, I’ve tried to analyse it further, and the issue seems to be excessive transaction groups being synced.

# cat /proc/spl/kstat/zfs/zroot/txgs|head -n20                                                                                                   [19:20:38]
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime
1918777  68027989851663   C     0            0            0            0        0        211146916    2410         56258        258138
1918778  68028200998579   C     671744       0            5345280      0        288      426666       1698         60950        246510270
1918779  68028201425245   C     0            0            0            0        0        246607982    2507         57287        89935
1918780  68028448033227   C     0            0            0            0        0        155822       1492         15265        125785
1918781  68028448189049   C     675840       0            5300224      0        280      227310       1044         11835        281486189
1918782  68028448416359   C     821760       0            3690496      0        217      281543975    2482         53782        259753261
1918783  68028729960334   C     0            0            0            0        0        259855371    2729         62541        250013
1918784  68028989815705   C     671744       0            5197824      0        277      450828       1723         71885        227777832
1918785  68028990266533   C     0            0            0            0        0        227895617    2132         53864        99418
1918786  68029218162150   C     0            0            0            0        0        201026       1613         14067        124608
1918787  68029218363176   C     672768       0            5427200      0        269      254651       1932         10405        294505786
1918788  68029218617827   C     1509888      0            4071424      0        268      294564464    2227         57482        187220068
1918789  68029513182291   C     0            0            0            0        0        187288696    2875         16047        271873
1918790  68029700470987   C     671744       0            5148672      0        276      361473       2063         14880        229097204
1918791  68029700832460   C     0            0            0            0        0        229144905    2612         50234        196360
1918792  68029929977365   C     0            0            0            0        0        276763       1727         14777        245286
1918793  68029930254128   C     672256       0            5140480      0        278      348728       1603         15825        247778275
1918794  68029930602856   C     1493504      0            4145152      0        237      247824888    2278         16334        273379349
1918795  68030178427744   C     0            0            0            0        0        273436752    2220         57571        247808

so currently, the “longest” a txg is open is around 300 ms, while some get synced after 300 µs, no wonder it’s so slow. But the thing is, I don’t know what I can do about it.

Halfwalker · August 13, 2024, 5:27pm

No issues with the drives in the pool ? Any of them SMR type rather than CMR ? SMR has been known to utterly clobber zfs performance

tenkara_vt · August 13, 2024, 5:36pm

No problem. I just didn’t know if their solution would improve your slow send.

Joghurt · August 13, 2024, 6:23pm

No, as written in the post, all drives are CMR and I can write with 70MB/s while the receive is going on. All other receives are fast, it’s just this one that’s problematic for some reason

mercenary_sysadmin · August 13, 2024, 8:34pm

What is the recordsize on that dataset?

How much free space remaining does your pool have?

How many snapshots do you have of that specific dataset?

Joghurt · August 13, 2024, 9:19pm

Recordsize is 128K (though it’s possible that it was a different value in the past, but unlikely). Target dataset currently has 1M, but I’m relatively sure I also tested it with 128k (but if that could be a reason for the issue, I could double check)

Target pool has ~30% space remaining

That dataset has 70 snapshots, not including the first one that was already transferred (at normal speed). I also tried manually doing an incremental send without sanoid sending only the next snapshot, but it doesn’t make a difference. Complete dataset is around 500G, target pool is a ~22TB raidz2 with ~6T free (as reported by zpool list)

mercenary_sysadmin · August 13, 2024, 11:47pm

Target dataset recordsize doesn’t matter, replication won’t rewrite blocks to a different size than they were originally.

If you ZFS send | dev null, your speed is good, yeah?

How many snapshots do you have on the target?

Another thing to try: zfs send -i source@newestcommon source@nextsnapafternewest common > snapshot.inc ; after that scp snapshot.inc to the target, after that cat snapshot.inc | zfs receive on the target.

How’s your write speed during the scp?

How’s your write speed when doing the receive from the local hosted file?

Joghurt · August 14, 2024, 9:25am

Thanks for helping out.

For the record, the transfer has now finished after 12h for 30GB, and it looks like syncing the remaining snapshots to that dataset have the expected/normal speed. However, I’m willing to do a rollback and start it again if it helps figuring out what goes wrong in case it happens again. Let me know if that would be useful.

Regarding your questions:

$ zfs list -t snapshot | wc -l
2811

if by target you mean the target dataset, that one only had one snapshot before the problematic receive.

I did a zfs send -I @newestcommon source@mostrecent (I also did -i @newestcommon before, with the same results), and sent that file over with scp so I can retry (and be able to let it run overnight without having the PC turned on). The speeds were normal, like ~104 MB/s, that’s basically maxing out my network

Write speed during scp I haven’t tested, during the receive I ran “fio --name=fio --filename=foo --size=1G --rw=randrw --loops=10 --numjobs=5” on the server and it had around 70 MB/s.

I should probably also mention that the first snapshot in the problematic receive is 1 month later than the one on the server; it’s still just around 20 GB total data. I did run a zstream dump -v on the dataset, and there seemed to be quite a lot of CLEAR(?) commands issued, but I don’t have the data right now. I can rerun it though if it could be useful information.

One more thing I noticed just now, probably not important: I’m currently running a syncoid task and as I said, it seems to transfer the new snapshots with the expected speed (30 MB/s in this case because I’m on Wifi atm), but if I run zdump stream received_problematic_stream on the server at the same time, the receive basically stalls. If I kill zdump, it recovers.

Joghurt · August 14, 2024, 12:14pm

Summary of zstreamdump:

SUMMARY:
        Total DRR_BEGIN records = 9 (0 bytes)
        Total DRR_END records = 10 (0 bytes)
        Total DRR_OBJECT records = 7810607 (1377109984 bytes)
        Total DRR_FREEOBJECTS records = 521038 (0 bytes)
        Total DRR_WRITE records = 809384 (25142392320 bytes)
        Total DRR_WRITE_BYREF records = 0 (0 bytes)
        Total DRR_WRITE_EMBEDDED records = 0 (0 bytes)
        Total DRR_FREE records = 7932818 (0 bytes)
        Total DRR_SPILL records = 21 (10752 bytes)
        Total records = 17073887
        Total payload size = 26519513056 (0x62cafa7e0)
        Total header overhead = 5327052744 (0x13d845fc8)
        Total stream length = 31846565800 (0x76a3407a8)

mercenary_sysadmin · August 14, 2024, 4:51pm

What’s the topology of the target pool look like? Mirrors or RAIDz? How wide? How many vdevs? How about the source pool?

Would there likely be a lot of small files present in that dataset?

We’re getting fairly deep in the weeds, but I’m starting to potentially smell some surprises in how RAIDz handles small files. Replication is block-level, but small files mean undersized blocks, which can potentially put you right back into rsync-like territory with serious IOPS/fragmentation issues on large numbers of small files.

Joghurt · August 14, 2024, 5:12pm

It’s a 4-disk RAIDz2 (not ideal, I know, small home server that only has 4 ports) with an Optane drive as cache and log device.

the dataset in question is my home directory, so it’s likely that there are quite a few small files (.config). I can check later and create some statistics.

which can potentially put you right back into rsync-like territory with serious IOPS/fragmentation issues on large numbers of small files.

If that were the case, wouldn’t that also mean the initial full sync would be slow? During the problematic incremental receive, I had like 10 or more txg commits per second; I’m not surprised that slows the receive down, but I don’t understand why they happened only there (again, full receive and the incremental receive today after the problematic one were ok)

mercenary_sysadmin · August 14, 2024, 5:28pm

That would depend on whether the initial snapshot replicated fully had large numbers of small files in it, as opposed to the large number of small files being created after the initial snapshot was taken. This is also still very theoretical; I’m reaching my way towards your answer in the way one might search for a lump of coal in a cellar with a flashlight.

During the problematic incremental receive, I had like 10 or more txg commits per second; I’m not surprised that slows the receive down, but I don’t understand why they happened only there

A ton of txg commits per second with extremely underwhelming throughput in the same time frame can’t really be explained any other way than “each of those commits is very small.” That seems most likely to be from small files–although if, for instance, you had been experimenting with tiny values of txg_commit_interval and related tunables after the first snapshot was taken, that might also account for it.

Joghurt · August 14, 2024, 6:55pm

I’ve made a small script to group the file sizes. On the left is the first, full snapshot, on the right is the first snapshot of the incremental which already caused problems.

autosnap_2024-06-01_00:00:10_monthly autosnap_2024-07-03_21:41:30_monthly
≤           1 bytes:  18,828         ≤           1 bytes:  19,527
≤           2 bytes:   1,185         ≤           2 bytes:   1,185
≤           4 bytes:   2,677         ≤           4 bytes:   2,569
≤           8 bytes:   3,173         ≤           8 bytes:   3,467
≤          16 bytes:   7,333         ≤          16 bytes:   8,108
≤          32 bytes:  20,696         ≤          32 bytes:  21,332
≤          64 bytes:  85,449         ≤          64 bytes:  87,202
≤         128 bytes: 100,749         ≤         128 bytes: 104,293
≤         256 bytes: 160,033         ≤         256 bytes: 166,926
≤         512 bytes: 234,470         ≤         512 bytes: 234,486
≤        1024 bytes: 307,388         ≤        1024 bytes: 307,700
≤        2048 bytes: 432,154         ≤        2048 bytes: 431,440
≤        4096 bytes: 345,148         ≤        4096 bytes: 336,496
≤        8192 bytes: 347,258         ≤        8192 bytes: 347,291
≤       16384 bytes: 217,849         ≤       16384 bytes: 213,929
≤       32768 bytes: 155,891         ≤       32768 bytes: 154,934
≤       65536 bytes: 111,177         ≤       65536 bytes: 111,188
≤      131072 bytes:  74,892         ≤      131072 bytes:  74,321
≤      262144 bytes:  45,665         ≤      262144 bytes:  45,394
≤      524288 bytes:  22,326         ≤      524288 bytes:  22,762
≤     1048576 bytes:  18,803         ≤     1048576 bytes:  19,363
≤     2097152 bytes:  11,589         ≤     2097152 bytes:  11,622
≤     4194304 bytes:   4,644         ≤     4194304 bytes:   4,885
≤     8388608 bytes:   3,958         ≤     8388608 bytes:   4,032
≤    16777216 bytes:   7,699         ≤    16777216 bytes:   7,709
≤    33554432 bytes:   4,066         ≤    33554432 bytes:   4,040
≤    67108864 bytes:     549         ≤    67108864 bytes:     543
≤   134217728 bytes:     308         ≤   134217728 bytes:     308
≤   268435456 bytes:     135         ≤   268435456 bytes:     136
≤   536870912 bytes:      79         ≤   536870912 bytes:      80
≤  1073741824 bytes:      53         ≤  1073741824 bytes:      49
≤  2147483648 bytes:      16         ≤  2147483648 bytes:      16
≤  4294967296 bytes:      11         ≤  4294967296 bytes:      11
≤  8589934592 bytes:       7         ≤  8589934592 bytes:       7
≤ 17179869184 bytes:       2         ≤ 17179869184 bytes:       2
≤ 34359738368 bytes:       2         ≤ 34359738368 bytes:       2
≤ 68719476736 bytes:       1         ≤ 68719476736 bytes:       1

That seems most likely to be from small files–although if, for instance, you had been experimenting with tiny values of txg_commit_interval and related tunables after the first snapshot was taken, that might also account for it.

I did indeed change the txg_commit_interval on both source (60) and target (30), and knowing me, I also did some experiments on the source set like setting the interval to like 1 s or maybe even lower. But I can’t find anything in my shell history. Also, if the value was very low, it probably was only for a few minutes up an hour max. I’m also pretty sure those experiments were all done before June, which is when the first snapshot was taken.

However, from time to time I do some experiments with ZFS by creating a few files and creating a test pool from it, and sometimes I create the experimental files in my home directory instead of a special dataset that is not snapshotted. I then do some things you are not supposed to do, just to see what happens, eg. creating a volume in a test pool and adding that volume to the same pool.
So it could be possible that some of those experiments made it into a snapshot, however, I’m pretty sure I did most of those experiments before June, so they shouldn’t be in the snapshot. Also those experiments were like 30-60 minutes long.

mercenary_sysadmin · August 14, 2024, 9:10pm

This definitely strikes me as unexpected outcome of an experiment.

Like, I’m not at all trying to blow you off, but this is extremely unusual behavior, and the only thing that makes it make sense for me is “from time to time I do experiments… and some things you are not supposed to do, just to see what happens.”

Joghurt · August 15, 2024, 10:13am

I mean, if that is the only explanation so far, I’m williing to accept it, even though I find it unlikely. Since the problematic snapshot is now on the backup box, the problem isn’t that pressing anymore, and I’m ok with letting the matter rest. But I’m willing to do some experiments to confirm that hypothesis if that would be useful.

Just to clarify, all experiments I do are on a separate newly created pool, but that pool is created on files located in the real pool. The only thing I’ve done that had a global effect was chaging txg_timeout, but im 99% sure that was long before the first snapshot, but that means there’s still a 1% uncertainty.

If I understand you correctly, it seems like a send stream replicates all the txgs that happened between those 2 snapshots? If that is the case, for science I could try setting txg_timeout to 0 or whatever zfs allows and copy a few largish files, then take another snapshot and send that to my box, and if that snapshot also takes ages to complete, it would be a good indication that’s the root cause; but if not, would that rule it out? Is there something else I could do besides lowering txg_timeout to make it more likely for the problem to appear?

mercenary_sysadmin · August 15, 2024, 10:26pm

I believe so, but all I can tell you for absolute certain is that the individual blocks are replicated over as-is. So, for example, if you’ve got recordsize=4K on the source, and recordsize=1M on the target, you’re still going to have files composed of 4KiB blocks, not 1MiB blocks, after replicating between the two.