Syncoid -r hangs

Good morning. I recently migrated my server from Ubuntu to FreeBSD. I had a working Sanoid setup before. Since the migration, I’ve tried to send some datasets both to other pools on the same machine and to the remote host (unchanged) that I was using previously, and I keep ending up in the situation in the screenshot: the transfer begins without any errors, but at some point it always hangs. I have confirmed with zpool iostat that the relevant disks have little to no activity while this is happening. I have a minimally edited sanoid.conf that has worked for me since I started using Sanoid. I’m kind of lost on what else to do to troubleshoot this and could use some help. I tried to post additional screenshots, but the system won’t let me because I’m new. I can provide additional information as needed.

Terminal output:

You can copy the text from your terminal and either select the code box/preformatted text box in the Discourse editing toolbar, or you can start a new line with three backticks (```) to open a code box. Just make sure you close the box with another line of three backticks at the end.
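For example, typed into the post editor it would look like this (the command shown is just an illustration; the opening and closing backticks each go on their own line):

```
root@prod0:~ # zpool iostat files 5
...pasted output here...
```

When the post renders, everything between the two backtick lines shows up as preformatted text.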

Doing that, you don’t have to worry about uploading screenshots. Besides, it’s much easier for other users to read.


In your screenshot, see the command line that begins with ‘zfs send’? That’s the actual command syncoid constructs for you, and you can run the same thing from the command line yourself.

So, here’s the next thing you want to do for troubleshooting purposes: see where the ‘zfs send’ command then gets piped into ‘mbuffer’ and then ‘pv’? Change that pv command to pv > /dev/null and then try running the command.
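In other words, keep everything up through pv and redirect it to /dev/null, something like this (substitute the actual resume token from your own command for the placeholder):

root@prod0:~ # zfs send -t <your-resume-token> | mbuffer -q -s 128k -m 16M | pv > /dev/null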

What you’re finding out by doing that is whether it’s the zfs send process that’s hanging, or something after the zfs send process. So let’s give that a shot, and report back please.

Thanks for your time Mr. Salter. Hopefully I’ve done this correctly:

root@prod0:~ # zfs send  -t 1-ef666d076-e8-789c636064000310a500c4ec50360710e72765a5269730301842d560c8a7a515a7968064f8e0f26c48f2499525a9c540fa8553d8216cfa4bf2d34b335318187a4367f62eba3a573002499e132c9f97989bcac0909699935aac9f9c55a41fe4ef1bec505c99979c9f99129f94989c5d5a501c6f646064a26b60ae6b6461656000428646baeebe21ba06a6400e0314000011ee25e5 | mbuffer  -q -s 128k -m 16M | pv > /dev/null |  zfs receive  -s -F 'new_files/files/cjr/ROMS' 2>&1
cannot receive: failed to read from stream

I’m guessing “cannot receive:…” is the culprit?

no, you were supposed to STOP the command at “pv > /dev/null”, not then add the rest of the stuff to the end of it. :slight_smile:

So this then?

root@prod0:~ # zfs send  -t 1-ef666d076-e8-789c636064000310a500c4ec50360710e72765a5269730301842d560c8a7a515a7968064f8e0f26c48f2499525a9c540fa8553d8216cfa4bf2d34b335318187a4367f62eba3a573002499e132c9f97989bcac0909699935aac9f9c55a41fe4ef1bec505c99979c9f99129f94989c5d5a501c6f646064a26b60ae6b6461656000428646baeebe21ba06a6400e0314000011ee25e5 | mbuffer  -q -s 128k -m 16M | pv > /dev/null
0.00  B 0:00:00 [0.00  B/s] [<=>                                                                                                                                                                                                                       ]
mbuffer: error: outputThread: error writing to <stdout> at offset 0x0: Broken pipe
mbuffer: warning: error during output to <stdout>: Broken pipe
warning: cannot send 'files/cjr/ROMS@syncoid_backups_2024-07-28:00:00:12-GMT-05:00': signal received

OK, now we’re getting somewhere. This is a FreeBSD system, right?

I think pv is broken in pkgs on FreeBSD right now. Try pkg remove pv, then run syncoid again; without pv available, syncoid drops back to a more primitive built-in progress bar.

If your syncoid command completes fine with pv uninstalled, that confirms the bug is in pv, which you’ll need to take up with FreeBSD upstream. You should be able to install an older version or compile from source or something, but I’m not the last-mile support for that. :slight_smile:
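That said, compiling an older pv release from source is pretty generic. Roughly (the URL and version are placeholders; grab whichever older release you want to test from the pv project’s site):

root@prod0:~ # fetch <url-of-an-older-pv-tarball>
root@prod0:~ # tar xf pv-<version>.tar.gz
root@prod0:~ # cd pv-<version> && ./configure && make && make install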

It is FreeBSD. But it wants to yank sanoid out as well.

root@prod0:~ # pkg remove pv
Checking integrity... done (0 conflicting)
Deinstallation has been requested for the following 2 packages (of 0 packages in the universe):

Installed packages to be REMOVED:
	pv: 1.8.10
	sanoid: 2.2.0

Ok, then before we do that, try just piping the zfs send straight to /dev/null.

So zfs send (arguments from your debug output) > /dev/null.

You won’t get a progress bar for that, which sucks, but it’ll tell us whether the zfs send itself hangs, or whether the problem is further down the pipeline in mbuffer or pv.

I took some action before I saw your latest message. I can go back and do as you recommended once what I did finishes. What I did was remove the pv package and, consequently, sanoid. Then I went into the Makefile for the sanoid port, deleted the pv dependency, and built and installed the port. It is a tad indelicate, but I’m watching it with zpool iostat and so far it is working. Does that do enough to confirm the issue?

Edit: I created a boot environment first. I can roll back if it really screws something up.

Never mind. It hung up again:

                           capacity     operations     bandwidth 
pool                     alloc   free   read  write   read  write
-----------------------  -----  -----  -----  -----  -----  -----
files                    1.19T  2.43T      0      0      0      0
  mirror-0               1.19T  2.43T      0      0      0      0
    gpt/WD-WCC7K4VNZP91      -      -      0      0      0      0
    gpt/WD-WCC7K6CU8NCR      -      -      0      0      0      0
-----------------------  -----  -----  -----  -----  -----  -----
                           capacity     operations     bandwidth 
pool                     alloc   free   read  write   read  write
-----------------------  -----  -----  -----  -----  -----  -----
files                    1.19T  2.43T      0      0      0      0
  mirror-0               1.19T  2.43T      0      0      0      0
    gpt/WD-WCC7K4VNZP91      -      -      0      0      0      0
    gpt/WD-WCC7K6CU8NCR      -      -      0      0      0      0
-----------------------  -----  -----  -----  -----  -----  -----
(and so on; the output repeated like this, with zero activity, for every interval)

Since you got a deadlock again, let’s just pipe the zfs send straight to /dev/null, please. If that also locks up, there’s something wrong with your pool. If the zfs send piped to /dev/null completes fine, then we need to look further down the chain.

I had to quit last night. I left the command running and it finished eventually. I went ahead and ran the command you asked for:

root@prod0:~ # zfs send  -t 1-ef666d076-e8-789c636064000310a500c4ec50360710e72765a5269730301842d560c8a7a515a7968064f8e0f26c48f2499525a9c540fa8553d8216cfa4bf2d34b335318187a4367f62eba3a573002499e132c9f97989bcac0909699935aac9f9c55a41fe4ef1bec505c99979c9f99129f94989c5d5a501c6f646064a26b60ae6b6461656000428646baeebe21ba06a6400e0314000011ee25e5 > /dev/null


I’m not sure where I stand now.

Well, we know that at a minimum, your system will eventually get through a zfs send. Unfortunately, we have no idea how long that takes, or whether there are long hangs in the middle.

You could maybe try it again, but this time run it under time: time zfs send [arguments] > /dev/null ?

As long as you know how large the send stream actually is, which you can get up front by adding the --dryrun argument to the zfs send command, you can compare that with the elapsed time to figure out the average speed, which will in turn tell you how long you really ought to wait before concluding that the original syncoid would never complete.

That’ll also give us some indication about what to expect in general, how practical it might be to try manually scp’ing the send stream over to the other machine to manually zfs receive it there (for troubleshooting purposes, not as a permanent solution), etc.
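Concretely, that’s two commands, something like this (the dataset and snapshot names here are just placeholders; use one of your own):

root@prod0:~ # zfs send --dryrun -v files/cjr/SomeDataset@some_snapshot
root@prod0:~ # time zfs send files/cjr/SomeDataset@some_snapshot > /dev/null

The first only prints an estimated stream size without sending anything; the second actually streams the data to /dev/null and reports how long it took.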

Here is, hopefully, what you asked for. It really seems to not like the snapshots from 7-28 for some reason.

root@prod0:~ # zfs send -nv files/cjr/Pictures@syncoid_backups_2024-07-28:00:00:10-GMT-05:00 > /dev/null
root@prod0:~ # zfs send -nv files/cjr/Pictures@syncoid_backups_2024-07-28:00:00:10-GMT-05:00 |zfs recv media/test/garbage
cannot receive: failed to read from stream
root@prod0:~ # time zfs send  files/cjr/Pictures@syncoid_backups_2024-07-28:00:00:10-GMT-05:00 > /dev/null 
      586.80 real         0.00 user        16.44 sys

I cloned that snapshot and mounted it where I could look at the innards. I don’t see anything wrong, but I did not look through all 20 years’ worth of pictures.

Not quite. You don’t pipe the dry-run send anywhere. Piping it to /dev/null just hid the output (which would have told us how large the stream is).

root@prod0:/mnt/broken # zfs send --dryrun -v files/cjr/Pictures@syncoid_backups_2024-07-28:00:00:10-GMT-05:00
full send of files/cjr/Pictures@syncoid_backups_2024-07-28:00:00:10-GMT-05:00 estimated size is 110G
total estimated size is 110G

Ok good, you got around 200MiB/sec (an estimated 110G stream in roughly 587 seconds works out to about 190MiB/sec). Perfectly healthy. The problem isn’t the send process. I still suspect it’s a bug in pv or mbuffer, but let’s keep isolating out “things that work by themselves” to be sure.

Can you run iperf3 on both sides to look for network issues? (You may need to pkg install it first.)

you@source:~$ iperf3 -s

you@target:~$ iperf3 -c source ; iperf3 -Rc source

This will show us what the network looks like in both directions.

Caught a bunch of OT this week. Sorry it took so long.

OK, from the looks of those iperf runs, you’ve clearly got one of the network interface chipsets that FreeBSD doesn’t have a decent driver for: 877Mbps in one direction is decent if mediocre, but paired with 640Mbps runs in the other direction, it’s clear this is one of the chipsets FreeBSD doesn’t like. But while that’s an issue, I’m not sure it’s the issue you’re looking to solve.

Before asking you to buy a different network card, let’s make sure that you can send and receive successfully when the network is not in play:

root@prod0:~ # zfs send  -t 1-ef666d076-e8-789c636064000310a500c4ec50360710e72765a5269730301842d560c8a7a515a7968064f8e0f26c48f2499525a9c540fa8553d8216cfa4bf2d34b335318187a4367f62eba3a573002499e132c9f97989bcac0909699935aac9f9c55a41fe4ef1bec505c99979c9f99129f94989c5d5a501c6f646064a26b60ae6b6461656000428646baeebe21ba06a6400e0314000011ee25e5  > /path/to/send.raw

Then rsync or scp that file to the other machine, where you’ll run this command:

root@target:~# cat /path/to/send.raw |  zfs receive  -s -F 'new_files/files/cjr/ROMS'
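For the copy step itself, plain scp would look something like this (assuming root ssh access to the box I’m calling “target”; adjust paths and hostnames to suit):

root@prod0:~ # scp /path/to/send.raw root@target:/path/to/send.raw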

That will both make sure that you actually can send and receive between these datasets successfully, and get you up-to-date with backups for the moment while we look at our next move.