I've noticed that the monthly scrub of my Proxmox box's 10x10TB array (over 2 years with no issues) is drastically slowing down. I have some details of the system and what I've checked below; does anyone have an idea of what else I could do to figure this out?
I monitor and record all SMART data in InfluxDB and plot it; no fail or pre-fail indicators show up. I've also manually checked smartctl -a on all drives.
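In case anyone wants to repeat that check quickly, a one-off pass over all the drives can be scripted; this is just a sketch and assumes the ten pool disks show up as /dev/sda through /dev/sdj, which may not match your system:

# print the health verdict plus the usual pre-fail attributes for each disk
for d in /dev/sd[a-j]; do
    echo "== $d =="
    smartctl -H -A "$d" | grep -Ei 'overall-health|Reallocated|Pending|Offline_Unc|UDMA_CRC'
done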
dmesg shows no errors. The drives are connected over three SFF-8643 cables to an LSI 9300-16i; the system is a 5950X with 128GB RAM, and the LSI card sits in the first PCIe x16 slot, running at PCIe 3.0 x8.
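(For anyone wanting to verify the same thing: the negotiated link speed and width can be read from lspci. The bus address below is only a placeholder; find yours with the first command.)

lspci | grep -i sas                              # locate the HBA's PCI address
lspci -vv -s 01:00.0 | grep -E 'LnkCap|LnkSta'   # 01:00.0 is a placeholder address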
The OS is always kept up to date; these are my current package versions:
libzfs4linux/stable,now 2.1.12-pve1 amd64 [installed,automatic]
zfs-initramfs/stable,now 2.1.12-pve1 all [installed]
zfs-zed/stable,now 2.1.12-pve1 amd64 [installed]
zfsutils-linux/stable,now 2.1.12-pve1 amd64 [installed]
proxmox-kernel-6.2.16-6-pve/stable,now 6.2.16-7 amd64 [installed,automatic]
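(A list in that format can be pulled with something like the following, in case anyone wants to compare their own versions:)

apt list --installed 2>/dev/null | grep -E 'zfs|kernel'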
As the scrub runs, it slows down and takes hours to move a single percentage point. The time estimate goes up a little every time, but there are no errors. This run started with an estimate of 7 hrs 50 min (which is about normal):
  pool: pool0
 state: ONLINE
  scan: scrub in progress since Wed Aug 16 09:35:40 2023
        13.9T scanned at 1.96G/s, 6.43T issued at 929M/s, 35.2T total
        0B repaired, 18.25% done, 09:01:31 to go
config:

        NAME                              STATE     READ WRITE CKSUM
        pool0                             ONLINE       0     0     0
          raidz2-0                        ONLINE       0     0     0
            ata-WDC_WD100EFAX-68LHPN0_    ONLINE       0     0     0
            ata-WDC_WD100EFAX-68LHPN0_    ONLINE       0     0     0
            ata-WDC_WD100EFAX-68LHPN0_    ONLINE       0     0     0
            ata-WDC_WD100EFAX-68LHPN0_    ONLINE       0     0     0
            ata-WDC_WD100EFAX-68LHPN0_    ONLINE       0     0     0
            ata-WDC_WD100EFAX-68LHPN0_    ONLINE       0     0     0
            ata-WDC_WD100EFAX-68LHPN0_    ONLINE       0     0     0
            ata-WDC_WD100EFAX-68LHPN0_    ONLINE       0     0     0
            ata-WDC_WD101EFAX-68LDBN0_    ONLINE       0     0     0
            ata-WDC_WD101EFAX-68LDBN0_    ONLINE       0     0     0
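One way to quantify the slowdown instead of eyeballing the estimate is to log the scan/issue lines periodically; a rough sketch, with an arbitrary interval and log path:

# append a timestamped progress snapshot every 5 minutes
while true; do
    { date -Is; zpool status pool0 | grep -E 'scanned|done'; } >> /root/scrub-progress.log
    sleep 300
done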
How full is the pool getting? I’m having to kinda guess by the state of that scrub process line you posted, but if I’m interpreting it correctly, it looks like it’s pretty full.
The fuller a pool gets, the more fragmented its free space becomes. The more fragmented its free space becomes, the more fragmented new writes necessarily become. The more fragmented your writes are, the more seeks a scrub (or any other access) needs to perform per TiB of data.
zpool list
NAME    SIZE   ALLOC  FREE   CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
pool0   91.0T  35.6T  55.4T      -         -      3%  39%  1.00x  ONLINE  -
This is the space for the pool in question:
zfs list pool0
NAME    USED   AVAIL  REFER  MOUNTPOINT
pool0   27.1T  42.1T   384K  /pool0
I ran an iostat on the pool as the scrub decelerates; it doesn't seem like any individual disk is slow, but I'll try to hdparm -t each of them over the weekend.
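For anyone following along, this is roughly what I mean; the device paths are just examples and should be whatever your pool members show up as:

# per-vdev throughput and latency while the scrub runs (10-second intervals)
zpool iostat -vl pool0 10

# raw sequential read test on each member disk, skipping partition symlinks
for d in /dev/disk/by-id/ata-WDC_WD10*EFAX*; do
    case "$d" in *-part*) continue ;; esac
    echo "== $d =="
    hdparm -t "$d"
done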
An update: I booted the system from a Proxmox 7.4 install ISO (these scrub stalls only started after I upgraded to pve8), and the scrub finished in 8 hours, just as it has done for the last few years!
Some details on the boot environment:
linux-5.15.101-1-pve
zfs-2.1.9-pve1
zfs-kmod-2.1.9-pve1
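(For the record, whichever environment is booted, the running versions can be confirmed with:)

uname -r                       # running kernel
zfs version                    # prints zfs-x.y.z and zfs-kmod-x.y.z
cat /sys/module/zfs/version    # loaded kernel module, as a cross-check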
If anyone has a clue what is going on, I’d like to know.
My next steps:
a/ boot back into pve 8 and see if I've tweaked some sysctl, module parameter, etc. that could cause this (see the sketch after this list)
b/ reinstall pve8, for which I need to order some new boot drives
c/ re-read the ZFS changelog for answers
d/ report this on the proxmox forums?
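For step a/, one way to compare ZFS tunables between the two environments is to dump the module parameters in each boot and diff them; a sketch (file names are arbitrary, and the dumps need to go somewhere persistent if you're booted from the install ISO):

# in each boot environment, dump every ZFS module parameter
grep . /sys/module/zfs/parameters/* | sort > /root/zfs-params-$(uname -r).txt

# then compare the two dumps
diff /root/zfs-params-5.15.101-1-pve.txt /root/zfs-params-6.2.16-6-pve.txt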
The fuller a pool gets, the more fragmented its free space becomes. The more fragmented its free space becomes, the more fragmented new writes necessarily become. The more fragmented your writes are, the more seeks a scrub (or any other access) needs to perform per TiB of data.
Sorry to revive this thread; I'm trying to understand more about fragmentation.
It makes sense that the less free space there is, the more fragmented writes become. If you create more free space by deleting data, does this defragment the new free space? I would guess not, as deletions would become very expensive. So is it the case that, aside from resilvering a drive, the fragmentation level depends on however much data was ever written to the drive, rather than on the amount of free space currently available?
Not really. The more you delete, the larger the contiguous blocks of free space will be. You can't defrag metaslabs directly, of course, but if you're at 90% full and you delete enough stuff to be at 75% full, you're going to have considerably less-fragmented areas of free space after the deletions than before.
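If you want to put numbers on it, the pool-wide free-space fragmentation metric is the FRAG column in zpool list, and zdb can break it down per metaslab (the zdb pass is read-only but can take a while on a big pool, and its output format varies between versions):

zpool list -o name,size,alloc,free,frag,cap pool0
zdb -m pool0 | less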
Also note: the effects of fragmentation are tremendously worse with smaller recordsizes than with larger ones. If you’re storing large files on datasets with recordsize=1M, you will effectively never suffer any real fragmentation issues, because you’ll never hit worse access patterns than 1MiB random I/O.
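Concretely, checking and changing it is just a property get/set; note that recordsize only affects blocks written after the change, and pool0/media below is just a made-up example dataset:

zfs get -r recordsize pool0          # see what each dataset currently uses
zfs set recordsize=1M pool0/media    # newly written blocks in that dataset use up to 1MiB records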