Slow read, fast write and replacement config

erschand · August 7, 2023, 4:12pm

Hi. Posting here as I have no idea where else to ask. If this is wrong, please let me know.

I have an Oracle (Sun branded) 7320 system. Pretty much the one listed here:
https://dogemicrosystems.ca/pub/Sun/System_Handbook/Sun_syshbk_V4.1/Systems/7320/components.html
The Sun Fire X4170 with the disk shelf.
It is running Solaris 11 Express.
The disk shelf contains 20 x 3TB SAS disks and 2 x 200GB SAS SSD.
All connected through 10Gb network links. iperf tests show 9.7Gb/s or more to all systems.

I have a zpool created called “sunnas”
It looks like this:
pool: sunnas
state: ONLINE
scan: resilvered 4.80T in 19h24m with 0 errors on Wed Jun 14 17:47:04 2023
config:

    NAME                       STATE     READ WRITE CKSUM
    sunnas                     ONLINE       0     0     0
      raidz2-0                 ONLINE       0     0     0
        c3t5000C50040CFADE7d0  ONLINE       0     0     0
        c3t5000C50040CFAE17d0  ONLINE       0     0     0
        c3t5000C50040CFAFA7d0  ONLINE       0     0     0
        c3t5000C500413D5F07d0  ONLINE       0     0     0
        c3t5000C500413D45E3d0  ONLINE       0     0     0
        c3t5000C500413D504Fd0  ONLINE       0     0     0
        c3t5000C500413D577Fd0  ONLINE       0     0     0
        c3t5000C500413D687Bd0  ONLINE       0     0     0
        c3t5000C500413D4663d0  ONLINE       0     0     0
        c3t5000C500413D8587d0  ONLINE       0     0     0
        c3t5000C500413DCF4Bd0  ONLINE       0     0     0
      raidz2-4                 ONLINE       0     0     0
        c3t5000C50040CFADC3d0  ONLINE       0     0     0
        c3t5000C50040CFAE07d0  ONLINE       0     0     0
        c3t5000C50040CFAFF3d0  ONLINE       0     0     0
        c3t5000C50040CFB497d0  ONLINE       0     0     0
        c3t5000C500413D8C37d0  ONLINE       0     0     0
        c3t5000C500413D57D7d0  ONLINE       0     0     0
        c3t5000C500413D88BFd0  ONLINE       0     0     0
    cache
      c3t5000CCA04E0A805Cd0    ONLINE       0     0     0
      c3t5000CCA04E0A6354d0    ONLINE       0     0     0
    spares
      c3t5000CCA03EC9977Cd0    AVAIL

If I flip the cache SSDs to log, I can sustain 1GB/s writes.
In both configurations (SSDs as cache, as shown above, or as log), I can only get about 116MB/s reads. With the SSDs as cache, the L2ARC is using all of the space:
$ kstat |grep l2_size
l2_size 400065185280

l2_misses grows continuously as well.

If I try something like “time cp file_on_zfs /dev/null” it also works out to about 100MB/sec.

The disk usage is:
Filesystem                  size  used  avail  capacity  Mounted on
sunnas/local/sunnas/sunnas  37T   31T   6.5T   83%       /export/sunnas

What’s the problem? Where do I begin? I’m hoping someone can point me in the right direction.

Which leads to my 2nd question. I am moving data around, making a backup of it. Then I will remove the Sun Fire system and directly connect the disk shelf to my home system (i7-10700F, 32GB mem, nvme boot, etc.) running Fedora 37. Will this system suffice for openzfs? For the disks in the disk shelf, what is a good config for zfs such that I can get close to 40TB of disk plus 10Gb throughput?

Thanks in advance,
ers

mercenary_sysadmin · August 8, 2023, 3:43pm

Note that you won’t be able to import your Oracle ZFS pool into an OpenZFS system–you’ll need to wipe the old one, create a new one, and restore your data from backup.

As for the issues with file reads tending to go at around 100MiB/sec–you’ve only got two vdevs, both RAIDz on rust, so performance isn’t going to be stellar for a whole lot of workloads. How to improve them would depend strongly on the actual workload in question, which we don’t know anything about, so there isn’t much to say about that. Can you describe what kinds of read test you’re issuing to see those results?

erschand · August 8, 2023, 9:12pm

Hi, thanks for the reply.

I am currently backing it all up to other storage to do a restore after recreating the zfs. That’s when I noticed how slow the read was. It will eventually move from Solaris ZFS to a Linux based ZFS. [which is why I’m wondering how I should set up the “new” zfs array for performance and some redundancy benefit, before I restore]

The data is large video files. Terabytes of them, ranging from 5GB to 80GB each. The files are read and written to the storage array.

The most evident case is running “rsync --progress nfs:/from/zfs/ /to/filesystem”. It shows anywhere from 102MB/sec to 116MB/sec on the large files.

Going the other way, writing to zfs, “rsync --progress /from/filesystem nfs:/to/zfs” gives ~1GB/sec sustained.

iperf between the host and array gives 9.7Gb/sec both ways, it doesn’t matter which is the server or client.
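
For scale, here is a rough sketch comparing those rsync rates with the measured link speed. It assumes the MB/sec figures are decimal megabytes and treats ~1GB/sec as 1000MB/sec, so the percentages are approximate.

```python
LINK_GBIT_S = 9.7  # iperf result quoted above


def link_utilization(mbyte_per_s, link_gbit_s=LINK_GBIT_S):
    """Fraction of the link used by a transfer measured in MB/sec (decimal)."""
    gbit_per_s = mbyte_per_s * 8 / 1000
    return gbit_per_s / link_gbit_s


# Assumption: 116 MB/sec is the best observed read, 1000 MB/sec the sustained write.
for label, rate in [("read", 116), ("write", 1000)]:
    print(f"{label}: {rate} MB/sec = {rate * 8 / 1000:.2f} Gb/sec, "
          f"{link_utilization(rate):.0%} of the link")
```

Under those assumptions the reads use roughly a tenth of the link while the writes use most of it.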

This is a home system with one user, most traffic is in bursts (ie. write a whole file, or read a whole file)

I also tried setting up a hardware 7 disk raid5 array to copy to on the host. It also would only copy around 115MB/sec.

Having said all that, I have backed up all but a few terabytes of data. So I’m willing to let the Sun ZFS go. But I would still like to know a handy config for a 20 x 3TB array, 2 hot spares, 2 x 200GB log disks, for the data use case above?

Thanks!
ers

mercenary_sysadmin · August 8, 2023, 10:14pm

large video files. Terabytes of them, ranging from 5GB to 80GB each.

OK, then first thing is you want recordsize=1M, as long as you’re not editing those video files in-place (which is a pretty uncommon workload). That’s going to be true for this workload regardless of the rest of your decisions.

Next up, you want as many vdevs as you can manage within your other constraints. I’m seeing eighteen drives there, so that would likely mean three 6-wide RAIDz2, although if you have reliable backup, you could consider six 3-wide RAIDz1 instead–you’d only get single parity, but you’d get double the performance, for the same 67% storage efficiency.

You could also consider nine 2-wide mirrors, for even higher performance (particularly on reads). But I’m guessing you’re going to prefer higher storage efficiency instead.

A good general rule of thumb is to expect each individual vdev–no matter what size–in a well-configured pool to perform roughly as well as a single drive. There will be circumstances where they do much better than that, sure, but in general, expect the performance of a single drive per vdev. So you’d expect to be able to read something like 900MiB/sec from these eighteen drives configured in nine two-wide mirrors, 600MiB/sec from them configured in six three-wide Z1, or 300MiB/sec from them configured in three six-wide Z2.
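
As a minimal sketch of that rule of thumb, the estimates above can be reproduced with a few lines of arithmetic. The 100MiB/sec per-drive figure is an assumption, taken from what the numbers above imply (900MiB/sec across nine vdevs), and the `Layout` class is purely illustrative.

```python
from dataclasses import dataclass

# Assumption: ~100 MiB/sec per drive, as implied by the estimates above.
PER_DRIVE_MIB_S = 100


@dataclass
class Layout:
    name: str
    vdevs: int   # number of vdevs in the pool
    width: int   # drives per vdev
    parity: int  # redundant drives per vdev (1 for a 2-wide mirror)

    def read_estimate(self) -> int:
        # Rule of thumb: each vdev performs roughly like a single drive.
        return self.vdevs * PER_DRIVE_MIB_S

    def efficiency(self) -> float:
        # Fraction of raw capacity left after parity or mirroring.
        return (self.width - self.parity) / self.width


LAYOUTS = [
    Layout("nine 2-wide mirrors", vdevs=9, width=2, parity=1),
    Layout("six 3-wide RAIDz1", vdevs=6, width=3, parity=1),
    Layout("three 6-wide RAIDz2", vdevs=3, width=6, parity=2),
]

for layout in LAYOUTS:
    print(f"{layout.name}: ~{layout.read_estimate()} MiB/sec, "
          f"{layout.efficiency():.0%} storage efficiency")
```

This is only the rule of thumb in code form, not a benchmark; real results depend on the workload.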

Another rule of thumb: performance increases in multi-drive storage systems tend to show up most strongly in the ability to service more parallel requests, rather than the speed of an individual request on a lightly loaded or unloaded system.

mercenary_sysadmin · August 8, 2023, 10:20pm

Whew, I just noticed the size of those disks. 3TB?

This system is pulling down an insane amount of power (and generating heat and noise to go with) for not much in the way of capacity or performance, by comparison. It might be worth seriously considering downsizing in complexity and upsizing in capacity with a shift to more modern drives. 20TB Ironwolf Pro drives are running less than $400 apiece on Amazon right now.

Granted, three Ironwolf Pro aren’t likely to get you more than 150-200MiB/sec on single-process reads with recordsize=1M. But I have to warn you, massive arrays of small elderly drives get old fast, and I speak from experience here. It’s usually best to avoid the complexity once it’s no longer necessary.

Don’t get me wrong, I’m not trying to gatekeep–if this is a fun project just the way it is, and you’re aware of the complexity issues, charge right ahead and I’ll be happy to help! I just didn’t feel like I’d be doing you a favor not to at least mention the issue.

erschand · August 9, 2023, 1:02am

Okay, thanks for the tips. I will set the record size and such, that makes sense. I’ll probably do a 3 x 6 wide configuration and take the read hit. As long as it makes sense, which it didn’t until further reading and your help. I’m still a bit confused as to why it writes so fast but reads slow. I would have thought the same paths for reading and writing would give similar results. But I think I’m catching on.
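
A rough sketch of what that layout could look like on the OpenZFS side: the snippet below only builds and prints a `zpool create` command for three 6-wide RAIDz2 vdevs with recordsize=1M, it does not run it. The pool name `tank` and the disk names are hypothetical placeholders, and the two hot spares are an assumption based on the 20-drive shelf; log devices are left out.

```python
import shlex

# Hypothetical names for illustration only: substitute the real pool name and
# the actual device paths of the shelf's drives.
POOL = "tank"
DISKS = [f"disk{i:02d}" for i in range(18)]
SPARES = ["spare0", "spare1"]


def zpool_create_command(pool, disks, spares, vdev_width=6, vdev_type="raidz2"):
    """Build (but do not run) a zpool create command with equal-width vdevs."""
    if len(disks) % vdev_width != 0:
        raise ValueError("disk count must be a multiple of the vdev width")
    # -O sets a filesystem property on the pool's root dataset.
    cmd = ["zpool", "create", "-O", "recordsize=1M", pool]
    for start in range(0, len(disks), vdev_width):
        cmd.append(vdev_type)
        cmd.extend(disks[start:start + vdev_width])
    if spares:
        cmd.append("spare")
        cmd.extend(spares)
    return cmd


command = zpool_create_command(POOL, DISKS, SPARES)
print(shlex.join(command))
```

Printing the command first makes it easy to check the vdev grouping before anything touches the disks.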

As for why this old stuff? It was given to me 5 years ago by a company downsizing massively. So I took it, along with a stack of replacement parts and drives. The problem is that I’m not in the USA, so every US dollar costs me about 1.3 of mine. I priced out 6 x 10-20TB disks plus a 10Gb enclosure and it came out to $4000. Also, electricity here is cheap. 10 cents/kwh cheap. So suppose you tell me I’m using 200W of extra power over and above: that’s <$15/month against $4000…you can see where this goes.
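
The back-of-the-envelope math, as a small sketch using the figures above (200W, 10 cents/kWh, $4000). It assumes a 30-day month, the system running 24/7, and both prices being in the same currency.

```python
# Figures from the post: 200 W of extra draw, electricity at 10 cents/kWh,
# and a $4000 quote for replacement hardware.
EXTRA_WATTS = 200
PRICE_PER_KWH = 0.10
REPLACEMENT_COST = 4000
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, powered on 24/7


def monthly_power_cost(watts, price_per_kwh):
    """Cost of running a constant load for one month."""
    kwh = watts / 1000 * HOURS_PER_MONTH
    return kwh * price_per_kwh


monthly = monthly_power_cost(EXTRA_WATTS, PRICE_PER_KWH)
payback_months = REPLACEMENT_COST / monthly
print(f"extra power: ${monthly:.2f}/month")
print(f"payback: {payback_months:.0f} months (~{payback_months / 12:.0f} years)")
```

That works out to about $14.40 a month, so the new hardware would take roughly 23 years to pay for itself on power savings alone.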

So I just leave it alone, never touch it, never break it. Until now. And the truly critical data is stored in various other spots.