Buffer size per dataset?

I have a dataset which has near-constant low IOPS writes (50 IOPS/disk right now). If I am willing to forego some safety in this dataset, can I force it to buffer a certain amount of data/IOPS before committing to disk?

I want fewer, bigger writes to my disk, but I don’t want to impact any other filesystem on the same pool if possible.

More information: This dataset is used for Storj, so I am OK with a little bit of data loss during power outage, since the network is self-healing and redundant.

The dataset is mounted as NFS and then passed into a Docker container.

VM’s /etc/fstab:
cleteserver.home:/tank/storj /mnt/storj nfs rw,hard 0 0

Output of zfs get all tank/storj:

NAME        PROPERTY              VALUE                                SOURCE
tank/storj  type                  filesystem                           -
tank/storj  creation              Thu Sep 19 20:06 2024                -
tank/storj  used                  32.8G                                -
tank/storj  available             26.7T                                -
tank/storj  referenced            32.8G                                -
tank/storj  compressratio         1.10x                                -
tank/storj  mounted               yes                                  -
tank/storj  quota                 none                                 default
tank/storj  reservation           none                                 default
tank/storj  recordsize            128K                                 default
tank/storj  mountpoint            /tank/storj                          default
tank/storj  sharenfs              no_root_squash,rw=@192.168.1.233/32  local
tank/storj  checksum              on                                   default
tank/storj  compression           on                                   inherited from tank
tank/storj  atime                 off                                  inherited from tank
tank/storj  devices               on                                   default
tank/storj  exec                  on                                   default
tank/storj  setuid                on                                   default
tank/storj  readonly              off                                  default
tank/storj  zoned                 off                                  default
tank/storj  snapdir               hidden                               default
tank/storj  aclmode               discard                              default
tank/storj  aclinherit            restricted                           default
tank/storj  createtxg             15647936                             -
tank/storj  canmount              on                                   default
tank/storj  xattr                 sa                                   inherited from tank
tank/storj  copies                1                                    default
tank/storj  version               5                                    -
tank/storj  utf8only              off                                  -
tank/storj  normalization         none                                 -
tank/storj  casesensitivity       sensitive                            -
tank/storj  vscan                 off                                  default
tank/storj  nbmand                off                                  default
tank/storj  sharesmb              off                                  inherited from tank
tank/storj  refquota              none                                 default
tank/storj  refreservation        none                                 default
tank/storj  guid                  15862464792787672646                 -
tank/storj  primarycache          all                                  default
tank/storj  secondarycache        all                                  default
tank/storj  usedbysnapshots       0B                                   -
tank/storj  usedbydataset         32.8G                                -
tank/storj  usedbychildren        0B                                   -
tank/storj  usedbyrefreservation  0B                                   -
tank/storj  logbias               latency                              default
tank/storj  objsetid              460959                               -
tank/storj  dedup                 off                                  inherited from tank
tank/storj  mlslabel              none                                 default
tank/storj  sync                  standard                             default
tank/storj  dnodesize             legacy                               default
tank/storj  refcompressratio      1.10x                                -
tank/storj  written               32.8G                                -
tank/storj  logicalused           34.8G                                -
tank/storj  logicalreferenced     34.8G                                -
tank/storj  volmode               default                              default
tank/storj  filesystem_limit      none                                 default
tank/storj  snapshot_limit        none                                 default
tank/storj  filesystem_count      none                                 default
tank/storj  snapshot_count        none                                 default
tank/storj  snapdev               hidden                               default
tank/storj  acltype               off                                  default
tank/storj  context               none                                 default
tank/storj  fscontext             none                                 default
tank/storj  defcontext            none                                 default
tank/storj  rootcontext           none                                 default
tank/storj  relatime              on                                   default
tank/storj  redundant_metadata    all                                  default
tank/storj  overlay               on                                   default
tank/storj  encryption            off                                  default
tank/storj  keylocation           none                                 default
tank/storj  keyformat             none                                 default
tank/storj  pbkdf2iters           0                                    default
tank/storj  special_small_blocks  0                                    default
tank/storj  prefetch              all                                  default

This question made me think of zfs transaction groups, but I think that’s a pool-wide setting. I think the default groups writes into 5-second chunks (unless that’s specifically a FreeBSD behavior—I read about this in Michael Lucas and Allan Jude’s book).

> I have a dataset which has near-constant low IOPS writes (50 IOPS/disk right now). If I am willing to forego some safety in this dataset, can I force it to buffer a certain amount of data/IOPS before committing to disk?

Not directly. What you’re asking for specifically is an adjustment to the transaction group commit interval (the zfs_txg_timeout tunable on Linux OpenZFS), but that’s system-wide; it can’t even be restricted per pool, let alone per dataset.

What you can do is increase the recordsize on that store, which will force it to commit writes much more efficiently, offering you essentially exactly the benefit you’re looking for (and, arguably, more).

The only downside is that if that dataset ever becomes extremely busy with small-block random access inside larger files, you could end up with some read amplification. But it doesn’t sound like that’s likely to be a concern here.
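For reference, the recordsize change is a one-liner. This sketch uses the dataset name from the thread; 1M is a common choice for large, mostly-sequential files:

```shell
# Raise the maximum block size for files on this dataset.
# Note: only affects files written AFTER this change.
zfs set recordsize=1M tank/storj

# Confirm the new value took effect
zfs get recordsize tank/storj
```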


Here’s what I’ve done so far, and it has reduced the write load some:

  • Set 128MB buffer in Storj config file
  • Set sync=disabled on tank/storj

Now I just see the disk writing once per 5 seconds rather than constantly.
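The sync=disabled step above, as a command (dataset name from the thread; be aware this drops the guarantee that fsync’d data survives a crash or power loss):

```shell
# Treat all writes as asynchronous: fsync()/O_SYNC return immediately,
# and data is only persisted at the next transaction group commit.
zfs set sync=disabled tank/storj

# Verify
zfs get sync tank/storj
```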

I attempted to bump up recordsize, but that had no meaningful effect. I left it at 1M, but zpool iostat -r shows very few writes over 64K.

I did try echo 60 > /sys/module/zfs/parameters/zfs_txg_timeout, which does result in fewer writes, but I’m wary of applying it to the entire system. I have a UPS, but it has no way to tell my server when it’s about to run out of juice, and I fear that 60s may mean more data loss in a power outage. Those are rare where I live, but still possible.
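One thing worth knowing: writing to /sys only lasts until reboot. To make the setting persistent on Linux, the usual approach is a module options file (60 here matches the value tried above):

```shell
# Runtime change, lost on reboot:
echo 60 > /sys/module/zfs/parameters/zfs_txg_timeout

# Persistent: set the module parameter at zfs module load time.
echo "options zfs zfs_txg_timeout=60" >> /etc/modprobe.d/zfs.conf
```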

I may actually keep the 60s transaction commit interval. The server is in my office, where I spend all day M-F, and I can hear the seeking; I kind of like the quiet. I assume losing 60s of async data isn’t all bad? Most of my storage is for backups.

I don’t know anything about “Storj” (never heard of it before just now), but if it rewrites already-existing files, changing recordsize will have no effect without forcibly rewriting all the existing files in full by a brute copy operation (meaning not ZFS replication; you have to actually use cp or a similar tool).

Changing the recordsize property on a dataset only affects new files committed to that dataset after the property is set, and has no effect on existing files, including existing files which are later modified. Only new files. :slight_smile:
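A minimal sketch of the brute-copy rewrite described above (the demo directory is illustrative; point the loop at your real dataset path, and stop the writer first so files aren’t modified mid-copy):

```shell
# After raising recordsize, existing files must be rewritten in full
# for their blocks to use the new size. Copy each file to a temporary
# name, then replace the original.
dir=$(mktemp -d)                 # stand-in for /tank/storj in this demo
echo "example data" > "$dir/file1"

for f in "$dir"/*; do
    cp "$f" "$f.tmp" && mv "$f.tmp" "$f"
done
```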


I know that most of the activity is brand-new writes, which is why I could say the recordsize change wasn’t making a big difference. I will investigate average file size later.

I set recordsize back to 128K for now. I may do a more extensive analysis later. For now the 60s transaction interval is helping a lot; the system seems to do much less writing. Have you ever increased yours?

Yep. For that matter, the default used to be thirty seconds, not five. You get higher maximum throughput, especially under heavy load, at the expense of increased latency, also especially under heavy load.

It’s not something I’d be happy with on a machine with a desktop interface on it, but it can be fine on a machine with a more dedicated, less latency-sensitive workload.


Storj usually issues only large writes to its segment files, so recordsize=1M is good. However, it also stores some SQLite databases. I would recommend moving those to a separate dataset or, ideally, SSD storage; the Storj community provides instructions on how to do this. I would also re-enable sync writes.
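A sketch of that split, assuming Storj lets you point its databases at a separate directory (the dataset name and 16K recordsize here are illustrative choices for SQLite’s small-page I/O; see the Storj community instructions for the exact node config):

```shell
# Create a separate dataset tuned for small random database writes,
# with sync left at the safe default.
zfs create -o recordsize=16K -o sync=standard tank/storj-db

# Once the databases are moved, re-enable sync writes on the main
# dataset as recommended above.
zfs set sync=standard tank/storj
```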
