# Hi, ZFSers!
If you’re new to ZFS, you’re probably not entirely solid on how the structure of the system works. This post is here to help, complete with lovely ASCII diagrams (that don’t require loading other links, since Reddit can’t embed images without use of browser extensions).
In word form, it’s fairly simple: The top level structure is the zpool. A zpool consists of one or more vdevs. Each vdev consists of one or more disks.
Standard vdevs are where ZFS stores your files. Support vdevs (all of which are optional) may accelerate your zpool by offering pool-wide services which decrease latency and increase throughput in various ways.
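If it helps to see that hierarchy in command form, here’s a minimal sketch using the standard `zpool` CLI. The pool name `tank` and the /dev/sd? device paths are placeholders, not recommendations:

```sh
# One zpool ("tank") built from two vdevs; each vdev is a 2-way mirror of two disks.
# Pool name and device paths are placeholders -- substitute your own.
zpool create tank \
  mirror /dev/sda /dev/sdb \
  mirror /dev/sdc /dev/sdd

# Later, the pool can be grown by adding another whole vdev:
zpool add tank mirror /dev/sde /dev/sdf
```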
ASCII art time!
Before we go any further, let’s look at that promised ASCII diagram.
—————————————————————————————————————————————————————————————
| top level: the zpool. zpools are self-contained.          |
| A zpool contains zero or more support vdevs, which        |
| service the entire pool as a whole, and one or more       |
| standard vdevs, which contain the pool's data.            |
|                                                           |
| Each vdev consists of one or more disks, and can have     |
| one of several topologies: single disk, n-way mirror,     |
| RAIDz1/2/3 (stripe with 1, 2, or three parity blocks)     |
|                                                           |
| Support (optional) vdevs                                  |
| --------  ---------  ----------------------               |
| ! SLOG !  ! L2ARC !  ! special allocation !               |
| --------  ---------  ----------------------               |
|                                                           |
| Standard (main storage) vdevs                             |
| ————————————————————————  ------------------------        |
| | —————— ------ ------ |  ! —————— ------ ------ !        |
| | |disk| !disk! !disk! |  ! |disk| !disk! !disk! !        |
| | —————— ------ ------ |  ! —————— ------ ------ !        |
| ————————————————————————  ------------------------        |
| ------------------------  ------------------------        |
| ! —————— ------ ------ !  ! —————— ------ ------ !        |
| ! |disk| !disk! !disk! !  ! |disk| !disk! !disk! !        |
| ! —————— ------ ------ !  ! —————— ------ ------ !        |
| ------------------------  ------------------------        |
—————————————————————————————————————————————————————————————
Beautiful, right? A box enclosed by |———| is mandatory; one enclosed by !----! is optional. So we can see the visually reinforced concepts outlined above: a pool contains at least one storage vdev, which contains at least one disk. And we might or might not have support vdevs.
Note: the diagram only shows up to three disks in each vdev and four vdevs in its pool, but that’s a visual limitation of the diagram, not an actual ZFS limitation. There are no hard limits to disks-per-vdev or vdevs-per-pool; although there are some practical limitations in real life, they’re nowhere near this small.
Fault tolerance: zpool level
There is no parity or redundancy at the pool level; all fault tolerance is at the vdev level. If you lose any storage vdev (or if you lose a special allocation vdev), you lose the entire pool along with it.
Let’s repeat that one more time: if you lose a vdev, you lose the whole pool! Think about that before you decide to skimp on parity or redundancy in any single vdev.
The exceptions are L2ARC and SLOG vdevs. The pool will still be functional after the loss of either an L2ARC or a SLOG (although if a SLOG that still held uncommitted sync writes is lost along with a system crash, those writes are lost with it).
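As a side note, and only as a sketch with hypothetical device names: because the pool doesn’t depend on cache or log vdevs for its integrity, they can also be removed from a live pool at any time.

```sh
# Remove a cache (L2ARC) or log (SLOG) device from the pool "tank";
# the pool stays online and intact. Names here are placeholders --
# use the device name exactly as "zpool status" shows it.
zpool remove tank nvme0n1
zpool status tank    # verify the pool is still healthy afterwards
```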
NOTE: for the most part, writes are distributed among vdevs in proportion to the amount of free space each has available. This allows the pool to continue operating until all vdevs have become full, even when they are differently sized or unevenly balanced. There is some provision in newer ZFS versions for writing preferentially to lower-latency vdevs in some cases, but it should not be relied upon to materially alter the general write distribution behavior.
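If you want to see that distribution on your own pool, the per-vdev view is a one-liner (the pool name is a placeholder):

```sh
# Show capacity, allocated, and free space broken out per vdev;
# compare the FREE column across vdevs to see how balanced they are.
zpool list -v tank

# A similar per-vdev breakdown of live I/O, refreshed every 5 seconds:
zpool iostat -v tank 5
```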
Fault tolerance: vdev level
Any vdev, including the support vdev types, can be fault tolerant or non-fault-tolerant. It’s up to you, the ZFS admin, to make good choices here.
These are your vdev topologies (there’s a short command sketch for each of them just after this list):

- Single-disk vdev: no fault tolerance. Lose the disk, lose the vdev; if that `vdev` was storage or special allocation, lose the `zpool` with it.
- n-way mirror vdev: fault tolerance through redundancy. Each disk in a mirror `vdev` has the same data on it as every other disk in the mirror `vdev`. Usually seen in 2-way mirrors, but 3-way and up are entirely possible. The vdev survives as long as any single disk survives, but once it’s down to one disk only, it’s “uncovered”, meaning that any existing corrupt blocks are permanently irreparable, as are any blocks which become corrupted while the mirror is down to a single disk.
- RAIDzn vdev: all disks in the `vdev` are striped, with n blocks of parity per stripe, where n = 1, 2, or 3. If you lose n disks in this `vdev`, it becomes “uncovered”. Losing n+1 disks makes the `vdev` entirely inoperable.
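Here’s a quick, hedged sketch of what each topology looks like at pool creation time. These are three alternative commands (run at most one of them); `tank` and the /dev/sd? names are placeholders:

```sh
# Single-disk vdev: no fault tolerance at all.
zpool create tank /dev/sda

# 3-way mirror vdev: survives the loss of any two of its three disks.
zpool create tank mirror /dev/sda /dev/sdb /dev/sdc

# RAIDz2 vdev: six disks striped with two parity blocks per stripe;
# survives the loss of any two disks in the vdev.
zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
```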
Support vdev types
- L2ARC: the Layer 2 ARC is a simple read cache, which operates a step below the ARC (Adaptive Replacement Cache, which is in RAM). This `vdev` type is nowhere near as useful as many new users expect it to be: even a fast SSD is tremendously slower than RAM, and the `L2ARC` isn’t actually an ARC at all; it’s a simple ring buffer populated with blocks evicted from the ARC. While an L2ARC can be useful in certain circumstances, the odds are extremely high that it’s not useful in your circumstances. WARNING: indexing the `L2ARC` eats system RAM; an over-large L2ARC can end up costing you far more RAM (which then can’t be used for ARC, standard filesystem cache, or applications) than it’s worth. Caution is strongly advised here.
- SLOG: the Secondary LOg Device is not, as commonly misunderstood, really a write cache. It’s just a really fast place to store the `ZIL` (ZFS Intent Log). The `ZIL` is a place on non-volatile storage where ZFS can immediately dump a copy of any dirty data present when `sync()` is called, in order to honor `sync()`'s request to ensure data in flight is crash-safe before the system does anything else. This only affects synchronous writes, and the data in the `ZIL` is only read back after a system crash. In normal operation, sync writes are dumped immediately to the `ZIL`, then written out as normal, later, in `TXG`s (TransaXtion Groups) onto main storage along with async writes. Having a SLOG on fast SSD storage can dramatically increase throughput and decrease latency in a workload with many sync writes; it won’t have any effect at all on a workload that’s already asynchronous. NOTE: a `SLOG` does not need to be any larger than the amount of data which might flow through the system in `vfs.zfs.txg.timeout` seconds (5 seconds by default). But SSD endurance is measured in total drive writes per day, and a `SLOG` sees an enormous amount of writes on systems that need it, so you’ll want a bare minimum of 256GB for any non-Optane SLOG. (Optane can be smaller because it has massively higher write endurance than “normal” NAND flash SSDs.)
- special (allocation) vdev: the new `special` vdev class, only present in ZFS 0.8 and above, allows for storage of system metadata (and, optionally, very small writes) separately from main storage. This allows extremely latency-sensitive operations (file and directory metadata, small database operations, etc.) to be stored on the `special` type vdev, which should be constructed of very fast SSDs. WARNING: if you lose a `special` allocation `vdev`, you lose the whole pool with it, so we do not recommend using a single-disk topology here! (A short `zpool add` sketch for all three support vdev types follows this list.)
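Here’s a hedged sketch of adding each support vdev type to an existing pool. The pool name, device paths, and small-block cutoff are placeholders, and the commands assume a reasonably current OpenZFS:

```sh
# SLOG (log vdev). Mirroring it is cheap insurance.
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1

# L2ARC (cache vdev). Cache devices can't be mirrored, and don't need to be.
zpool add tank cache /dev/nvme2n1

# Special allocation vdev. Losing it loses the pool, so mirror it!
zpool add tank special mirror /dev/nvme3n1 /dev/nvme4n1

# Optionally route small blocks of a dataset to the special vdev
# (4K is an example cutoff, not a recommendation):
zfs set special_small_blocks=4K tank/mydataset

# The TXG timeout that bounds how much data a SLOG ever holds at once:
sysctl vfs.zfs.txg.timeout                      # FreeBSD
cat /sys/module/zfs/parameters/zfs_txg_timeout  # Linux
```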