# Hi, ZFSers!
If you’re new to ZFS, you’re probably not entirely solid on how the structure of the system works. This post is here to help, complete with lovely ASCII diagrams (that don’t require loading other links, since Reddit can’t embed images without use of browser extensions).
In word form, it's fairly simple: The top level structure is the `zpool`. A `zpool` consists of one or more `vdev`s. Each `vdev` consists of one or more disks.

Standard `vdev`s are where ZFS stores your files. Support `vdev`s (all of which are optional) may accelerate your `zpool` by offering pool-wide services which decrease latency and increase throughput in various ways.
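If you'd like to see that structure in command form, here's a minimal sketch, assuming a hypothetical pool named `tank` and placeholder `/dev/sd*` device names, built from two 2-way mirror vdevs:

```
# Create a pool named "tank" from two mirror vdevs (four disks total).
zpool create tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd

# Display the resulting hierarchy: pool -> vdevs -> disks.
zpool status tank
```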
## ASCII art time!
Before we go any further, let’s look at that promised ASCII diagram.
—————————————————————————————————————————————————————————
| top level: the zpool. zpools are self-contained. |
| A zpool contains zero or more support vdevs, which |
| service the entire pool as a whole, and one or more |
| standard vdevs, which contain the pool's data. |
| |
| Each vdev consists of one or more disks, and can have |
| one of several topologies: single disk, n-way mirror, |
| RAIDz1/2/3 (stripe with 1, 2, or three parity blocks) |
| |
| Support (optional) vdevs |
| -------- --------- ---------------------- |
| ! SLOG ! ! L2ARC ! ! special allocation ! |
| -------- --------- ---------------------! |
| |
| Standard (main storage) vdevs |
| ———————————————————————— ------------------------ |
| | —————— ------ ------ | ! —————— ------ ------ ! |
| | |disk| !disk! !disk! | ! |disk| !disk! !disk! ! |
| | —————— ------ ------ | ! —————— ------ ------ ! |
| ———————————————————————— ------------------------ |
| ------------------------ ------------------------ |
| ! —————— ------ ------ ! ! —————— ------ ------ ! |
| ! |disk| !disk! !disk! ! ! |disk| !disk! !disk! ! |
| ! —————— ------ ------ ! ! —————— ------ ------ ! |
| ------------------------ ------------------------ |
—————————————————————————————————————————————————————————
Beautiful, right? A box enclosed by `|———|` is mandatory; one enclosed by `!----!` is optional. So the diagram visually reinforces the concepts outlined above: a pool contains at least one storage vdev, which contains at least one disk. And we might or might not have support vdevs.
Note: the diagram only shows up to three disks in each `vdev` and four standard `vdev`s in its pool, but that's a visual limitation of the diagram, not an actual ZFS limitation. There are no hard limits on disks-per-`vdev` or `vdev`s-per-pool; although there are some practical limitations in real life, they're nowhere near this small.
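If you ever hit the ceiling of an existing pool, widening it is just a matter of adding another storage vdev. A hedged one-liner, reusing the hypothetical `tank` pool and placeholder device names from above:

```
# Add a third mirror vdev to "tank"; usable capacity grows immediately.
zpool add tank mirror /dev/sde /dev/sdf
```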
## Fault tolerance: zpool level
There is no parity or redundancy at the pool level; all fault tolerance is at the `vdev` level. If you lose any storage `vdev` (or if you lose a special allocation `vdev`), you lose the entire pool along with it.
Let’s repeat that one more time: if you lose a vdev, you lose the whole pool! Think about that before you decide to skimp on parity or redundancy in any single vdev.
The exceptions are `L2ARC` and `SLOG` vdevs. The pool will still be functional after the loss of either `L2ARC` or `SLOG` (although if the `SLOG` contained data, and was lost after a system crash, the data in the `SLOG` will be lost).
NOTE: for the most part, writes are distributed among `vdev`s in proportion to the amount of free space each has available. This allows the pool to continue operating until all `vdev`s have become full, even when they are differently sized or unevenly balanced. There is some provision in newer ZFS versions for writing preferentially to lower-latency `vdev`s in some cases, but it should not be relied upon to materially alter the general write distribution behavior.
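To watch that proportional distribution in practice, you can inspect per-vdev space usage; again a sketch using the hypothetical `tank` pool:

```
# Show capacity, allocated, and free space broken out per vdev;
# free space per vdev is what steers ZFS's write distribution.
zpool list -v tank
```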
## Fault tolerance: vdev level
Any `vdev`—including the support `vdev` types—can be fault tolerant or non-fault-tolerant. It's up to you, the ZFS admin, to make good choices here.

These are your `vdev` topologies (a short command sketch for each follows the list):
- Single disk vdev—no fault tolerance. Lose the disk, lose the `vdev`; if that `vdev` was storage or special allocation, lose the `zpool` with it.
- n-way mirror vdev—fault tolerance through redundancy. Each disk in a mirror `vdev` has the same data on it as every other disk in the mirror `vdev`. Usually seen in 2-way mirrors, but 3-way and up are entirely possible. The vdev survives as long as any single disk survives—but once it's down to one disk only, it's “uncovered”, meaning that any existing corrupt blocks are permanently irreparable, as are any blocks which become corrupted while the mirror is down to a single disk.
- RAIDzn vdev—all disks in the `vdev` are striped, with n blocks of parity per stripe, where n=1, 2, or 3. If you lose n disks in this `vdev`, it becomes “uncovered”. Losing n+1 disks makes the `vdev` entirely inoperable.
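Here's a hedged sketch of how each topology looks at creation time. Each command builds a separate, independent example pool; the pool and device names are placeholders, not recommendations:

```
# Single disk vdev: no fault tolerance at all.
zpool create scratch /dev/sda

# 2-way mirror vdev: survives the loss of any one disk.
zpool create mirrorpool mirror /dev/sdb /dev/sdc

# RAIDz2 vdev: six disks striped with two parity blocks per stripe,
# so it survives the loss of any two disks.
zpool create bigtank raidz2 /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi

# An existing single-disk vdev can be converted to a mirror later by
# attaching another disk to it.
zpool attach scratch /dev/sda /dev/sdj
```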
## Support vdev types
- L2ARC: The Layer 2 ARC is a simple read cache, which operates a step below the ARC (Adaptive Replacement Cache, which is in RAM). This `vdev` type is nowhere near as useful as many new users expect it to be—even a fast SSD is tremendously slower than RAM, and the `L2ARC` isn't actually an ARC at all; it's a simple ring buffer populated with blocks evicted from the ARC. While an L2ARC can be useful in certain circumstances, the odds are extremely high that it's not useful in your circumstances. WARNING: indexing the `L2ARC` eats system RAM; an over-large L2ARC can end up costing you far more RAM (which then can't be used for ARC, standard filesystem cache, or applications) than it's worth. Caution is strongly advised here.
- SLOG—the Secondary LOg device is not, as commonly misunderstood, really a write cache. It's just a really fast place to store the `ZIL` (ZFS Intent Log). The `ZIL` is a place on non-volatile storage where ZFS can immediately dump a copy of any dirty data present when `sync()` is called, in order to honor `sync()`'s request to ensure data in flight is crash-safe before the system does anything else. This only affects synchronous writes, and the data in the `ZIL` is only read after a system crash. In normal operation, sync writes are dumped immediately into the `ZIL`, then written out as normal, later, in `TXG`s (Transaction Groups) onto main storage along with async writes. Having a SLOG on fast SSD storage can dramatically increase throughput and decrease latency in a workload with many sync writes; it won't have any effect at all on a workload that's already asynchronous. NOTE: a `SLOG` does not need to be any larger than the amount of data which might flow through the system for `vfs.zfs.txg.timeout` seconds—5 seconds by default. But SSD endurance is measured in total drive writes per day, and a `SLOG` sees an enormous amount of writes on systems that need it, so you'll want a bare minimum of 256GB for any non-Optane SLOG. (Optane can be smaller because it has massively higher write endurance than “normal” NAND flash SSDs.)
- special (allocation) vdev—the new `special` `vdev` class, only present in ZFS 0.8 and above, allows for storage of system metadata (and, optionally, very small writes) separately from main storage. This allows extremely latency-sensitive operations (file and directory metadata, small database operations, etc.) to be stored on the `special` vdev, which should be constructed of very fast SSDs. WARNING: if you lose a `special` allocation `vdev`, you lose the whole pool with it—we do not recommend using a single disk topology here! (A command sketch for adding each of these support types to an existing pool follows this list.)
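To tie the three support types back to commands, here's a sketch assuming the hypothetical `tank` pool from earlier; the device names are placeholders, and the `special_small_blocks` value is purely illustrative:

```
# SLOG: add a fast, high-endurance device (mirrored here) as a dedicated log vdev.
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1

# L2ARC: add a cache device; remember that indexing it costs ARC RAM.
zpool add tank cache /dev/nvme2n1

# special allocation vdev: metadata (and optionally small blocks) live here.
# Never a single disk: losing this vdev loses the pool.
zpool add tank special mirror /dev/nvme3n1 /dev/nvme4n1

# Optionally route very small records to the special vdev as well.
zfs set special_small_blocks=4K tank

# Check the TXG timeout mentioned above (FreeBSD sysctl shown; on Linux it's
# the zfs_txg_timeout module parameter).
sysctl vfs.zfs.txg.timeout
```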