# Hi, ZFSers!
If you’re new to ZFS, you’re probably not entirely solid on how the structure of the system works. This post is here to help, complete with lovely ASCII diagrams (that don’t require loading other links, since Reddit can’t embed images without use of browser extensions).
In word form, it’s fairly simple: The top level structure is the zpool. A zpool consists of one or more vdevs. Each vdev consists of one or more disks.
Standard vdevs are where ZFS stores your files. Support vdevs (all of which are optional) may accelerate your zpool by offering pool-wide services which decrease latency and increase throughput in various ways.
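If it helps to see that hierarchy in command form, here’s a minimal sketch using the standard `zpool` CLI. The pool name `tank` and the /dev/sd? device paths are placeholders, not recommendations:

```sh
# One zpool ("tank") built from two vdevs; each vdev is a 2-way mirror of two disks.
# Pool name and device paths are placeholders -- substitute your own.
zpool create tank \
  mirror /dev/sda /dev/sdb \
  mirror /dev/sdc /dev/sdd

# Later, the pool can be grown by adding another whole vdev:
zpool add tank mirror /dev/sde /dev/sdf
```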
ASCII art time!
Before we go any further, let’s look at that promised ASCII diagram.
—————————————————————————————————————————————————————————————
| top level: the zpool. zpools are self-contained.          |
| A zpool contains zero or more support vdevs, which        |
| service the entire pool as a whole, and one or more       |
| standard vdevs, which contain the pool's data.            |
|                                                           |
| Each vdev consists of one or more disks, and can have     |
| one of several topologies: single disk, n-way mirror,     |
| RAIDz1/2/3 (stripe with 1, 2, or three parity blocks)     |
|                                                           |
| Support (optional) vdevs                                  |
| --------  ---------  ----------------------               |
| ! SLOG !  ! L2ARC !  ! special allocation !               |
| --------  ---------  ----------------------               |
|                                                           |
| Standard (main storage) vdevs                             |
| ————————————————————————  ------------------------        |
| | —————— ------ ------ |  ! —————— ------ ------ !        |
| | |disk| !disk! !disk! |  ! |disk| !disk! !disk! !        |
| | —————— ------ ------ |  ! —————— ------ ------ !        |
| ————————————————————————  ------------------------        |
| ------------------------  ------------------------        |
| ! —————— ------ ------ !  ! —————— ------ ------ !        |
| ! |disk| !disk! !disk! !  ! |disk| !disk! !disk! !        |
| ! —————— ------ ------ !  ! —————— ------ ------ !        |
| ------------------------  ------------------------        |
—————————————————————————————————————————————————————————————
Beautiful, right? A box enclosed by |———| is mandatory; one enclosed by !----! is optional. So we can see the visually reinforced concepts outlined above: a pool contains at least one storage vdev, which contains at least one disk. And we might or might not have support vdevs.
Note: the diagram only shows up to three disks in each vdev and four vdevs in its pool, but that’s a visual limitation of the diagram, not an actual ZFS limitation. There are no hard limits to disks-per-vdev or vdevs-per-pool; although there are some practical limitations in real life, they’re nowhere near this small.
Fault tolerance: zpool level
There is no parity or redundancy at the pool level; all fault tolerance is at the vdev level. If you lose any storage vdev (or if you lose a special allocation vdev), you lose the entire pool along with it.
Let’s repeat that one more time: if you lose a vdev, you lose the whole pool! Think about that before you decide to skimp on parity or redundancy in any single vdev.
The exceptions are L2ARC and SLOG vdevs. The pool will still be functional after the loss of either an L2ARC or a SLOG (although if a SLOG that still held uncommitted sync writes is lost along with a system crash, those writes are lost with it).
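As a side note, and only as a sketch with hypothetical device names: because the pool doesn’t depend on cache or log vdevs for its integrity, they can also be removed from a live pool at any time.

```sh
# Remove a cache (L2ARC) or log (SLOG) device from the pool "tank";
# the pool stays online and intact. Names here are placeholders --
# use the device name exactly as "zpool status" shows it.
zpool remove tank nvme0n1
zpool status tank    # verify the pool is still healthy afterwards
```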
NOTE: for the most part, writes are distributed among vdevs in proportion to the amount of free space each has available. This allows the pool to continue operating until all vdevs have become full, even when they are differently sized or unevenly balanced. There is some provision in newer ZFS versions for writing preferentially to lower-latency vdevs in some cases, but it should not be relied upon to materially alter the general write distribution behavior.
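If you want to see that distribution on your own pool, the per-vdev view is a one-liner (the pool name is a placeholder):

```sh
# Show capacity, allocated, and free space broken out per vdev;
# compare the FREE column across vdevs to see how balanced they are.
zpool list -v tank

# A similar per-vdev breakdown of live I/O, refreshed every 5 seconds:
zpool iostat -v tank 5
```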
Fault tolerance: vdev level
Any vdev, including the support vdev types, can be fault tolerant or non-fault-tolerant. It’s up to you, the ZFS admin, to make good choices here.
These are your vdev topologies (there’s a short command sketch for each of them just after this list):

- Single-disk vdev: no fault tolerance. Lose the disk, lose the vdev; if that `vdev` was storage or special allocation, lose the `zpool` with it.
- n-way mirror vdev: fault tolerance through redundancy. Each disk in a mirror `vdev` has the same data on it as every other disk in the mirror `vdev`. Usually seen in 2-way mirrors, but 3-way and up are entirely possible. The vdev survives as long as any single disk survives, but once it’s down to one disk only, it’s “uncovered”, meaning that any existing corrupt blocks are permanently irreparable, as are any blocks which become corrupted while the mirror is down to a single disk.
- RAIDzn vdev: all disks in the `vdev` are striped, with n blocks of parity per stripe, where n = 1, 2, or 3. If you lose n disks in this `vdev`, it becomes “uncovered”. Losing n+1 disks makes the `vdev` entirely inoperable.
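Here’s a quick, hedged sketch of what each topology looks like at pool creation time. These are three alternative commands (run at most one of them); `tank` and the /dev/sd? names are placeholders:

```sh
# Single-disk vdev: no fault tolerance at all.
zpool create tank /dev/sda

# 3-way mirror vdev: survives the loss of any two of its three disks.
zpool create tank mirror /dev/sda /dev/sdb /dev/sdc

# RAIDz2 vdev: six disks striped with two parity blocks per stripe;
# survives the loss of any two disks in the vdev.
zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
```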
Support vdev types
- L2ARC: the Layer 2 ARC is a simple read cache, which operates a step below the ARC (Adaptive Replacement Cache, which is in RAM). This `vdev` type is nowhere near as useful as many new users expect it to be: even a fast SSD is tremendously slower than RAM, and the `L2ARC` isn’t actually an ARC at all; it’s a simple ring buffer populated with blocks evicted from the ARC. While an L2ARC can be useful in certain circumstances, the odds are extremely high that it’s not useful in your circumstances. WARNING: indexing the `L2ARC` eats system RAM; an over-large L2ARC can end up costing you far more RAM (which then can’t be used for ARC, standard filesystem cache, or applications) than it’s worth. Caution is strongly advised here.
- SLOG: the Secondary LOg Device is not, as commonly misunderstood, really a write cache. It’s just a really fast place to store the `ZIL` (ZFS Intent Log). The `ZIL` is a place on non-volatile storage where ZFS can immediately dump a copy of any dirty data present when `sync()` is called, in order to honor `sync()`'s request to ensure data in flight is crash-safe before the system does anything else. This only affects synchronous writes, and the data in the `ZIL` is only read back after a system crash. In normal operation, sync writes are dumped immediately to the `ZIL`, then written out as normal, later, in `TXG`s (TransaXtion Groups) onto main storage along with async writes. Having a SLOG on fast SSD storage can dramatically increase throughput and decrease latency in a workload with many sync writes; it won’t have any effect at all on a workload that’s already asynchronous. NOTE: a `SLOG` does not need to be any larger than the amount of data which might flow through the system in `vfs.zfs.txg.timeout` seconds (5 seconds by default). But SSD endurance is measured in total drive writes per day, and a `SLOG` sees an enormous amount of writes on systems that need it, so you’ll want a bare minimum of 256GB for any non-Optane SLOG. (Optane can be smaller because it has massively higher write endurance than “normal” NAND flash SSDs.)
- special (allocation) vdev: the new `special` vdev class, only present in ZFS 0.8 and above, allows for storage of system metadata (and, optionally, very small writes) separately from main storage. This allows extremely latency-sensitive operations (file and directory metadata, small database operations, etc.) to be stored on the `special` type vdev, which should be constructed of very fast SSDs. WARNING: if you lose a `special` allocation `vdev`, you lose the whole pool with it, so we do not recommend using a single-disk topology here! (A short `zpool add` sketch for all three support vdev types follows this list.)
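Here’s a hedged sketch of adding each support vdev type to an existing pool. The pool name, device paths, and small-block cutoff are placeholders, and the commands assume a reasonably current OpenZFS:

```sh
# SLOG (log vdev). Mirroring it is cheap insurance.
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1

# L2ARC (cache vdev). Cache devices can't be mirrored, and don't need to be.
zpool add tank cache /dev/nvme2n1

# Special allocation vdev. Losing it loses the pool, so mirror it!
zpool add tank special mirror /dev/nvme3n1 /dev/nvme4n1

# Optionally route small blocks of a dataset to the special vdev
# (4K is an example cutoff, not a recommendation):
zfs set special_small_blocks=4K tank/mydataset

# The TXG timeout that bounds how much data a SLOG ever holds at once:
sysctl vfs.zfs.txg.timeout                      # FreeBSD
cat /sys/module/zfs/parameters/zfs_txg_timeout  # Linux
```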