What status fields should I be monitoring?

I’ve recently set up some monitoring using zpool status -j. Is it good enough to check that the following fields match their expected values?

.pools.<pool>.state
.pools.<pool>.vdevs.<pool>.read_errors
.pools.<pool>.vdevs.<pool>.write_errors
.pools.<pool>.vdevs.<pool>.checksum_errors

That’s all that Sanoid monitors from zpool status, so I certainly hope it’s enough! :)

It depends on whether you are just monitoring for issues, or if you want metrics too.

Re: the per-vdev numbers: it may be valuable to monitor more than just the root vdev (the pool name), and instead get a heads up if any disk is having read/write/checksum errors.
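For what it's worth, here is a rough Python sketch of that idea: walk every vdev in the parsed zpool status -j output rather than only the root one. It assumes child vdevs sit under a nested "vdevs" key at each level (the same shape as the root vdev in the paths above) and that the error counts may come back as strings or integers. The helper names (load_status, walk_vdevs, find_problems) are made up for this example, so check the layout against your own output before relying on it.

```python
import json
import subprocess

ERROR_FIELDS = ("read_errors", "write_errors", "checksum_errors")


def load_status():
    """Run `zpool status -j` and return the parsed JSON (needs a ZFS release with -j)."""
    out = subprocess.run(
        ["zpool", "status", "-j"], check=True, capture_output=True, text=True
    ).stdout
    return json.loads(out)


def walk_vdevs(vdevs, path=()):
    """Yield (path, vdev) for every vdev, descending into nested "vdevs" objects."""
    for name, vdev in (vdevs or {}).items():
        here = path + (name,)
        yield here, vdev
        # Assumption: children live under a nested "vdevs" key, like the root vdev.
        yield from walk_vdevs(vdev.get("vdevs"), here)


def find_problems(status):
    """Return human-readable problems for every pool and vdev in the status dict."""
    problems = []
    for pool_name, pool in status.get("pools", {}).items():
        if pool.get("state") != "ONLINE":
            problems.append(f"{pool_name}: pool state is {pool.get('state')}")
        for path, vdev in walk_vdevs(pool.get("vdevs")):
            label = "/".join(path)
            for field in ERROR_FIELDS:
                # Counts may arrive as strings ("0") or ints; compare as text.
                value = str(vdev.get(field, "0"))
                if value != "0":
                    problems.append(f"{label}: {field}={value}")
    return problems
```

Calling find_problems(load_status()) on the host gives a list you can alert on whenever it is non-empty. Comparing the counts as text is deliberate, so a value that is not a plain integer still registers as non-zero.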

With the per-vdev properties, you can also get stuff like the read/write rates per second per disk, to understand how busy the system is, and if the load has changed significantly.

For the “if any disk is having read/write/checksum errors” case, does .pools.<pool>.state become DEGRADED? I’m checking that the state equals ONLINE, and I’ve set up alerting for when it’s not ONLINE.

Also, do pools become DEGRADED if there are slow IOs?

That’s a good point, and I may not have read closely enough: Sanoid is actually looking for read / write / cksum errors ANYWHERE in the output of zpool status: at the pool level, the vdev level, and/or the single-disk level.