Questions re: Optimizing SCRUB and SMART Test Schedules for HDD and NVMe SSD Pools (Home/Small Office)

Hello,

I have two pools:

  1. HDD Pool: 4x Mirror VDEVs with 14 TB enterprise HDDs; and
  2. SSD Pool: 1x Mirror VDEV with 2x 4 TB NVMe SSDs.

I’m trying to optimize my SMART test and SCRUB schedules.
Currently, they look like this:

  1. SCRUB (per pool): Sunday at midnight, every 35 days.
  2. LONG (per pool): Once a week, Wednesday, at 7 PM.
  3. SHORT (per pool): Daily at 4 PM.
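
(For reference, if the SMART half of that schedule were written directly as smartd directives instead of through the TrueNAS GUI, it would look roughly like the sketch below; the device name is a placeholder, and TrueNAS normally generates this configuration itself.)

```
# Sketch of an /etc/smartd.conf equivalent -- TrueNAS manages this via the GUI.
# /dev/sda is a placeholder; one line per disk.
#   S/../.././16  -> SHORT test every day at 16:00
#   L/../../3/19  -> LONG test every Wednesday (day-of-week 3) at 19:00
/dev/sda -a -s (S/../.././16|L/../../3/19)
```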

Some questions:

  1. Given the size of my disks, do I have the SCRUB set far enough apart from the LONG HDD tests to minimize the possibility of them running at the same time? I guessed at this scheduling, to be honest.
  2. I’ve seen suggestions that it’s sufficient for home/small office use to do the LONG HDD tests once a month instead of once a week, especially since ZFS adds another layer of health checks. Good idea/bad idea?
  3. I’ve also seen suggestions that running a LONG test on an NVMe isn’t really worth it. I’m (vaguely) assuming that if they’re going to have problems, a SHORT test combined with ZFS’s health checks is sufficient to detect them. (Also, every NVMe vendor seems to implement SMART in its own special way, so maybe that has something to do with it.)

I’d really appreciate some advice so I could settle on a strategy and stop thinking about it. :stuck_out_tongue: Thanks!

When I wish to sequence operations, I put them in a script so that when one finishes, the next one runs. In your situation I’d use the -w (wait) flag for zpool scrub, which makes the command block until the scrub completes, and then trigger the SMART self-test.
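
Roughly like this, as a sketch; the pool name and device list are placeholders for your own:

```sh
#!/bin/sh
# Sketch: scrub the pool, wait for it to finish (-w), then kick off
# long SMART self-tests on the member disks.
# "tank" and the /dev/sdX names are placeholders.
set -e

zpool scrub -w tank

for dev in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
    smartctl -t long "$dev"
done
```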

Does this answer the question you are asking?


That’s helpful, for sure. :slight_smile:

But at this point I’m more focused on the scheduling: in particular, how often I should run SHORT and LONG SMART tests for HDDs and NVMe SSDs.

Right now, I’m doing weekly LONG tests on the HDDs, and I wonder if that’s really necessary or if I could go monthly.

I watch SMART stats and almost never run the tests. When I purchase a used HDD, I’ll run the SMART tests before running badblocks. I don’t think I’ve ever run SMART tests on an NVMe or SATA SSD.
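
For what it’s worth, that burn-in is roughly the sketch below; /dev/sdX is a placeholder, and badblocks -w destroys everything on the disk:

```sh
# Rough burn-in sketch for a newly acquired used HDD. /dev/sdX is a
# placeholder. badblocks -w is DESTRUCTIVE -- only run it on an empty disk.
smartctl -t short /dev/sdX     # quick self-test first; wait for it to finish
smartctl -a /dev/sdX           # check the result before going further
smartctl -t long /dev/sdX      # full surface self-test (can take many hours)
# ...wait again, then review the self-test log:
smartctl -l selftest /dev/sdX
# Four-pass destructive write/read test of every block:
badblocks -wsv /dev/sdX
```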


About once every fifth Tuesday of Nevruary, in my personal opinion. I don’t put a lot of faith in SMART tests; over the course of my IT career, I’ve seen maybe five or six drives fail SMART tests, versus maybe two or three hundred drives that actually failed.

Oof. That would certainly kill my confidence.

Due to price-per-TB considerations, I’ve been buying used enterprise SATA SSDs and SATA HDDs for the last several years, from enterprise liquidators with limited warranties to cover stuff that just dies immediately.

In the last five years, I think I’ve had 3 disks (2 new, 1 used enterprise) fail SMART tests out of a couple dozen that I bought. They never stopped working, but I did use the SMART results as grounds to get warranty replacements. This was on a system that didn’t have ZFS.

I assume that ZFS error-checking will pick up any actual failing drives even with no SMART tests? TrueNAS ships with no SMART testing enabled, so ZFS must be expected to do something that picks up on a dying drive before it tanks. What sort of things actually get zpool status to throw a Drive Degraded warning?

zpool status shows you read I/O errors, write I/O errors, and blocks whose data doesn’t match their checksums (the CKSUM column). IMO that’s pretty much all you need.
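
If you want to check that from the command line or a script, a minimal sketch; the pool name is a placeholder:

```sh
# Print only pools that have problems; prints "all pools are healthy"
# when every pool is clean.
zpool status -x

# Per-device READ/WRITE/CKSUM error counters, plus any files affected
# by permanent errors. "tank" is a placeholder pool name.
zpool status -v tank
```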

This is the best way to use SMART. If, and I repeat if, a disk you just acquired fails SMART testing, well, definitely don’t throw it in your pool, and definitely do use that as grounds to have it replaced. But I don’t see much value in continued SMART testing once the drive is in use, given how many drives fail IRL without ever failing SMART.

Good to know. Now that you’ve said that, I’m guessing those statistics get updated in real time with every I/O operation, so problems should become apparent fairly quickly?

(Assuming TrueNAS actually gives a notification for “zpool status is mad.”)

Correct, zpool status will show the result INSTANTLY if an error is encountered.

I have no idea how well TrueNAS monitors anything. For me, I use sanoid --monitor-health.
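
A minimal cron sketch of that kind of check, assuming sanoid is installed at /usr/sbin/sanoid and that --monitor-health follows the usual Nagios plugin convention (status line on stdout, non-zero exit when a pool is unhealthy); the mail address is a placeholder:

```sh
# /etc/cron.d/zfs-health -- sketch, not a drop-in config.
# Cron mails any output, so only emit the status line on failure.
MAILTO=admin@example.com
*/15 * * * * root out="$(/usr/sbin/sanoid --monitor-health)" || echo "$out"
```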


I’ll follow up on the TrueNAS side of things and see what it’s doing for zpool-based notifications.

I don’t exactly want to simulate some sort of pool degradation to test it. :wink:

Where’s the fun in that? :grinning_face:

When I want to see how something works, I create a test setup using disk files as VDEVs. You can see some of what I’ve done at https://github.com/HankB/Fun-with-ZFS. It might be tough to simulate I/O errors, but I think you could use something like dd to corrupt something on disk (if you can figure out where it’s stored) to see how ZFS responds.
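
Along those lines, here’s a minimal sketch of a throwaway file-backed mirror you can damage on purpose; the pool name and paths are placeholders, and the corrupting dd line must only ever point at the backing files, never a real disk:

```sh
# Build a throwaway mirror from sparse files, write some data, then
# scribble over part of one side and let a scrub find/repair it.
# Pool name and paths are placeholders. Run as root.
truncate -s 1G /tmp/vdev1 /tmp/vdev2
zpool create testpool mirror /tmp/vdev1 /tmp/vdev2
dd if=/dev/urandom of=/testpool/junk bs=1M count=256
sync

# Corrupt 16 MiB of one backing file, offset away from the ZFS labels
# at the start and end of the device.
dd if=/dev/urandom of=/tmp/vdev1 bs=1M count=16 seek=64 conv=notrunc

zpool scrub -w testpool
zpool status -v testpool   # CKSUM errors appear if allocated blocks were hit

zpool destroy testpool
rm /tmp/vdev1 /tmp/vdev2
```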

For some definition of “fun.” :winking_face_with_tongue:

I have one storage server at my disposal, with irreplaceable data.

Is it backed up? Yes.
Am I inclined to experiment on my production storage server? Not so much. :wink:

(Setting up a test TrueNAS environment is on my list, but it’s pretty low priority. I don’t have the budget or space to do it the way I want at the moment.)