I didn’t want to hijack the topic where this post appeared, so I’m starting a new one.
Like many on here, I use Seagate HDDs, and my main server is currently running only used disks, not recertified or new. All were purchased at the same time, from the same supposedly reputable vendor. I’ve been messing around with SMART tools and tests on my disks, trying to judge which errors are critical and which (if any) of the disks need replacing.
According to the calculator linked in the post above (found here to reduce your clicking), despite what the raw SMART values seem to show, I don’t actually have any errors occurring on any of my disks. The caveat is that I hope I’m inputting the correct numbers into the calculator.
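For anyone curious, here’s roughly the math I believe that calculator is doing. This is only my guess at how the raw values are packed (an error count in the upper bits, an operation count in the lower 32 bits); I haven’t confirmed it against anything official, so treat the sketch below accordingly.

```python
# Sketch of how I *think* the calculator decodes Seagate-style raw SMART values
# (e.g. Seek_Error_Rate). Assumption, not vendor documentation: the 48-bit raw
# value packs an error count in the upper 16 bits and an operation count in the
# lower 32 bits.

def decode_seagate_raw(raw_value: int) -> tuple[int, int]:
    """Split a raw SMART value into (errors, operations)."""
    errors = raw_value >> 32             # upper 16 bits
    operations = raw_value & 0xFFFFFFFF  # lower 32 bits
    return errors, operations

# A large, scary-looking raw value can still mean zero actual errors:
errors, ops = decode_seagate_raw(123_456_789)
print(errors, ops)  # 0 123456789 -- anything below 2**32 decodes to zero errors
```

If that’s really what it’s doing, it would explain why a huge raw number can still work out to zero errors.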
Should I worry about accumulated power on time?
Two of my 9 disks have nearly 53,000 hours
Two have ~12,000 hours
The other 5 disks all have just over 33,000 hours
Should I worry about accumulated start-stop cycles?
Two disks have just under 200 s-s cycles
Four disks have ~90
Other three fall in between those
Should I worry about non-medium error count?
Two disks have 7 unrepaired non-medium errors
One disk has 4
One disk has 1
Five disks have zero non-medium errors
Should I worry about total bytes of processed data? (all disks are currently a measly 4 TB each)
Three disks have processed nearly 350 TB of data each, split roughly 70% reads / 30% writes
The other disks have each processed approximately 225 TB
I’m currently running long tests on the two worst disks, but they won’t be completed for about 8 hours.
I buy used drives and have had pretty good luck with them. Most come with about 5 years of power-on hours, and I suppose that’s when enterprises start replacing them.
I watch 5 Reallocated_Sector_Ct and 197 Current_Pending_Sector, which indicate a failing drive when they start climbing. I also look at 199 UDMA_CRC_Error_Count, which can indicate cabling/controller issues. And of course always 194 Temperature_Celsius, to make sure the drives are getting decent airflow.
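If you want to automate that kind of spot check, something like this sketch is the general idea. It assumes smartmontools 7 or newer (for JSON output) and a SATA drive that reports the classic ATA attribute table; the device path and temperature cutoff are just placeholders for whatever fits your setup.

```python
#!/usr/bin/env python3
# Sketch: flag the SMART attributes mentioned above on a SATA drive.
# Assumes smartmontools 7+ (--json) and an ATA-style attribute table;
# the device path and temperature limit below are placeholders.
import json
import subprocess

WATCH = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    199: "UDMA_CRC_Error_Count",
}
TEMP_ID = 194
TEMP_LIMIT_C = 45  # my own comfort level, not a vendor spec


def check(device: str) -> None:
    out = subprocess.run(["smartctl", "--json", "-A", device],
                         capture_output=True, text=True).stdout
    table = json.loads(out).get("ata_smart_attributes", {}).get("table", [])
    for attr in table:
        attr_id, raw = attr["id"], attr["raw"]["value"]
        if attr_id in WATCH and raw > 0:
            print(f"{device}: {WATCH[attr_id]} raw={raw} -- keep an eye on this")
        elif attr_id == TEMP_ID:
            temp = raw & 0xFF  # some drives pack min/max temps into the upper bytes
            if temp > TEMP_LIMIT_C:
                print(f"{device}: running warm ({temp} C) -- check airflow")


if __name__ == "__main__":
    check("/dev/sda")  # placeholder device path
```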
9 Power_On_Hours is interesting when I want to rotate a drive into the array. I normally have a cold spare sitting on a shelf and ready to go. About every 6 months or so, I’ll swap it with the drive in the array with the highest hours. If I plan to increase the capacity of the RAID, I’ll swap in a larger-capacity drive. Someday, when all drives have been replaced with larger drives, I can resize the array. In the meantime, I take comfort that the resilver, which supposedly stresses the array, always works. IMO a scrub stresses it more, because it requires reading ALL blocks in use rather than just enough to recreate the new drive. I use Debian, which schedules monthly scrubs, and periodic scrubs are critical to array health.
TBW matters more for an SSD than an HDD. My pools serve local backup needs and are probably lightly loaded compared to the enterprise environment where the drives spent the first years of their lives.
Should I worry about accumulated start-stop cycles?
This is where it starts to get interesting. Start-stop cycles aren’t hard evidence of problems, but they contribute MUCH more to the likelihood of problems developing than power-on hours do.
I would pretty confidently expect your two ~200-cycle drives to be the first ones to go, but I wouldn’t necessarily be looking for them to go anytime soon unless I had other evidence of problems with them.
Should I worry about non-medium error count?
A bit, but anything in the single digits is unlikely to really show much of a problem.
Should I worry about total bytes of processed data? (all disks are currently a measly 4 TB each)
Definitely not. Rust drives don’t really have issues with total write count–in theory, they would, but in practice, something else is going to cause a failure long before that happens. Where stats of this kind do come into play is with SSDs, which have sharply limited write cycles (typically expressed in TBW, TeraBytes Written).
Thanks @mercenary_sysadmin and @HankB. I appreciate the replies. From everything I could tell, it seems that at the moment, none of my drives have given any strong, individual sign of failure. Not to mention, thus far, all of the SMART tests have come back fine.
I just wanted to check with people with more experience to help verify what I’m seeing. I’ll probably pick up a couple of spares to have on the side, but feel better about the current status. Cheers.
I am using 2 drives in a mirror, with backups to the cloud, to another ZFS pool (one disk), and to a drive at work.
With the above in mind, I keep drives going until they fail according to ZFS. I am 100% reliant on the brilliance of ZFS to monitor for checksum errors; once they occur, I order a new drive for the pool and that’s it.
Given that your drives have already run a long time, they will likely continue to do so, because they have survived infant mortality (I hate this phrase).
I refer people to the Backblaze posts on this topic. It’s important to note that although SMART is useful, roughly 23% of drives fail without any SMART warnings.
While no single SMART stat is found in all failed hard drives, here’s what happens when we consider all five SMART stats as a group.
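To make the “as a group” idea concrete, here is a small sketch of how I think about it. The five stats are the ones Backblaze tracks (5, 187, 188, 197, 198, if I remember the list right), and the rule is simply “any of them nonzero means the drive deserves attention”.

```python
# Sketch of the "five SMART stats as a group" idea from the Backblaze posts.
# The IDs are the five Backblaze tracks, as I recall them; the rule here is
# just "any nonzero value puts the drive on the watch list".

BACKBLAZE_FIVE = {5, 187, 188, 197, 198}

def drives_to_watch(fleet: dict[str, dict[int, int]]) -> list[str]:
    """fleet maps a drive name to {attribute_id: raw_value}."""
    return [name for name, attrs in fleet.items()
            if any(attrs.get(attr_id, 0) > 0 for attr_id in BACKBLAZE_FIVE)]

# Hypothetical example values, not real drives:
fleet = {
    "sda": {5: 0, 187: 0, 188: 0, 197: 0, 198: 0},
    "sdb": {5: 8, 187: 0, 188: 0, 197: 2, 198: 2},
}
print(drives_to_watch(fleet))  # ['sdb']
```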
I love this topic and keep a close eye on ZFS status and SMART stats. I am more worried about silent corruption of data than about a hard drive failing, which is why I love ZFS.
I’m happy to help. I also like to have spares on hand so I can respond should problems crop up. I also have multiple local backups and one off site backup of my most important stuff.
I also like to have a monitoring system in place, because I get complacent when everything is running well and neglect to monitor manually. Jim Salter likes Nagios and I prefer Checkmk (which started out as an extension of Nagios). There are other automated solutions, and I recommend having something. At a minimum, schedule periodic scrubs and configure zed to email you notification of same; that way you get a periodic reminder that the notification process is working. [1] Checkmk also monitors pools and lets me know if there are any issues with them.
[1] For a while I was using a Seagate SSD as a boot drive in my file server; it put out SMART stats in a format that smartmontools didn’t understand, which resulted in a daily email notifying me of a failed drive. It wasn’t failing, but it did confirm that the notification process was working.
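To make the “at a minimum” suggestion above concrete: even a tiny scheduled check beats relying on memory. Something along these lines, run from cron or a systemd timer, is the idea (a sketch; it only wraps `zpool status -x`, and the actual alert, whether email, Checkmk, or anything else you already trust, is deliberately left as a placeholder).

```python
#!/usr/bin/env python3
# Minimal pool health check, meant to run from cron or a systemd timer.
# It just wraps `zpool status -x`; hook the alert into whatever notification
# path you already trust (the print below is only a placeholder).
import subprocess

def pools_healthy() -> tuple[bool, str]:
    result = subprocess.run(["zpool", "status", "-x"],
                            capture_output=True, text=True)
    report = result.stdout.strip()
    # `zpool status -x` reports "all pools are healthy" when there is
    # nothing to complain about; anything else deserves a look.
    return report == "all pools are healthy", report

if __name__ == "__main__":
    ok, report = pools_healthy()
    if not ok:
        print("ZFS pool needs attention:\n" + report)
```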
I’m currently using an Intel 2.5" NVMe drive as my boot drive and, similarly, it doesn’t support many SMART stats/tests. Aside from the relatively high power-on hours (~68,000), what data I can see for it indicates it’s in a nearly “new” state: very few power cycles, even fewer shutdowns, and less read/write data processed than the disk’s capacity.
I appreciate that a disk can fail at any point and that NVMe and SSD disks often fail without warning. I had a brand new 2 TB WD Black randomly fail several months after purchase (first and last purchase of WD products; I had avoided them for many years, but the sale was too big to pass up and I needed a disk for my laptop). That being said, this being an “enterprise” disk gives me slightly more hope that it will keep going for quite a while.
Power cycle events aren’t much of an issue for solid state drives, just rust drives–the issue is the starting current in the coils of the electric motors that spin the platters on rust disks, which SSDs obviously don’t have.
Make sure you’ll survive the loss of that Intel drive, should it happen. I’ve seen particularly high “for no apparent reason” failures out of Intel SSDs over the years. I am not repeat not recommending you proactively replace it, just make sure that if and when it dies, you’ll be okay.
I don’t remember where I learned it, as it was quite a few years ago (15+), but I keep my base OS on my servers on a separate drive from everything else. So if that dies, it’s irrelevant to me.
In this case, I pop in a new drive, provision Ubuntu, install zfsutils-linux, import the pool (or likely force import it if the drive fails), and Bob’s your uncle.
If, or when, I get to that point, do you have any recommendations for NVMe drives or SSDs to use, besides general consumer models? Or maybe I’m overthinking it and they’re good enough, since this obviously isn’t enterprise scale or enterprise stress.
Glad to hear you’re fine with losing the OS drive. I run into a lot of people who are definitely not okay with losing their OS drive, as their OS ends up being something deeply personalized and kinda… nested… over the years that they have no idea how to recreate.
Recommendations for SSD? I use and highly recommend Kingston DC600M SSDs. They’re enterprise-grade, available in sizes up to 7.68TB, not all that much more expensive than “prosumer” drives like Samsung 8xx Pro, but with double the write endurance per TB and hardware QoS that makes their performance much steadier and more reliable under heavy load. (They also don’t lie about their sector size in extremely unhelpful ways like the Samsungs do.)
I don’t really have much in the way of NVMe recommendations; NVMe M.2 is pretty much all consumer-grade garbage and I get a sour taste in my mouth talking about it. The specification is extremely fast but the drives themselves very, very frequently are not, in real-world applications. Basically, if you want good NVMe you’re looking for the U.2 form factor, which means you also need a U.2 controller, and you probably don’t want to dive super far down that rabbit hole.
With that said, I’m not above using an NVMe drive if it just falls into my lap–just only for that throwaway OS you’re describing, and yes, I view the loss of an OS drive (even on my desktop) like you do. (I also have backups of everything I care about, automated and monitored, so hey.)
Of those, I’ve been pretty happy with a Hynix 2TB drive in my current workstation, and I wasn’t mad at WD Black NVMe M.2 drives. Nothing else particularly stood out, especially since “this cheap consumer drive is a massive disappointment that doesn’t even faintly live up to the crap the vendor is claiming for it” doesn’t particularly stand out, if you catch my drift.
That’s what this Intel drive is. And this server is currently the only thing I have that is set up for U.2. I could have sworn I’d read that the Intel U.2 drives weren’t bad, but I guess that was bad information.
Haha the thought had crossed my mind several times over the past year, even before this conversation, here.
I don’t really care much about consumer grade nvme, for exactly the reasons you state. Granted, I will say, I currently have a Seagate Firecuda in my workstation that I got on a great deal after the holidays and performance has been surprisingly decent.
Now that you mention them, I recall you talked about them a couple of times on 2.5 Admins. I’d just forgotten.
I started doing that with my “daily driver” computers (not just my server) about 6 or 7 years ago when I got my first laptop with space for two drives. Then did that when I built my desktop. It takes a lot, if not most, of the stress out of having a computer, in my opinion.
My backups aren’t automated (nor monitored) on my desktop/laptop. Automation is something new for me since the other topic here about setting up sanoid/syncoid on my main server. Now that I have that figured out successfully on the server, I may well try it out on my desktop.
I have zero experience with the Intel U.2 drives. It’s at least possible, and perhaps even plausible, that those would be good despite their consumer drives having been trash for quite a while.
That much I had known. Now I understand (or at least am more confidently guessing) that their consumer drives are what you were referring to in your earlier reply.
Yeah, after the experiences I had with Intel’s Cherryville line I had zero interest in testing their enterprise gear. I tend to subscribe to the school of thought that if a vendor is pushing trash in one product line, you shouldn’t trust them much in their other lines either.
True. This is why I love Linux. I recently upgraded my desktop and simply reinstalled Mint (with ZFS, of course) and restored my home directory; a quick restart and voila! Everything was back as it was. After installing the packages I needed, it was as if I’d never left. Try that on Windows.