Hi folks! An article I wrote on the basics of monitoring just went live at Klara Systems. It includes breadcrumbs leading those of you who don’t want to learn and run Nagios to a free-to-use, easily configured web service that Sanoid plugs into very nicely!
Since this article by @mercenary_sysadmin came out, I have had these healthchecks running for about two months. This week I started getting notices from my healthchecks.io account about failed `sanoid --monitor-health` pings. I probably wouldn’t have been checking it otherwise. It turns out one of my mirrored Samsung 980 Pro 1TB devices was in “Removed” status. For some reason the system could not see it anymore. `sudo fdisk -l` showed nothing.
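For anyone wanting to wire up the same kind of `sanoid --monitor-health` ping, here is a minimal Python sketch of the idea. It assumes the check follows the Nagios convention that exit code 0 means healthy and anything else means a problem, and it uses the healthchecks.io convention of appending `/fail` to the ping URL to signal a failure. The UUID in `PING_URL` and the function names are hypothetical placeholders, not anything taken from the article.

```python
import subprocess
import urllib.request

# Hypothetical placeholder: substitute the ping URL of your own healthchecks.io check.
PING_URL = "https://hc-ping.com/your-uuid-here"


def ping_url_for(base_url, exit_code):
    """Return the plain ping URL for exit code 0, otherwise the /fail URL."""
    base = base_url.rstrip("/")
    return base if exit_code == 0 else base + "/fail"


def run_check(command=("sanoid", "--monitor-health")):
    """Run the health check and return (exit_code, output_text)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, result.stdout.strip()


def report(base_url, exit_code, output, timeout=10):
    """POST the check output to the ping URL so it appears in the ping log."""
    request = urllib.request.Request(
        ping_url_for(base_url, exit_code),
        data=output.encode("utf-8"),
    )
    with urllib.request.urlopen(request, timeout=timeout):
        pass


def main():
    """Run the check once and report the result."""
    exit_code, output = run_check()
    report(PING_URL, exit_code, output)
```

Calling `main()` from a cron job or systemd timer every few minutes would give roughly the behavior described above: a missed or failed ping is what triggers the notice.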
I tried a reboot, with the same result. I then tried a full shutdown and a cold start, and the device came back online. Strange; I guess the device controller had a power hitch or something. I am wondering if I should trust it or replace it. `smartctl` gives a good report, and the BIOS SMART test passes it. `zpool status` does show a 3 in the checksum column; I'm not sure what that means.
These things now require a 2nd mortgage on the home to purchase, good golly.
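On the checksum column mentioned above: CKSUM in `zpool status` counts checksum errors, meaning reads from that device returned data that did not match the checksum ZFS had stored for it. As a rough sketch, the snippet below picks out any device row with a nonzero READ, WRITE, or CKSUM counter. The `SAMPLE` text is invented for illustration (pool and device names are hypothetical), and the parser assumes the usual `NAME STATE READ WRITE CKSUM` table layout.

```python
# Invented sample output for illustration only; not taken from the system in this thread.
SAMPLE = """\
  pool: tank
 state: ONLINE
config:

	NAME         STATE     READ WRITE CKSUM
	tank         ONLINE       0     0     0
	  mirror-0   ONLINE       0     0     0
	    nvme0n1  ONLINE       0     0     0
	    nvme1n1  ONLINE       0     0     3

errors: No known data errors
"""


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with any nonzero counter."""
    flagged = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_config = True  # device rows follow this header
            continue
        if not in_config or len(fields) < 5:
            continue
        name = fields[0]
        counters = tuple(fields[2:5])
        # Skip non-device lines such as the trailing "errors: ..." summary.
        if not all(c[0].isdigit() for c in counters):
            continue
        # Counters stay as strings because zpool may abbreviate large values.
        if any(c != "0" for c in counters):
            flagged[name] = counters
    return flagged


print(devices_with_errors(SAMPLE))
```

To use it on a live system, the same function could be fed the text of a real `zpool status` run instead of `SAMPLE`.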
Well, cabling isn’t going to be the issue with an NVMe drive, obviously. Does this system have a UPS protecting it from power irregularities?
I do have a power backup, a Solix c1000.
Ok so we can eliminate environmental power issues.
For an NVMe M.2 drive that spontaneously disappears from the system, that leaves us with the following potential failures, in rough order of decreasing likelihood:
- Drive
- Motherboard
- RAM
- PSU
- CPU
By far, drive is most likely out of the remaining potential culprits.