Introduction: Ifaz Kabir

Hi all, Ifaz here!

Just joined and thought I’d introduce myself. I’m a software developer with Go and Rust experience. I also did some graduate studies in Programming Languages.

Been using ZFS for around a year and a half now, and Proxmox for a year. I’ve learned a lot along the way, and lost some data in the process, so I have some stories to share!


Story time!

Last year, I bought a machine with four SATA bays and two NVMe bays. I had some old small hard drives, and bought a new 8TB one. I also bought an NVMe drive to be my boot drive. I hadn’t used anything with slide-out bays before (foreshadowing), and I had a lot of fun loading all my drives in! I initially installed the Pop!_OS COSMIC Alpha, added the ZFS utilities, and got going. The COSMIC desktop at that point was extremely buggy, but I was mostly SSHing into the machine for my dev work, so it wasn’t a big deal. And the Ubuntu base worked flawlessly.

Went through several iterations, tried out different vdev configurations, set up scrubs, and thought: nice, I finally have storage redundancy. But a month or so in, my scrubs started failing. The 8TB drive was showing SMART errors, and even the mirrors of my old drives were failing and showing IO errors. I lost some files in the process. But it was amazing how ZFS showed exactly which individual files I had lost. Most of the files I lost weren’t needed anymore anyway, so I got mostly lucky!
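For anyone who hasn’t hit this yet, here’s roughly what that looks like (a hedged illustration; the pool name and file paths are made up): after a failed scrub, `zpool status -v` lists the individual files with permanent errors, which is how you know exactly what was lost.

```shell
# Illustrative only: pool name and file paths below are placeholders.
zpool status -v tank
#   ...
#   errors: Permanent errors have been detected in the following files:
#
#           /tank/photos/2023/IMG_0412.jpg
#           /tank/scratch/build-cache.tar
```

Anything not in that list is still intact and verified by checksums, which is a big part of why a failing scrub is informative rather than just alarming.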

I returned the 8TB drive, but I suspected it was the machine and not the drives. I couldn’t return the unit just yet, though: it was the only thing I had that ran Linux, and I needed Linux for work. The boot drive was working just fine, so I moved all the files to the boot drive and didn’t connect any SATA drives for a bit. A few months later, I remembered the Linus Tech Tips video about their ZFS data loss, recalled that they had a backplane issue, and realized it was probably my backplane. I took apart the machine, and sure enough, the backplane wasn’t connected to the board properly.

Even though I fixed the connection, I wasn’t ready to trust the backplane just yet. I was using a 2TB drive as my Time Machine backup, and I put that in. That drive ran for months without any issues, whereas the earlier drives had been failing within days or weeks. The 2TB drive did eventually die, but this time it really was the drive, and ZFS warned that it should be replaced. I’m using other drives now, and the scrubs have been clean for months (though not quite a full year under the current setup).

Mistakes made, including ones not in the story:

  1. Buying an 8TB hard drive. The price of those makes zero sense.
  2. There were lots of vdev setup mistakes as well, too many to tell as part of a story. But these are unavoidable parts of the learning process; it’s hard to get a feel for what makes sense until you’ve dealt with issues. Currently my pools each have a single mirror vdev.
  3. There were lots of dataset mistakes along the way as well, using root datasets, etc.
  4. At one point I tried to use Cockpit for web-based management. Cockpit’s CPU usage was really heavy whenever I used it, which is what pushed me to Proxmox.
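For reference, the single-mirror-vdev layout I landed on is about the simplest redundant pool you can build. A hedged sketch (pool name and device paths are placeholders; stable `/dev/disk/by-id` paths are generally preferred over `/dev/sdX`):

```shell
# Placeholders throughout: "tank" and the by-id paths are examples, not my setup.
zpool create tank mirror \
    /dev/disk/by-id/ata-DISK_A \
    /dev/disk/by-id/ata-DISK_B

# The layout should show a single mirror vdev containing both disks.
zpool status tank
```

The appeal of a single mirror is that every block exists on both disks, resilvers are straightforward, and you can grow later by attaching another mirror vdev to the pool.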

Lessons learned:

  1. When you get a new machine, take it apart and press in all the connections. Things disconnect during shipping!
  2. Set up backups first, then worry about redundancy.
  3. Make sure the way your hard drives are connected doesn’t have a single point of failure. I’m currently connecting some of my drives over USB (I know, I know) for connection-path redundancy.

Fun fact, I’ve been porting syncoid to Rust as a way of learning all that it has to offer. I’m midway through the process, but damn syncoid deals with a lot!


Welcome, ifaz! You gonna do Sanoid after you’re done with syncoid? :cowboy_hat_face:

I considered a rewrite, but every time I looked at the work I’d need to do to handle the hashes without perl and the incredible set of libraries available for it, I noped right the hell back out :rofl:

Yeah, doing sanoid as well was always the plan! The repo is named after the sanoid port: GitHub - ifazk/chobi: Zfs snapshot and replication tools. :grin: (Chobi means photo/picture, and chithi means letter.)

I’m pretty far along on the syncoid port (chithi). I just added the bookmark fallback for syncing a couple of days ago, and I’m currently working on bandwidth limits. So I’m pretty confident that I’ll at least finish the syncoid port.
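For anyone unfamiliar with the bookmark fallback, here’s the underlying ZFS mechanism it relies on, as a hedged sketch (dataset and snapshot names are made up; this illustrates the ZFS feature, not chithi’s exact internals): a bookmark preserves the point-in-time reference needed for an incremental send even after the source snapshot has been pruned.

```shell
# Illustrative names only. A bookmark is a lightweight reference to a snapshot.
zfs bookmark tank/data@sync1 tank/data#sync1
zfs destroy tank/data@sync1      # snapshot pruned; the bookmark survives
zfs snapshot tank/data@sync2

# Incremental send from the bookmark instead of the (now gone) snapshot:
zfs send -i tank/data#sync1 tank/data@sync2 | zfs recv backup/data
```

This is what lets replication resume after aggressive snapshot pruning on the source, as long as the receiving side still has the snapshot the bookmark points at.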

I’ve added a few things from the What do you use to manage/schedule syncoid replication? discussion to the roadmap, so there’s a bit of feature creep there. :sweat_smile:

The whole thing is a bit of a hedge against Perl dependence tbh, so there are some strong motivations for me to finish both. But motivations come and go, so we’ll see!