Just an anecdote about how a single misbehaving drive can cripple a pool

I’m currently in the middle of a full replication onto my backup host, and I was wondering why it was taking so long. I started out yesterday saturating the gigabit link between my primary NAS and the backup, but today it was down to less than a quarter of that.

Tonight I dug into it a bit, and using iostat I noticed that exactly ONE of the drives in my pool (a four-drive pool of two mirror vdevs) had an average queue length between 6 and 7, and its utilization was constantly above 100%!
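For anyone curious, this is roughly the check I was running (a sketch; it reports on every block device, so pick out your pool members):

```
iostat -x 2
# Watch the aqu-sz (average queue length) and %util columns;
# older sysstat versions label the queue column avgqu-sz.
# One mirror member with a deep queue while its partner idles
# is exactly the pattern I saw.
```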

I offlined the drive from the pool and the speed of the replication immediately jumped back up to saturating the link.
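The offline itself was just the stock command (pool and device names here are placeholders, not my actual ones):

```
zpool offline tank sdc
# After replacing or testing the drive, bring it back with:
# zpool online tank sdc
```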

I’ve mostly been buying refurbished Seagate IronWolf drives for home use, but my luck with them lately has not been fantastic. (One disk failed within 3 months, several arrived DOA, and now this. This drive was installed less than a month ago.)


I do a silly amount of hunting down slow pool members with zpool iostat -wv.
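Something like this, with tank standing in for your pool name:

```
zpool iostat -wv tank 5
# -w prints latency histograms, -v breaks them out per vdev and disk;
# a single disk with a fat tail in its histogram stands out quickly.
```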


Interesting, I didn’t realize that zfsutils had its own iostat tool.

I was just using the standard iostat on Linux wrapped in a watch command.
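Something along these lines:

```
watch -n 5 "iostat -x 1 2"
# iostat's first report is an average since boot, so sampling
# twice and reading the second block gives the current picture.
```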

They’re both useful. Standard iostat is better at diagnosing drive issues; zpool iostat can give you some additional ZFS-specific information but loses some of the grittier hardware-level info.
