ZFS reports device removed, then finishes a resilver seconds later

vuduguru · June 4, 2024, 11:56pm

Perplexed by this. The system produces no other errors. The pool is lightly used as a nightly replication target and otherwise operates normally.

Details

System: Dell R730XD
HBA: Dell H710 mini monolithic 5CT6D with LSI 9207-8i P20 IT Mode
Disks: Seagate 16TB ST16000NM001G x 8
OS: Proxmox 7.4-16

Have replaced the HBA, cables and backplane.

Pool consists of 8 disks in striped mirrors. 4 of the drives have each reported these errors.
Any ideas appreciated.

zpool list
NAME      SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
storage  58.2T  6.98T  51.2T        -         -     3%    11%  1.00x    ONLINE  -

Examples from the error emails (4 separate emails):

ZFS has detected that a device was removed.

 impact: Fault tolerance of the pool may be compromised.
    eid: 31323
  class: statechange
  state: REMOVED
   host: fs3
   time: 2024-06-05 08:31:54+1000
  vpath: /dev/disk/by-id/ata-ST16000NM001G-2KK103_xxxxxxLX-part1
  vphys: pci-0000:02:00.0-sas-exp0x500056b36789abff-phy13-lun-0
  vguid: 0x7204E976EA9A378E
  devid: ata-ST16000NM001G-2KK103_xxxxxxLX-part1
   pool: storage (0xC511B90B829B373C)

ZFS has detected that a device was removed.

 impact: Fault tolerance of the pool may be compromised.
    eid: 31333
  class: statechange
  state: REMOVED
   host: fs3
   time: 2024-06-05 08:31:56+1000
  vpath: /dev/disk/by-id/ata-ST16000NM001G-2KK103_xxxxxxLX-part1
  vphys: pci-0000:02:00.0-sas-exp0x500056b36789abff-phy13-lun-0
  vguid: 0x7204E976EA9A378E
  devid: ata-ST16000NM001G-2KK103_ZxxxxxxLX-part1
   pool: storage (0xC511B90B829B373C)

ZFS has finished a resilver:

   eid: 31331
 class: resilver_finish
  host: fs3
  time: 2024-06-05 08:31:55+1000
  pool: storage
 state: ONLINE
  scan: resilvered 192K in 00:00:01 with 0 errors on Wed Jun  5 08:31:55 2024
config:

ZFS has finished a resilver:

   eid: 31340
 class: resilver_finish
  host: fs3
  time: 2024-06-05 08:32:54+1000
  pool: storage
 state: ONLINE
  scan: resilvered 240K in 00:00:01 with 0 errors on Wed Jun  5 08:32:54 2024
config:
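
To line up which removal goes with which resilver, the emails can be parsed into fields and their timestamps compared. Below is a rough sketch in Python, not anything ZFS ships. The helper names `parse_zed_mail` and `event_time` are made up for illustration, and the two sample bodies are trimmed copies of the emails above.

```python
from datetime import datetime

# Trimmed copies of two notification bodies from the emails above.
REMOVED_MAIL = """\
ZFS has detected that a device was removed.

 impact: Fault tolerance of the pool may be compromised.
    eid: 31323
  class: statechange
  state: REMOVED
   host: fs3
   time: 2024-06-05 08:31:54+1000
"""

RESILVER_MAIL = """\
ZFS has finished a resilver:

   eid: 31331
 class: resilver_finish
  host: fs3
  time: 2024-06-05 08:31:55+1000
  pool: storage
 state: ONLINE
"""


def parse_zed_mail(text):
    """Hypothetical helper: collect the 'key: value' lines of one email body."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as timestamps stay whole.
        key, sep, value = line.partition(":")
        key = key.strip()
        # Skip blank lines and the headline sentence (its "key" contains spaces).
        if sep and key and " " not in key:
            fields[key] = value.strip()
    return fields


def event_time(fields):
    """Hypothetical helper: turn the 'time' field into a timezone-aware datetime."""
    return datetime.strptime(fields["time"], "%Y-%m-%d %H:%M:%S%z")


removed = parse_zed_mail(REMOVED_MAIL)
resilver = parse_zed_mail(RESILVER_MAIL)
gap_seconds = (event_time(resilver) - event_time(removed)).total_seconds()
print(removed["state"], "->", resilver["class"], "after", gap_seconds, "seconds")
```

For the pair above the gap comes out at about one second, and the resilver is only 192K. That looks more like a brief drop and reconnect than a disk that stayed away for any length of time, though that is only a guess from the timestamps.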

mercenary_sysadmin · June 5, 2024, 1:58am

If you’ve already replaced HBA, cables, and backplane, the next most likely SPoF to produce the symptoms you describe would be the PSU. If you’re dipping low on power occasionally while lighting up all eight drives maximally for a scrub, that could produce transient bus disconnects like you’re seeing here.

vuduguru · June 5, 2024, 2:48am

Thanks, good point. I was focused on the disk subsystem and hadn’t considered power delivery. The server is a Dell with dual 750W PSUs, typically drawing around 210 to 320 watts total, which points to the motherboard or the power cables connecting motherboard to backplane or PSU to motherboard. iDRAC shows the PSUs performing as expected.
Hmmmm.

vuduguru · July 9, 2024, 11:12pm

Interestingly, I upgraded from Proxmox 7 to 8 just last week and the regular “ZFS has finished a resilver:” emails have stopped. Which is nice, as I had been looking for a suitable replacement R730XD motherboard.

Had also swapped the PSUs in the interim, without improvement.