ZFS import failures caused by existent/nonexistent segments in range trees

My questions:

  1. Does the below error suggest a duplicate entry in the spacemap or a duplicate free range allocation? (I have only a basic knowledge of ZFS internals)
  2. Did using zfs send/recv from one pool to another pool copy the duplicate allocation? (Is this because send/recv copies at the block level?)
  3. Now that I can import the pool after setting some ZFS tunables, will ZFS’ condensing operations ‘fix’ this duplication?

Background:

After a day of power outages (guess who has now set up a UPS), my server panicked whenever it tried to import one of my pools.

panic: Solaris(panic) zfs: adding existent segment to range tree (offset=2d8035c000 size=1000)

<and more of a backtrace>
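
For anyone following along, here is how I currently picture the two error messages. It is a toy model in Python, not the actual ZFS code: the class name ToyRangeTree and its methods are made up for illustration, and it assumes a range tree is simply a sorted set of non-overlapping [start, end) segments. Adding a range that overlaps an existing segment fails with the "adding existent segment" message, and removing a range that is not fully covered fails with the "removing nonexistent segment" message.

```python
import bisect


class ToyRangeTree:
    """Teaching model only: sorted, non-overlapping [start, end) segments."""

    def __init__(self):
        self.starts = []  # sorted segment start offsets
        self.ends = []    # ends[i] is the end of the segment at starts[i]

    def add(self, offset, size):
        start, end = offset, offset + size
        # First segment that ends after our start; it overlaps if it also
        # begins before our end.
        i = bisect.bisect_right(self.ends, start)
        if i < len(self.starts) and self.starts[i] < end:
            raise ValueError(
                f"adding existent segment to range tree "
                f"(offset={offset:x} size={size:x})")
        i = bisect.bisect_left(self.starts, start)
        # Merge with the neighbour on the left if it touches us.
        if i > 0 and self.ends[i - 1] == start:
            i -= 1
            start = self.starts[i]
            del self.starts[i], self.ends[i]
        # Merge with the neighbour on the right if it touches us.
        if i < len(self.starts) and self.starts[i] == end:
            end = self.ends[i]
            del self.starts[i], self.ends[i]
        self.starts.insert(i, start)
        self.ends.insert(i, end)

    def remove(self, offset, size):
        start, end = offset, offset + size
        # Last segment that begins at or before our start.
        i = bisect.bisect_right(self.starts, start) - 1
        if i < 0 or self.ends[i] < end:
            raise ValueError(
                f"removing nonexistent segment from range tree "
                f"(offset={offset:x} size={size:x})")
        seg_start, seg_end = self.starts[i], self.ends[i]
        del self.starts[i], self.ends[i]
        # Put back whatever is left on either side of the removed range.
        if end < seg_end:
            self.starts.insert(i, end)
            self.ends.insert(i, seg_end)
        if seg_start < start:
            self.starts.insert(i, seg_start)
            self.ends.insert(i, start)


if __name__ == "__main__":
    tree = ToyRangeTree()
    tree.add(0x2d8035c000, 0x1000)          # first add is fine
    try:
        tree.add(0x2d8035c000, 0x1000)      # same range again
    except ValueError as err:
        print(err)
    tree.remove(0x2d8035c000, 0x1000)       # first remove is fine
    try:
        tree.remove(0x2d8035c000, 0x1000)   # same range again
    except ValueError as err:
        print(err)
```

If this picture is roughly right, the panic at import and the later zdb warning would be two views of the same 4 KiB range being recorded inconsistently, but that is my guess rather than something I have confirmed.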

Six passes of Memtest showed that my memory was fine, so I assumed that the power outages had somehow caused a data error.

I was eventually able to import the pool with readonly=on and use zfs send/recv to re-create the pool on two new disks.

This new pool (new disks, connected to the motherboard SATA instead of the HBA) worked for a day and then started showing the same error.

I was able to import the pool after setting these sysctl options:

vfs.zfs.spa.load_verify_data=0
vfs.zfs.spa.load_verify_metadata=0
vfs.zfs.recover=1
vfs.zfs.zil.replay_disable=1

I could then run a scrub, which returned no errors, but I came across a similar error when I ran zdb:

sudo zdb -AAA -b FastPool
Password:

Traversing all blocks to verify nothing leaked ...

loading concrete vdev 0, metaslab 91 of 116 ...WARNING: zfs: removing nonexistent segment from range tree (offset=2d8035c000 size=1000)
loading concrete vdev 0, metaslab 115 of 116 ...
96.8G completed (6760MB/s) estimated time remaining: 0hr 00min 00sec        leaked space: vdev 0, offset 0x2d8035e000, size 4096

	No leaks (block sum matches space maps exactly)

	bp count:               4010548
	ganged count:                 0
	bp logical:        154767922688      avg:  38590
	bp physical:       103859914752      avg:  25896     compression:   1.49
	bp allocated:      108874661888      avg:  27147     compression:   1.42
	bp deduped:                   0    ref>1:      0   deduplication:   1.00
	bp cloned:                    0    count:      0
	Normal class:      108874633216     used: 44.09%
	Embedded log class          12288     used:  0.00%

	additional, non-pointer bps of type 0:     327594
	Dittoed blocks on same vdev: 390548
	Dittoed blocks in same metaslab: 2
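
A quick sanity check on the numbers in that output, as a small Python sketch. It assumes the offset and size in the warning are hexadecimal byte values (the variable names are mine), and it only does arithmetic on the values printed above.

```python
# Values copied from the zdb output above; assumed to be hex byte counts.
warning_offset = 0x2d8035c000   # "removing nonexistent segment" offset
warning_size = 0x1000           # size from the same warning
leaked_offset = 0x2d8035e000    # "leaked space" offset
leaked_size = 4096              # "leaked space" size (printed in decimal)

gap = leaked_offset - warning_offset
blocks_apart = gap // warning_size

print(f"warning segment size: {warning_size} bytes")
print(f"leaked block starts {gap:#x} bytes ({blocks_apart} blocks) after the warning offset")
print(f"sizes match: {warning_size == leaked_size}")
```

So the segment in the warning and the leaked block are both 4 KiB and sit two 4 KiB blocks apart, which makes me suspect they are related, though I can't say how.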

Summoning @allan for this one…

Hi,

Did you have any luck resolving this? I’ve been having a terrible time because of the same issue.

Unfortunately, there seems to be little interest in fixing it even though it has been an issue for years, so I’m starting to wonder whether ZFS is fit for production use.