Question: can adding a disk to create a mirror from an existing single-disk pool that has experienced READ, WRITE, and CKSUM errors lead to corrupt data in the mirror?
Setup is:
- tank pool at home
- rent pool in a remote data center
- Datasets are using native ZFS encryption

tank has been running for a couple of years as a mirror and has never thrown any READ, WRITE, or CKSUM errors. Scrubs run on it monthly without issue or error.
# Tank
root@home:~# zpool status tank
pool: tank
state: ONLINE
scan: scrub repaired 0B in 23:27:02 with 0 errors on Tue Sep 16 12:24:05 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
errors: No known data errors
# Rent
user@remote:~$ zpool status -v rent
pool: rent
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
scan: scrub repaired 0B in 1 days 09:13:11 with 0 errors on Mon Sep 15 09:37:15 2025
config:
NAME STATE READ WRITE CKSUM
rent ONLINE 0 0 0
sda ONLINE 0 0 2
errors: No known data errors
Data is replicated via syncoid:
syncoid --sendoptions="w" --no-privilege-elevation --no-sync-snap -r tank/ds1 user@remote:rent/ds1
rent was not reporting any corrupt data files via zpool status -v.
To add redundancy, I added a second disk to rent to create a mirror vdev. Resilvering began, during which I observed:
user@remote:~$ zpool status
pool: rent
state: ONLINE
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Thu Oct 2 18:11:18 2025
126G / 8.79T scanned at 572M/s, 31.2G / 8.79T issued at 141M/s
31.2G resilvered, 0.35% done, 18:03:10 to go
config:
NAME STATE READ WRITE CKSUM
rent ONLINE 0 0 0
mirror-0 ONLINE 1 0 0
sda ONLINE 1 0 0
sdb ONLINE 0 0 1 (resilvering)
errors: No known data errors
A READ error leads to a CKSUM error on the other disk being resilvered?
At the completion of the resilver:
user@remote:~$ zpool status
pool: rent
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
scan: resilvered 8.81T in 15:41:41 with 0 errors on Fri Oct 3 09:52:59 2025
config:
NAME STATE READ WRITE CKSUM
rent ONLINE 0 0 0
mirror-0 ONLINE 90 0 0
sda ONLINE 98 0 0
sdb ONLINE 0 0 90
90 READ errors on the mirror vdev (98 on the source disk, sda), and 90 CKSUM errors on the disk added to the mirror and resilvered (sdb).
Then I triggered a scrub of rent, after which:
user@remote:~$ zpool status
pool: rent
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
scan: scrub repaired 43.0M in 13:18:07 with 0 errors on Fri Oct 3 23:54:32 2025
config:
NAME STATE READ WRITE CKSUM
rent DEGRADED 0 0 0
mirror-0 DEGRADED 90 0 0
sda FAULTED 230 2 0 too many errors
sdb ONLINE 0 0 90
errors: No known data errors
Now the scrub is making repairs? zpool status -v does not report any corrupted or damaged files, though.
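To make the counters in the status outputs above easier to compare between runs, a small filter like the following could be used. This is only a minimal sketch: the function name `zpool_errs` is hypothetical, and it assumes the usual `NAME STATE READ WRITE CKSUM` config table layout printed by `zpool status`.

```shell
# Hypothetical helper: list devices with non-zero READ/WRITE/CKSUM counters.
# Reads `zpool status` text on stdin; assumes the standard config table layout.
zpool_errs() {
  awk '
    $1 == "NAME" && $2 == "STATE" { in_cfg = 1; next }  # config table header
    $1 == "errors:"               { in_cfg = 0 }        # line after the table
    in_cfg && NF >= 5 && ($3 + $4 + $5) > 0 {
      printf "%s read=%s write=%s cksum=%s\n", $1, $3, $4, $5
    }
  '
}

# Usage: zpool status rent | zpool_errs
```

Running it after each resilver or scrub and saving the output would show whether the counts on sda keep growing.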
Questions:
- Did sda on rent have a risk of any actual corruption, or did those READ errors simply indicate “disk not healthy” but then a retry of the READ worked?
- If there was a risk of corruption, am I correct in assuming that adding sdb to rent would copy any corruption to that new disk in the mirror? Is the pool actually protecting me from a failure of sda?
- During the resilver, the synchronization between READ errors on the source disk and CKSUM errors on the newly added disk - is that related? If there was a READ error, the relevant CKSUM would be wrong, leading to those CKSUM errors?
Appreciate any help.
I believe a bad SATA cable and controller have been ruled out. Next step is a SMART long test on sda. If the result indicates a failing drive, I’m wondering if the right course of action is to replace sda, nuke the pool, create a new mirror pool, and try sending the full datasets again from tank.
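For reference, the SMART check I have in mind looks roughly like this. It is a sketch only, assuming smartmontools is installed on the remote host and that the suspect disk is still /dev/sda; the function names and the `DISK` variable are just placeholders.

```shell
# Sketch of the planned SMART check (run as root on the remote host).
# Assumes smartmontools is installed; DISK defaults to the suspect device.
DISK=${DISK:-/dev/sda}

smart_long_test() {
  # Start the extended (long) self-test; it runs in the background on the drive.
  smartctl -t long "$DISK"
}

smart_results() {
  # Self-test log: shows whether the long test completed or hit a read failure.
  smartctl -l selftest "$DISK"
  # Vendor attributes, e.g. reallocated and pending sector counts.
  smartctl -A "$DISK"
}
```

The long test can take many hours on a disk this size, so `smart_results` would be run once the drive reports the test as finished.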