Write only error after scrub?

jblondreddit · August 14, 2023, 6:57am

I get this write error on my Proxmox server after a scrub. However, the last line says there are no errors, so I am a bit confused.

ZFS has finished a scrub:

   eid: 28
 class: scrub_finish
  host: pve-04
  time: 2023-08-13 00:24:14+0200
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:00:13 with 0 errors on Sun Aug 13 00:24:14 2023
config:

	NAME                                                     STATE     READ WRITE CKSUM
	rpool                                                    ONLINE       0     0     0
	  mirror-0                                               ONLINE       0     0     0
	    ata-SAMSUNG_MZ7KH240HAHQ-00005_S47LNA0T204189-part3  ONLINE       0     5     0
	    ata-SAMSUNG_MZ7KH240HAHQ-00005_S47LNA0T204188-part3  ONLINE       0     0     0

errors: No known data errors

kaihp · August 14, 2023, 12:05pm

ZFS detected an error on the first disk (S47LNA0T204189), fixed it with the content from the other side of the mirror (S47LNA0T204188) and moved on in life.

I would check the 189 disk with at least a long SMART test (smartctl -t long /dev/sdXXX) and possibly also a full block test. You might have to run that under Windows using Samsung’s Magician software. Note that you have to run hdparm -B 254 /dev/sdXXX first to avoid the sleep timeout (which cancels the SMART run).

mercenary_sysadmin · August 14, 2023, 10:28pm

Yes, this is confusing, I agree. And you’ve already gotten one well-meaning but not-quite-correct answer.

So, here’s the deal: the CKSUM column, and only the CKSUM column, tells you when you’ve got errors in data. The READ and WRITE columns are for when you have hardware-level I/O errors–not “I asked for a block, and I got one back but its checksum didn’t pass” but “I asked for a block, and the disk said no way dork, go away.”
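To make the column distinction concrete, here is a rough Python sketch that reads the device rows from the status output posted above and separates hardware I/O errors (READ/WRITE) from data errors (CKSUM). The `classify` function and the `ROW` pattern are made-up names for illustration, not part of any ZFS tooling, and the parser assumes the counters are plain integers, which may not hold for every `zpool status` output.

```python
import re

# Device rows copied from the `zpool status` output earlier in this thread.
SAMPLE = """\
    NAME                                                     STATE     READ WRITE CKSUM
    rpool                                                    ONLINE       0     0     0
      mirror-0                                               ONLINE       0     0     0
        ata-SAMSUNG_MZ7KH240HAHQ-00005_S47LNA0T204189-part3  ONLINE       0     5     0
        ata-SAMSUNG_MZ7KH240HAHQ-00005_S47LNA0T204188-part3  ONLINE       0     0     0
"""

# Hypothetical pattern: name, state, then three integer counters.
# Assumption: counters are plain digits; abbreviated counts would not match.
ROW = re.compile(r"^\s*(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def classify(status_text):
    """Return {device: [notes]} for every row with a non-zero counter."""
    findings = {}
    for line in status_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header line or anything that is not a device row
        name, _state, read, write, cksum = m.groups()
        read, write, cksum = int(read), int(write), int(cksum)
        notes = []
        if read or write:
            # The disk refused or failed the request outright.
            notes.append(f"I/O errors (read={read}, write={write})")
        if cksum:
            # The disk returned a block, but its checksum did not match.
            notes.append(f"checksum errors ({cksum})")
        if notes:
            findings[name] = notes
    return findings


if __name__ == "__main__":
    for device, notes in classify(SAMPLE).items():
        print(device, "->", "; ".join(notes))
```

Run against the sample, only the 189 disk is reported, and only with I/O errors, which matches the reading above: write requests failed at the hardware level, but no bad data was found.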

So, what you’re looking at here is a transient error when five write requests to one of your two drives were met with hard I/O errors. But since you don’t have any CKSUM errors, that means that the drive in question did eventually manage to fulfill those 1-5 write requests (we don’t know if it was five unrelated requests, or a single block it tried five times to write before finally succeeding on the sixth).

I wouldn’t generally be too worried about it. Keep an eye on it, sure, but you should be doing that anyway. If it happens again, swap out the SATA cable for a new one (a NEW one, not one you found languishing in the bottom of a drawer) and see if the errors go away. If it still happens after that, you can either try moving the cable to a different SATA port, or you can decide to preemptively replace the SSD in question (understanding that the problem still might not actually be the SSD; it could be the SATA controller).