Do checksum errors always originate from a bad drive/bitrot?

I understand that backplane and connectivity issues can cause read and write errors, but what about checksum errors in ZFS? As far as I can tell, a checksum error means the data was read from the drive but turned out to be invalid. Are there scenarios (excluding RAM) that could cause successful reads but checksum failures?

Literally anything that corrupts data, whether on-disk or in-flight, will result in checksum validation errors.

  • cosmic ray flips a bit literally on the platter
  • cosmic ray flips a bit in flight (e.g., as it passes across a SATA cable)
  • magnetic medium and/or cell charge degradation flips a bit on the device
  • faulty controller
  • faulty cable
  • power issue

In some of the above cases, there are also considerably weaker checksums in play at levels beneath ZFS: for example, a bit flip in a SATA cable will typically be caught by the controller, because there is an extremely weak checksum algorithm in play as data is sent over the cable. Similarly, there’s a weak checksum in hardware on modern storage devices. Those weak checksums are sufficient to detect most errors and force a re-read, which may or may not give better results. They are definitely not sufficient to rule out collisions, where corrupted data happens to pass the weak check, and such collisions can be quite common in badly deranged gear. When that happens, the read looks successful to the layers beneath ZFS, and ZFS's own checksum is what flags the bad data.
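To illustrate how corrupted data can slip past a weak checksum and still be caught by a stronger one, here is a minimal Python sketch. The 8-bit additive checksum is a toy stand-in chosen purely for illustration, not the algorithm a SATA link or drive firmware actually uses. SHA-256 stands in for the strong check (it is one of the checksum algorithms ZFS can be configured to use). The payload and the helper name `additive_checksum` are made up.

```python
import hashlib


def additive_checksum(data: bytes) -> int:
    """Toy weak checksum (illustrative only): sum of all bytes modulo 256."""
    return sum(data) % 256


original = b"ZFS block payload"

# Simulate corruption in which two errors cancel out:
# one byte is incremented and another is decremented.
damaged = bytearray(original)
damaged[0] += 1
damaged[1] -= 1
corrupted = bytes(damaged)

# The weak checksum is identical for both buffers, so it misses the corruption.
weak_detects = additive_checksum(original) != additive_checksum(corrupted)

# The strong hash differs, so the corruption is detected.
strong_detects = (
    hashlib.sha256(original).digest() != hashlib.sha256(corrupted).digest()
)

print("weak checksum detects corruption:  ", weak_detects)
print("strong checksum detects corruption:", strong_detects)
```

The weak check reports a clean read here, which is the "successful read but checksum failure" case from the question: only the stronger check higher up notices anything is wrong.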


I had a case with a raidz1 volume of 3 disks where scrubs regularly revealed checksum errors. It took me almost a month to figure out what was going on: a faulty SATA cable.
Once I replaced the cable, the checksum errors were gone.
