Do checksum errors always originate from a bad drive/bitrot?

EpycFanboi · July 15, 2024, 7:03pm

I understand that backplane and connectivity issues can cause read and write errors, but what about checksum errors in ZFS? As far as I can tell, a checksum error means the data was read from the drive successfully but turned out to be invalid. Are there scenarios (excluding RAM) that could cause successful reads but checksum failures?

mercenary_sysadmin · July 15, 2024, 9:02pm

Literally anything that corrupts data, either on disk or in flight, will result in checksum validation errors. For example:

cosmic ray flips a bit literally on the platter
cosmic ray flips a bit in flight (e.g. as it passes across a SATA cable)
magnetic medium and/or cell charge degradation flips a bit on the device
faulty controller
faulty cable
power issue
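
To make the "successful read, failed checksum" case concrete, here is a minimal Python sketch. It is not ZFS code: SHA-256 stands in for whichever checksum the dataset uses, and write_block, flip_bit and read_block are hypothetical names made up for this illustration. The point is only that the checksum is stored apart from the data, so a read that returns bytes without any I/O error can still fail validation.

```python
import hashlib

def write_block(data: bytes) -> tuple[bytes, bytes]:
    """Return (stored_bytes, checksum); the checksum is kept apart from the data."""
    return data, hashlib.sha256(data).digest()

def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate corruption on disk or in flight by flipping a single bit."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)

def read_block(returned: bytes, expected: bytes) -> str:
    """The device handed back bytes without an I/O error; validate them anyway."""
    if hashlib.sha256(returned).digest() != expected:
        return "checksum error"
    return "ok"

stored, checksum = write_block(b"important data" * 64)

clean_result = read_block(stored, checksum)                    # untouched data
corrupt_result = read_block(flip_bit(stored, 1234), checksum)  # one flipped bit
```

Here clean_result is "ok" and corrupt_result is "checksum error", and it makes no difference whether the bit was flipped by the medium, the cable, the controller or a power glitch.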

In some of the above cases, there are also considerably weaker checksums in play at levels beneath ZFS. For example, a bit flip in a SATA cable will typically be caught by the controller, because a much weaker checksum (a 32-bit CRC) is applied to data as it is sent over the cable. Similarly, there’s a weak checksum in hardware on modern storage devices. Those weak checksums are sufficient to detect most errors and force a re-read, which may or may not give better results. They are definitely not sufficient to rule out collisions, where corrupted data still passes the check, and such collisions can be quite common in badly deranged gear.
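
As a rough illustration of why a short checksum can miss corruption that a long one catches, here is a small Python sketch. The 8-bit additive checksum (sum8, a hypothetical helper) is deliberately much weaker than anything used on a real SATA link or drive; it is only there to make a collision easy to see. SHA-256 stands in for a strong checksum.

```python
import hashlib

def sum8(data: bytes) -> int:
    """Toy 8-bit additive checksum: the sum of all bytes, modulo 256."""
    return sum(data) % 256

original = bytes([0x10, 0x20, 0x31, 0x40])
# Two single-bit flips that cancel out: the first byte gains 1, the third loses 1.
corrupted = bytes([0x11, 0x20, 0x30, 0x40])

# Does each checksum notice that the data changed?
weak_detects = sum8(original) != sum8(corrupted)
strong_detects = (
    hashlib.sha256(original).digest() != hashlib.sha256(corrupted).digest()
)
```

weak_detects comes out False (both sums are 161) while strong_detects is True. Real link-level and device-level checks are far better than this toy, but the more errors flaky hardware produces, the more chances one of them has to slip past a short checksum.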

phoenix · July 19, 2024, 6:02pm

I had a case with a raidz1 volume with 3 disks. The scrub regularly revealed checksum errors. It took me almost a month to figure out what was going on: a faulty SATA cable.
Once I replaced the cable, the checksum errors were gone.