What happens in the event of a URE during a resilver?

The title pretty much sums up my question, I’m using 20TB disks and planning my pool layout.

Given the size of the disks I understand that during a resilver it’s inevitable that one will occur.

I’m not sure I fully understand UREs in the context of ZFS resilvers.

Are UREs a specific portion of the disk that can never be read from again?
Or is it a case where an attempt to read a specific portion of the disk didn’t work on the first attempt, but might/would work on a subsequent attempt?

In terms of the actual resilver process, when a URE is encountered what would happen to the resilver process?
Would it stop, and allow me to “try again”? Or would it start resilvering from the beginning again? Or would it just resilver with specific file(s) unrecoverable and those files listed at the end?

Is this handled any differently in TrueNAS vs using ZoL via CLI?

I’m trying to understand this aspect fully so I can plan my pool layout in an informed manner.

EDIT: Thanks for the reply Jim, I can’t seem to find a reply button, so I’m hoping you see this update

I came across this article

Which, I’ll be completely honest, I don’t fully understand (but it made me think I should understand the point being made, given that ZFS requires planning in advance and retrospective changes aren’t easy).

So I’m talking about the disks that make up my pool: if large drives (larger than those discussed in that article) are almost certain to encounter a “URE” as described there during a resilver of any pool layout, what are the implications? That’s what I’m trying to understand from the perspective of ZFS.

Does that make more sense? :grimacing:

“UnRecoverable Error” is a bit of a loose term that can apply in several different places–in this case, you could be referring to the entire pool, or to a single disk. I’m going to assume you mean a URE on a single disk, at the disk level.

If it’s simply a case of bad data, well, from ZFS’ perspective, that’s a simple CKSUM error and easily corrected. But if you’re referring to a bad hardware sector that can never properly store or retrieve data again, it’s going to be up to your drive’s firmware to detect and remap that sector.

All rust drives have a certain number of unused sectors available for exactly this purpose, and the firmware in theory will automatically detect bad hardware sectors and remap a working sector from the drive’s (relatively small) amount of spare capacity into the same virtual address the bad sector occupied.

There is no guarantee, of course, that the remap process will actually work. It frequently, in my experience, doesn’t. But that’s the general idea. If the drive’s firmware doesn’t successfully remap a good sector from its small bank of spare capacity to replace the bad sector, well, then that drive is failed permanently and you must replace it. OTOH, if your drive does successfully detect and remap the sector, well, it’s fine–and depending on the drive model, your OS might not even know it happened.
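If you want to see whether a drive has already started burning through its spare sectors, the SMART counters are the place to look. Here’s a rough Python sketch (my own, nothing TrueNAS-specific): it assumes smartmontools is installed, that the drive is ATA/SATA, and that /dev/sda is just a placeholder for your disk.

```python
# Minimal sketch: report the SMART attributes that track sector remapping.
# Assumes smartctl (smartmontools) is installed and DEVICE is an ATA/SATA disk.
import subprocess

DEVICE = "/dev/sda"  # hypothetical device path; adjust for your system

# smartctl -A prints the drive's SMART attribute table
out = subprocess.run(
    ["smartctl", "-A", DEVICE],
    capture_output=True, text=True, check=False,
).stdout

# Attributes of interest:
#   5   Reallocated_Sector_Ct   - sectors already remapped to spare capacity
#   197 Current_Pending_Sector  - sectors waiting to be remapped
#   198 Offline_Uncorrectable   - sectors the drive could not recover
watch = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")
for line in out.splitlines():
    if any(name in line for name in watch):
        print(line)
```

A steadily climbing Reallocated_Sector_Ct or a nonzero Current_Pending_Sector is exactly the “firmware is remapping (or failing to remap)” situation described above.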

If none of this answers your specific concerns, you’ll need to be more specific, please. :slight_smile:

So in the context of the fact that you’re planning your pool layout, I’ll try to address this at a relatively high level. Keep in mind that because ZFS handles everything from the raw drives through the filesystem, the outcome of a URE under ZFS very likely differs from a traditional RAID array. Classically, traditional RAID controllers would abort a RAID5 rebuild if a URE occurred, because the controller could not determine where that sector fell in relation to the filesystem or LUNs presented.

Hard drives have reliability ratings. One of these is generally called “Nonrecoverable Read Error Rate” or, as you put it, URE (Unrecoverable Read Error) rate. Consumer drives are generally rated at 1 per 10^14 bits read; enterprise drives at 1 per 10^15, ten times better, though as you will see this is mitigated by how they are used. The rating is in bits: converting 1 per 10^14 bits to bytes works out to roughly 1 per 11.37 TiB read, so the rating says you could expect a URE once every 11.37 TiB read on a consumer drive. Since you’re discussing 20TB drives, I’m going to switch my explanation to enterprise-level drives.
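If you want to check that conversion yourself, here’s the arithmetic in a few lines of Python (nothing beyond the spec-sheet numbers):

```python
# Convert a URE spec of "1 error per N bits read" into "one error per X TiB read".
def ure_interval_tib(bits_per_error: float) -> float:
    bytes_per_error = bits_per_error / 8
    return bytes_per_error / 2**40  # TiB

print(f"consumer   (1 per 1e14 bits): one URE per {ure_interval_tib(1e14):.2f} TiB read")
print(f"enterprise (1 per 1e15 bits): one URE per {ure_interval_tib(1e15):.2f} TiB read")
# consumer: ~11.37 TiB, enterprise: ~113.69 TiB
```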

Given that enterprise drives have a rating ten times better, you might say, “then as long as my array is less than 113.7 TiB in size, I’m golden.” Unfortunately that’s not the case, because enterprise drives are generally grouped together to hold data.

To simplify things, let’s talk about six-sided dice. If I have one die, I have a 1 in 6 (16.7%) chance of rolling a six. If I have 2 dice, my chances of rolling at least one six increase to 11 in 36 (about 31%). As I add more dice, it becomes ever more likely that I will get at least one six every time I roll them. This is the equivalent of raidz1 or RAID5, which can survive only one drive loss.

raidz2 or RAID6 is like having to roll two sixes. With 2 dice, my chances of rolling two sixes at the same time are 1 in 36 (2.8%), much better than 1 in 6. With 3 dice it becomes 2 in 27 (7.4%). The probability calculations get messier with additional dice so I won’t go further; suffice it to say, the chances are significantly lower.
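For anyone who wants to play with the dice numbers, here’s the same binomial math as a small Python sketch (purely illustrative, nothing ZFS-specific):

```python
# Probability of "at least one six" (the raidz1-ish case) and
# "at least two sixes" (the raidz2-ish case) as the number of dice grows.
from math import comb

def p_at_least(k: int, n_dice: int, p: float = 1/6) -> float:
    """Probability of rolling at least k sixes on n_dice independent dice."""
    return sum(comb(n_dice, j) * p**j * (1 - p)**(n_dice - j)
               for j in range(k, n_dice + 1))

for n in (1, 2, 3, 6):
    print(f"{n} dice: at least one six = {p_at_least(1, n):.1%}, "
          f"at least two sixes = {p_at_least(2, n):.1%}")
# 2 dice: ~30.6% for one six, ~2.8% for two sixes
# 3 dice: ~42.1% for one six, ~7.4% for two sixes
```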

This also applies to hard drive error rates. As I add more drives, my chance of getting a URE increases; read enough enterprise drives during a rebuild and the chance of hitting a URE can approach that of a single consumer drive. The chances of two UREs affecting the exact same sector on a pair of drives, on the other hand, are something like 1 in 10^30. All of that is not to say one couldn’t hit the lottery, but it is extremely unlikely.
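To put the same idea in drive terms, here’s a rough back-of-the-envelope sketch that treats the spec-sheet rate as a per-bit probability and each bit read as an independent trial. Real drives usually do much better than their rating, so treat these as worst-case ballpark figures, not predictions:

```python
# Chance of hitting at least one URE while reading N full 20TB drives,
# using the spec-sheet rate as an independent per-bit probability.
from math import expm1

DRIVE_BYTES = 20e12           # 20 TB drive
ENTERPRISE_RATE = 1e-15       # 1 URE per 10^15 bits read
CONSUMER_RATE = 1e-14         # 1 URE per 10^14 bits read

def p_at_least_one_ure(drives_read: int, rate: float) -> float:
    bits = drives_read * DRIVE_BYTES * 8
    # 1 - (1 - rate)^bits, via the Poisson approximation 1 - exp(-rate * bits)
    return -expm1(-rate * bits)

for n in (1, 2, 4, 8):
    print(f"reading {n} enterprise drive(s) in full: "
          f"P(>=1 URE) ~ {p_at_least_one_ure(n, ENTERPRISE_RATE):.0%}")
print(f"reading 1 consumer drive in full:        "
      f"P(>=1 URE) ~ {p_at_least_one_ure(1, CONSUMER_RATE):.0%}")
```

Reading four enterprise drives in full comes out roughly in the same ballpark as reading a single consumer drive, which is the point about grouping drives together.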

I’m going to use 20TB (18.19 TiB) enterprise drives in these scenarios. Three of the top-of-mind configs would be (a quick capacity check follows the list):

  1. 4+1 raidz1/RAID5 - 72.76 TiB usable
  2. 4+2 raidz2/RAID6 - 72.76 TiB usable
  3. 3 x 2-drive mirrors - 54.57 TiB usable
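For what it’s worth, those capacity figures work out like this (a rough sketch that ignores ZFS overheads such as metadata, padding, and slop space, so real usable numbers will be a bit lower):

```python
# Usable-capacity arithmetic for the three layouts, 20 TB (decimal) drives in TiB.
DRIVE_TIB = 20e12 / 2**40     # ~18.19 TiB per 20 TB drive

layouts = {
    "4+1 raidz1":        4 * DRIVE_TIB,   # 4 data drives' worth
    "4+2 raidz2":        4 * DRIVE_TIB,   # 4 data drives' worth
    "3 x 2-way mirrors": 3 * DRIVE_TIB,   # one drive's worth per mirror vdev
}
for name, tib in layouts.items():
    print(f"{name}: ~{tib:.2f} TiB usable")
# ~72.76, ~72.76, ~54.57 TiB respectively
```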

We’re going to talk about the likelihood for having a URE that results in data loss (important caveat) during the rebuild after a single complete drive failure.

Scenario 1: we have 4 drives left with no parity to spare. There is a chance we could experience a URE, and because we don’t have any parity left to recover the data from, ZFS will register a data error on the affected file or zvol. That data can’t be recovered without outside intervention.

Scenario 2: we have 5 drives left, with parity still intact. Again there is a chance of a URE, but because we still have parity, ZFS can rebuild that data automagically with no data loss.

Scenario 3 is a little trickier and the subject of many debates. With one mirror drive lost, its partner now has no redundancy. The URE chance is only one per 113.7 TiB read, but we are putting that single drive under significant stress during the rebuild; remember, it takes 22+ hours to read 20TB off a drive at 250MB/sec. If we do hit the lottery and get a URE on the remaining drive, ZFS will register a data error on the affected file or zvol. That data can’t be recovered without outside intervention.
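If you want to check the mirror-rebuild numbers in that scenario, here’s the same back-of-the-envelope math, assuming a sustained 250MB/sec read and the enterprise 1-per-10^15-bit rating (so a rough worst case, not a prediction):

```python
# Mirror-rebuild numbers: time to read one 20 TB drive end to end, and the
# rough chance of hitting a URE on the surviving half during that read.
from math import expm1

DRIVE_BYTES = 20e12
THROUGHPUT = 250e6            # assumed sustained sequential read, bytes/sec
ENTERPRISE_RATE = 1e-15       # 1 URE per 10^15 bits read

hours = DRIVE_BYTES / THROUGHPUT / 3600
p_ure = -expm1(-ENTERPRISE_RATE * DRIVE_BYTES * 8)

print(f"full read of one 20 TB drive @ 250 MB/s: ~{hours:.1f} hours")   # ~22.2 h
print(f"P(>=1 URE) on the surviving mirror half: ~{p_ure:.0%}")         # ~15%
```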

So:

  • Scenario 1: bad… just say no (or possibly say goodbye to your data). FYI, I personally use RAID5/z1 on a NAS that does nothing but store backups, as I feel I have enough data redundancy elsewhere to make up for it.
  • Scenario 2: safest, but not most performant
  • Scenario 3: more performant than scenario 2, but somewhat riskier.

ETA: All of this is also to say: MAKE SURE YOU HAVE BACKUPS! :slight_smile:
ETA2: Also, ZFS will continue the resilver as best it can, just marking errors on any files or zvols it can’t recover properly.