ZFS expansion weirdness

I’m new to the forum, but didn’t spot a duplicate based on a quick search. Hopefully this isn’t a FAQ, but it might be one in the future - I can’t be the only person that has this question.

I came here via the NixOS forum, where I posted effectively this same question; I’ll repeat a summary here.

I’m playing with the recent support for ZFS RAIDZ expansion (zpool attach), which lets you add disks to an existing RAIDZ array.

I’m playing in a VM, where I can rapidly re-provision things - it’s super nice. So I set myself up with a basic NixOS install and added 3 blank 10G data drives.

These drives appear as /dev/vda, /dev/vdb, and /dev/vdc.

So we can make a ZFS RAIDZ array by doing…

$ sudo zpool create dpool raidz vda vdb vdc
$ df -h /dpool/
Filesystem      Size  Used Avail Use% Mounted on
dpool            20G  128K   20G   1% /dpool

Awesome – 20G of storage on top of 3 x 10G drives - I’m giving up one drive’s worth of space for parity.
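
Just to spell out the arithmetic behind that 20G (my own back-of-the-envelope, nothing ZFS actually prints):

$ echo '3 * 10 * 2/3' | bc -l   # 3 disks, 2 data : 1 parity -> ~20G usable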

Now, because I can quickly burn this to the ground and start again, I’ll do that and get myself back to having 3 x 10G blank drives… This time I’m going to create a very small ZFS RAIDZ with just 2 of the drives, then add a 3rd.

$ sudo zpool create dpool raidz vda vdb

$ df -h /dpool/
Filesystem      Size  Used Avail Use% Mounted on
dpool           9.5G  128K  9.5G   1% /dpool

No surprise here - but now we add the 3rd drive:

$ sudo zpool attach dpool raidz1-0 vdc

$ df -h /dpool
Filesystem      Size  Used Avail Use% Mounted on
dpool            15G  128K   15G   1% /dpool

What? Why only 15G and not 20G?

Above was my original question - but I think I’ve since solved it for myself, so my question for this forum is: are my conclusions correct?

Here is what I concluded.

The amount of free space shown is just an estimate, and that estimate is based on the storage ‘efficiency’ (the data-to-parity ratio) of the RAIDZ vdev as it was originally created.
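
To make that concrete, here is some back-of-the-envelope arithmetic (my numbers, not ZFS’s actual accounting): a 2-wide RAIDZ1 stores 1 data block for every 2 raw blocks, a 3-wide stores 2 for every 3. If roughly 30G of raw space is free after the attach, the old ratio predicts about 15G of usable space, while data written after the expansion actually lands at the new ratio:

$ echo '30 * 1/2' | bc -l   # old 2-wide ratio: ~15G, which is what df reports
$ echo '30 * 2/3' | bc -l   # new 3-wide ratio: ~20G, what new writes actually get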

For me this article was key to understanding what was going on.

By way of explanation, let me use a more extreme example: a 4 x 10G RAIDZ1, which, if created up front, would give me 29G of usable storage. What if we build that same 4-drive array, starting with 2 drives and then growing to 4 total?

$ sudo zpool create dpool raidz vda vdb

$ df -h /dpool/
Filesystem      Size  Used Avail Use% Mounted on
dpool           9.5G  128K  9.5G   1% /dpool

$ sudo zpool attach dpool raidz1-0 vdc

$ df -h /dpool/
Filesystem      Size  Used Avail Use% Mounted on
dpool            15G  128K   15G   1% /dpool

$ sudo zpool attach dpool raidz1-0 vdd

$ df -h /dpool/
Filesystem      Size  Used Avail Use% Mounted on
dpool            20G  128K   20G   1% /dpool

Now let’s fill that pool up with random data (to defeat compression).

# head -c 35G /dev/urandom > /dpool/big_file
head: error writing 'standard output': No space left on device

# ls -l /dpool/
total 20079455
-rw-r--r-- 1 root root 30762213376 Oct 23 08:45 big_file

# ls -lh /dpool/
total 20G
-rw-r--r-- 1 root root 29G Oct 23 08:45 big_file

Neat - so I have a filesystem that reports 20G of capacity, with a 29G file on it. ZFS expansion works! You just get a bad estimate of how much free space you have.

Good advice: if you can build the RAIDZ with the right number of disks up front, do so. Expansion does work, but you pay a small tax: the ratio of usable space to raw space improves with more disks, and expansion doesn’t rewrite previously written data at the new ratio.
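
To put a rough number on that tax (my arithmetic, on the assumption that blocks keep the data-to-parity ratio they were written with): 10G of data written while the vdev was still 2-wide occupies 20G of raw space at 1:1, while the same 10G written after expanding to 4-wide would only occupy about 13.3G at 3:1 - and the difference is only reclaimed if the data gets rewritten.

$ echo '10 * 2/1' | bc -l   # 10G written pre-expansion at 1:1 -> 20G raw
$ echo '10 * 4/3' | bc -l   # 10G written post-expansion at 3:1 -> ~13.3G raw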

Any further clarifications, comments, or pointers to FAQs are appreciated. While I think this is correct, I would love to better understand this.

My real-world example: I have 3 brand new 8TB drives, and 1 more 8TB drive currently being used for parity in another system. I’m deciding whether to build out with 3 drives and expand later when I decommission the old system (after copying its data off), or to run the old system without parity and build a complete 4-drive array now.. decisions, decisions..

I’m not sure this gets to the heart of what you’re after (and sorry not to have much useful to say in that; expanding raidz isn’t something I’ve had call to dabble in), but as to the bad estimation of size, I’d avoid using df for these purposes. Either zpool list or zfs list <dataset> should give you more accurate stats.
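
For example, with your pool name dpool: zpool list -v breaks the raw, parity-inclusive space down per vdev, and zfs list -o space gives a more detailed view of what the dataset thinks is used and available.

$ zpool list -v dpool       # raw space, broken down per vdev
$ zfs list -o space dpool   # dataset-level used/available breakdown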


The cool thing about using a VM to experiment with is that I can quickly tear things down and re-create them.

Let’s first build a 4 x 10G array - but starting with 2, then adding 2 more sequentially

$ sudo zpool create dpool raidz vda vdb

$ sudo zpool attach dpool raidz1-0 vdc

$ sudo zpool attach dpool raidz1-0 vdd

$ zfs list
NAME         USED  AVAIL  REFER  MOUNTPOINT
dpool        140K  19.1G    24K  /dpool

$ zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
dpool  39.5G   354K  39.5G        -         -     0%     0%  1.00x    ONLINE  -

$ df -h /dpool/
Filesystem      Size  Used Avail Use% Mounted on
dpool            20G  128K   20G   1% /dpool

It seems that df and zfs list agree.

However, there is clearly more space available than either of these metrics claims. Let’s go make a 25G file.

$ sudo -i

# head -c 25G /dev/urandom > /dpool/big-file

# zfs list
NAME         USED  AVAIL  REFER  MOUNTPOINT
dpool       16.7G  2.43G  16.7G  /dpool

# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
dpool  39.5G  33.4G  6.08G        -         -     1%    84%  1.00x    ONLINE  -

# df -h /dpool
Filesystem      Size  Used Avail Use% Mounted on
dpool            20G   17G  2.5G  88% /dpool

# ls -lh /dpool/
total 17G
-rw-r--r-- 1 root root 25G Oct 26 10:30 big-file

Ok - wow.. now I’m more confused. Is compression that good? I can fit 25G of random data into 17G of storage!

Let’s go further down the rabbit hole - I’m going to re-create the 4x10G array - but all at once this time

$ sudo zpool create dpool raidz vda vdb vdc vdd

$ zfs list
NAME         USED  AVAIL  REFER  MOUNTPOINT
dpool        145K  28.6G  32.9K  /dpool

$ zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
dpool  39.5G   194K  39.5G        -         -     0%     0%  1.00x    ONLINE  -

$ df -h /dpool/
Filesystem      Size  Used Avail Use% Mounted on
dpool            29G  128K   29G   1% /dpool

Now we can compare the zfs list output between the two creation styles - there is a big difference in AVAIL (19.1G vs 28.6G), and df shows the same gap.
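
Those AVAIL numbers line up, roughly, with the raw pool size multiplied by the data fraction of the vdev as it was originally created (my arithmetic - I’m ignoring ZFS’s slop space and metadata reservations, which presumably account for the remaining gap):

$ echo '39.5 * 1/2' | bc -l   # created 2-wide: ~19.75G vs the 19.1G AVAIL shown
$ echo '39.5 * 3/4' | bc -l   # created 4-wide: ~29.6G vs the 28.6G AVAIL shown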

What if we create a similar 25G file?

$ sudo -i

# head -c 25G /dev/urandom > /dpool/big-file

# zfs list
NAME         USED  AVAIL  REFER  MOUNTPOINT
dpool       25.0G  3.64G  25.0G  /dpool

# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
dpool  39.5G  33.4G  6.08G        -         -     1%    84%  1.00x    ONLINE  -

# df -h /dpool
Filesystem      Size  Used Avail Use% Mounted on
dpool            29G   25G  3.7G  88% /dpool

# ls -lh /dpool/
total 25G
-rw-r--r-- 1 root root 25G Oct 26 10:41 big-file

Uh.. wow.. big difference from zfs list here - now it shows that I did use 25G of storage, and that in theory I have even more available.. but given the earlier discrepancy I’m starting to doubt what these numbers are actually telling me. zpool list shows a consistent story between the two runs - so that’s good, I guess.

Maybe all I’ve proven with this experiment is that df and zfs list give pretty similar answers, and that zpool list looks under the covers and tells you what raw space is actually occupied - but you can’t tell how much usable filesystem space you really have from that number (because of parity).
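
If my ‘original efficiency’ guess from earlier is right, the numbers above do roughly reconcile (my arithmetic, so treat it as a sketch): the 25G file written on the expanded pool at a 3:1 data-to-parity ratio should occupy about 33.3G of raw space, which matches the 33.4G ALLOC from zpool list, and deflating that by the original 2-wide 1:1 ratio gives roughly the 16.7G that zfs list reported as USED.

$ echo '25 * 4/3' | bc -l         # 25G of data plus 1 parity per 3 data -> ~33.3G raw (zpool ALLOC)
$ echo '25 * 4/3 * 1/2' | bc -l   # deflated by the original 1:1 ratio -> ~16.7G (zfs USED)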

That is an interesting finding. I’m on the couch sick right now, so the details are foggy, but one of Michael Lucas and Allan Jude’s ZFS books explains how ZFS and df look at usage differently. They’re great reads, in case you have a chance to check them out.

It is weird that more data is being written to not-enough space. I wonder if there’s something about the virtual disk creation that’s allowing that. Maybe some more coherent folks will have some ideas.

@tvcvt I hope you feel better soon.

To be clear, this all came about as I was experimenting with ZFS expansion. I don’t think it’s reasonable to create a 4-disk RAIDZ1 by starting with only 2 drives and expanding; I just noticed the odd free space reporting along the way.

I believe I have experimental evidence (shown above) that if you create an empty 4-disk array by starting with 2 drives and then adding 2 more sequentially, you do in fact get nearly the same usable storage space as you would by creating a 4-disk array all at once. I don’t think this is about virtual disks; it’s just how free space is reported when you expand by adding more disks.

Again, my question is really: am I coming to any incorrect conclusions here? Yes, it is always better to create the RAIDZ1 in its final state - but if you do have to expand by a disk, you get pretty close to the same total storage (the free space reporting just gets a bit.. odd).

As far as I know this is a known bug/feature: ZFS estimates free space (in terms of actual data that can be stored) using the original data:redundancy ratio rather than the new one, even though everything newly written will consume free space at the new ratio.

So, if I understand this correctly (and I may not … d’oh), a 2x RAIDZ1 has a 1:1 data:redundancy ratio. When you expand it to a 4x RAIDZ1, all new data is written at a 3:1 ratio rather than 1:1, so you can fit 1.5x as much data into the same raw free space as you could at 1:1.

Supposing you have 2TB of actual blocks free (according to zpool list): if ZFS used 3:1 it would report that as 1.5TB of usable space free, while at 1:1 it reports it as 1TB usable.

So I believe this is why, after expansion from a 2x to a 4x RAIDZ1, ZFS effectively reports free space as about 33% less than it should.
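
Putting numbers on that example (and again, I may have the ratios wrong):

$ echo '2 * 3/4' | bc -l                # 2TB raw at 3:1 data:parity -> 1.5TB of data
$ echo '2 * 1/2' | bc -l                # 2TB raw at the original 1:1 ratio -> 1TB reported
$ echo '(1.5 - 1) / 1.5 * 100' | bc -l  # the reported figure is ~33% lower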

Yes, that was my conclusion as well. I’ll tag your answer as the ‘solution’ and come back and fix it if someone comes forward with a better answer.

I think the ZFS team has denied that it is a bug, i.e. it’s a feature they are not going to fix. But…

Meh - in the end, with ZFS the best you will get is an estimate of free space, and that estimate is pretty much always somewhat wrong, especially with compression on by default. Still, it would be nice if the estimate were a bit closer - but the right number varies so much with how much of the storage was used before you added a disk that it’s probably hard to pick any single value or calculation.

My scenario above was absolutely an edge case. Mostly my concern was whether I would be at a huge disadvantage if I opted to expand my array. The answer is no.

It’s best to build out the array with as many drives as you need from the start, but expansion is just another tool in the toolbox that will help you meet future needs.