“zpool create”: Should I attempt to get the documentation changed?

Why?

https://openzfs.github.io/openzfs-docs/man/master/8/zpool-create.8.html
EXAMPLES

zpool create tank raidz sda sdb sdc sdd sde sdf

https://ubuntu.com/tutorials/setup-zfs-storage-pool#3-creating-a-zfs-pool
To create a striped pool, we run:
sudo zpool create new-pool /dev/sdb /dev/sdc

https://man.freebsd.org/cgi/man.cgi?zpool-create(8)
EXAMPLES
Example 1: Creating a RAID-Z Storage Pool
The following command creates a pool with a single raidz root vdev that
consists of six disks:
# zpool create tank raidz sda sdb sdc sdd sde sdf

Debian gets this right, recommending you use persistent IDs such as wwn or scsi, or, if you create a GPT partition, the UUID.

https://wiki.debian.org/ZFS
raidz2 pool (similar to raid-6, ≥ 4 disks, 2 disks redundancy)
zpool create tank raidz2 scsi-35000cca2735cbc38 scsi-35000cca266cc4b3c scsi-35000cca26c108480 scsi-35000cca266ccbdb4

Arch even goes so far as to state:

https://wiki.archlinux.org/title/ZFS#Creating_a_storage_pool
Identify disks
OpenZFS recommends using device IDs when creating ZFS storage pools of less than 10 devices[2]. Use Persistent block device naming#by-id and by-path to identify the list of drives to be used for ZFS pool.
The disk IDs should look similar to the following:
$ ls -lh /dev/disk/by-id/
lrwxrwxrwx 1 root root 9 Aug 12 16:26 ata-ST3000DM001-9YN166_S1F0JKRR -> ../../sdc
lrwxrwxrwx 1 root root 9 Aug 12 16:26 ata-ST3000DM001-9YN166_S1F0JTM1 -> ../../sde
lrwxrwxrwx 1 root root 9 Aug 12 16:26 ata-ST3000DM001-9YN166_S1F0KBP8 -> ../../sdd
lrwxrwxrwx 1 root root 9 Aug 12 16:26 ata-ST3000DM001-9YN166_S1F0KDGY -> ../../sdb

If OpenZFS recommends you use persistent IDs, why doesn’t their own documentation reflect this?

I use Debian, BTW. When I created my big pool I used wwn to identify the disks, and it worked exactly as expected.

root@Heavy:~# zpool create ocean raidz2 wwn-0x5000cca2ad1aaff8 wwn-0x5000cca2ad1aca44 wwn-0x5000cca2ad1aed0c wwn-0x5000cca2ad1af534 wwn-0x5000cca2ad1af928 wwn-0x5000cca2ad1afe4c wwn-0x5000cca2ad1afef4 wwn-0x5000cca2ad1b0318

I recently made a small 2 TB striped pool from two 1 TB disks. I wanted to give ZFS the whole disks, no partitions. If this pool did not survive a reboot that was OK, so as an experiment I tried the method from the OpenZFS documentation.

root@Heavy:~# zpool create puddle /dev/sda /dev/sdi

I was thinking that maybe, once the pool was made, the IDs used to create it would not matter anymore.

When I rebooted, the drive designations were reshuffled and the pool just disappeared. I then recreated the pool using wwn, and all is well.

But how many new users have followed the documentation to the letter, had this happen, and lost data through no fault of their own?

This is a trap.

Is this "drive path re-shuffle on reboot” problem unique to Linux? is that why generic OpenZFS documentation is written this way? How do we explain Ubuntu?

I am an avionics technician. At one of the shops I worked at there was an old sign above the door out to the hangar. It was dirty and faded. 1980s? 1970s? The principle is still solid:

“Follow the spec or cause it to be changed”

It is a solid principle to follow; sometimes the spec is wrong. Once that is identified, it is your responsibility to take steps leading to a correction. This could prevent future technicians from making mistakes that could lead to loss of life.

If this is not unique to Linux I really think this documentation should be changed.

I want to get in contact with the maintainer of this documentation and make recommendations, but first I want to make sure it’s the right move; my use case is not the same as everyone else’s.


The only reason so many people use sdX in doc examples is that it’s so short and everybody knows what it means. With that said, I agree with you entirely that it reinforces bad habits, and I would encourage you to work on pull requests to update docs wherever you have the energy.

I try to write my own examples in the form “zpool create tank mirror /dev/disk/by-id/wwn-001 /dev/disk/by-id/wwn-002” for exactly this reason, and have been for a few years now. But you can see how it’s both more verbose and more clunky just to get typed out in the first place! :slight_smile:


In my notes I always use continuation lines to list the drives, and often the options as well.

zpool create \
    -o some_property=value \
    tank mirror \
    /dev/disk/by-id/wwn-001 \
    /dev/disk/by-id/wwn-002

Still a bit clunky but a lot more readable.

That seems wrong. I thought that ZFS searched the drives to identify pools.

I like using the vdev_id.conf method … Create a file /etc/zfs/vdev_id.conf with a set of aliases that map a nice/friendly name to an actual device. I use the by-path links, since the internal disk connections rarely change.

For example

alias ab-1 /dev/disk/by-path/pci-0000:81:00.0-sas-phy4-lun-0
alias ab-2 /dev/disk/by-path/pci-0000:81:00.0-sas-phy5-lun-0
alias ab-3 /dev/disk/by-path/pci-0000:81:00.0-sas-phy6-lun-0

alias cd-1   /dev/disk/by-path/pci-0000:00:1f.2-ata-1
alias cd-2   /dev/disk/by-path/pci-0000:00:1f.2-ata-2
alias cd-3   /dev/disk/by-path/pci-0000:00:1f.2-ata-3
alias ef-1   /dev/disk/by-path/pci-0000:00:1f.2-ata-4
alias ef-2   /dev/disk/by-path/pci-0000:00:1f.2-ata-5
alias ef-3   /dev/disk/by-path/pci-0000:00:1f.2-ata-6

If you edit that file, just run sudo udevadm trigger to recreate the links.

The zpool create becomes something like

zpool create \
    -o some_property=value \
    tank mirror \
    /dev/disk/by-vdev/ab-1 \
    /dev/disk/by-vdev/ab-2
      :

and a zpool status -v shows similar to

        NAME        STATE     READ WRITE CKSUM
        bondi2      ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ab-1    ONLINE       0     0     0
            ab-2    ONLINE       0     0     0
            ab-3    ONLINE       0     0     0
            cd-1    ONLINE       0     0     0
            cd-2    ONLINE       0     0     0

In rack-mount boxes those aliases map to the drive slot number. Makes it easy to find exactly which drive is having issues …

Yes, that seems wrong to me also; some other issue. I used to be much more worried about disk names before I realized I could change them with an export/import -d.
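
For anyone following along, that rename is just an export followed by an import with -d (the pool name here is only a placeholder):

zpool export tank
zpool import -d /dev/disk/by-id tank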

I suspect that being sure to use multipath devices instead of their component paths does matter, but that’s only if you have them. So I wonder if part of what’s going on with the documentation is that anything more specific than sd[abc] ends up being insufficiently general.

Does zpool create accept devices from subshell output?

zpool create mypool $(ls -n /dev/disk/by-id | grep wwn | grep -E "sdb|sdc" | awk '{print "/dev/disk/by-id/"$9}')

There’s probably a cleaner way of doing that, but it’s just a thought.

Edit: This assumes the disks you’re going to use aren’t partitioned yet.
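
A slightly cleaner way to build the same list, with the same unpartitioned-disk assumption, is to resolve each by-id link instead of parsing ls output:

for link in /dev/disk/by-id/wwn-*; do
    # keep only the wwn links that point at the disks we want
    case "$(readlink -f "$link")" in
        /dev/sdb|/dev/sdc) echo "$link" ;;
    esac
done

That loop’s output can be fed to zpool create through $( ) in exactly the same way.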

That’s a super interesting approach. At $dayjob we use a drive-slot-to-serial-number map, then use that as input to a script generating GPT partitions with the label “slotN”.
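
The labelling step itself is roughly the following (sgdisk is just one way to do it, and $DISK plus the slot name come from that slot-to-serial map):

# create a single whole-disk GPT partition named after the slot
sgdisk --zap-all "$DISK"
sgdisk -n 1:0:0 -c 1:"slot7" "$DISK"
# the partition then shows up as /dev/disk/by-partlabel/slot7 for zpool create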

Doesn’t that tie the disk to the slot? As in, you can’t move a disk from slot to slot? And replacement disks have to have the correct slot label?

I mean, it’s 100% deterministic which is good.

Is using partition IDs also an accepted method, as found here:

You do not think pools should disappear if a component drive of the pool moves from /dev/sda to /dev/sdg on reboot?

I need to know if this is true. If this is not the expected behavior, the documentation is fine and I need to file a bug report instead.

This is not a one-off quirk of me or my hardware. This behavior has been noted by many ZFS users in many environments, and in each case the wisdom of the community is to create the pool using one of the many unchanging identifiers. People seem to prefer one static identifier or another for various reasons, but the trend is that if a static identifier is used, the pool survives reboot with no further action required.

https://www.reddit.com/r/zfs/comments/j62lme/zfs_pool_disappeared_after_reboot/

That is a lot of lost man-hours and potentially lost data; we (or I) should do something to fix this.

I’d expect a pool might disappear in the sense that zpool import -c /etc/zfs/zpool.cache ... might fail to re-import the pool automatically on boot, but you should be able to see the pool with a zpool import and re-import it with zpool import POOLNAME.
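
Using the pool name from earlier in the thread, that would look something like:

zpool import                               # scan attached disks for importable pools
zpool import puddle                        # re-import it by name
zpool import -d /dev/disk/by-id puddle     # or re-import it recording the by-id paths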

In my experience, ZFS doesn’t care if the pool drives change where they are. HOWEVER, you still definitely want to use wwn-id instead of barename if you ever have more than one pool which could potentially have the same name.

I once put a drive that had once been a member of a pool called “data” into a system which already had a live pool named “data”. I was using barenames at the time, and it was fine running that way for a couple of years, until I temporarily added another drive to the system, and that triggered a shuffling of the barenames of the drives involved.

When the system came back up, instead of trying to mount my real pool “data” it tried to mount the ancient pool “data” that my mule drive had once been a member of, and I had nearly an hour of panic, because I’d also decided, during the same evolution, to wipe out my backup system after having carefully verified that prod was good, planning to immediately run another backup.

Everything turned out fine, once I finally realized that the “data” being mounted was not the “data” that I expected it to be. But seriously, that was an AWFUL day, and that’s how I learned to never use bare drivenames that might be screwed with by the system’s firmware!
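
If anyone else ends up in that spot: a plain zpool import lists every candidate pool along with a numeric ID, and you can import a specific one by that ID and rename it at the same time (the ID below is made up):

zpool import
zpool import 1234567890123456789 data-old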

As a brand new user who (now famously) boned up my initial ZFSBootMenu install, well… I must admit that I did just as the documentation says and created my initial pools with /dev/sdX designations: that would be my main pool on my SSD and my data pool on my HDD. I recently added a third HDD, for which I used /dev/disk/by-id/wwn-xx. I’m still in the process of migrating my data to the new system. I’m wondering if I should wipe the whole thing clean and start over to avoid any issues that may come up. Would any of you recommend this? Or, if it does happen, will I be able to recover by re-importing the pool as quartsize suggests?

There’s no need to wipe the data. If the disks rearranging is a concern, you can do as you mentioned and export/import the pool to change the disk names.

zpool export <pool-name>
zpool import -d /dev/disk/by-id/ <pool-name>

This is how I have always seen it explained in docs: they use sda, sdb, etc., then mention how and why not to do that.

Hope this helps you.


Yep, this works. If you’ve got ZFS on root, though, you can’t do it without booting into an alternate system (e.g. an Ubuntu installer) and importing the pool there first, because you can’t export a pool with mounted datasets, and you can’t unmount your root filesystem.
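
From the live environment, the dance goes roughly like this (the pool name rpool is just an example, and -N skips mounting the datasets):

zpool import -f -N -d /dev/disk/by-id rpool
zpool export rpool
# reboot into the installed system; the vdevs should now be recorded by their by-id paths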


I installed a new pair of disks in my TrueNAS SCALE server today, accidentally created a mirror on the wrong pool, deleted that mirror, and then had to reboot the server before I could add the mirror to the correct pool because TrueNAS needed to clear its caches. Or it needed a quick nap. The “reboot please now” message was sort of vague on the details, which added to my terror rather nicely.

When it came back up, it had reshuffled the /dev/sdX labels of the disks, including those that were already in mirrors in pools.

Fortunately, the only hiccup this caused me was a minor panic attack before I checked zpool status and saw that it was using disk UUIDs (/dev/disk/by-id/$impressive-incomprehensible-number) to keep track of which disk belonged to which mirror and pool.

Unfortunately, when manipulating the disks and mirrors in the TrueNAS SCALE web GUI, it uses /dev/sdX notation. This was especially fun after it reshuffled them on reboot while I was building a new mirror.

Luckily, it also exposes their serial numbers, so I was able to figure out what was going on since I label all the trays/drive doors externally on the NAS with the model number and serial number of a drive as soon as I install it, but it made the whole thing much more involved and confusing than necessary.

I’m not even sure it really counts as a bug, as a ton of ZFS on Linux documentation and guides warn you that you can’t depend on the /dev/sdX labels staying consistent, ever.

I encourage you (as someone who is not a part of that project, admittedly) to file a bug regarding that.

Gladly. It greatly complicated an otherwise simple task, and I’d have been in a much more difficult spot if I hadn’t taken the time to get out my label maker and properly label which disk was in which bay before I inserted any of them.

What is the specific thing I should be asking them for? To expose the … UUID(?) … instead? (What is the actual data that gets exposed by /dev/disk/by-id/? I’ve never really thought about it beyond knowing it was unique and didn’t change.)

I’ll definitely describe how confusing it is when Linux decides to redo all the sdX labels mid-pool management operations.

What is the specific thing I should be asking them for? To expose the … UUID(?) … instead?

Personally, I like the WWN ID. But a lot of people prefer the ATA identifier, and some prefer manually created disk labels.

Essentially, anything would be better than the current practice of relying on raw devicenames. The big argument for the ATA identifier is that it generally shows you the model of drive; the big argument against it is that it tends to result in horizontally-scrolling nightmare zpool statuses, which is one of the reasons I prefer WWN.

If you’re not familiar with WWN IDs, they’re baked into a drive’s firmware at the factory, are globally unique, and all manufacturers implement them in the same format. ls -l /dev/disk/by-id | grep -i wwn if you want to see what I’m talking about.
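
And if you want to tie those WWNs back to the serial numbers printed on the drives themselves, lsblk can show both side by side:

lsblk -d -o NAME,WWN,SERIAL,MODEL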
