[Sanity Check] MariaDB VM on ZFS-Based Host Storage - Optimizing Volblocksize/Recordsize for MariaDB? (Am I doing this right?)

Hello,

I’m running Proxmox with my VMs and LXC containers stored on ZFS. I’ve set up a Debian 12 VM, but I haven’t installed MariaDB yet, or even partitioned/formatted the non-OS virtual SCSI disks inside the VM. I think I just caught myself about to make a mistake and wanted a second opinion.

I’ve got a Debian 12 VM with storage set up as listed below. I’m taking a swing at the four-disk approach for storing MariaDB data on ZFS that a DB admin suggested I experiment with, keeping in mind that each virtual SCSI disk is a zvol:

  1. scsi0: OS boot disk (volblocksize 64k)
  2. scsi1: MariaDB database data disk (volblocksize 16k)
  3. scsi2: MariaDB binlog disk (volblocksize 1M)
  4. scsi3: MariaDB log file disk (volblocksize 1M)
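
For reference, the host-side equivalent of what each of these disks ends up as is roughly the following. The pool and disk names are placeholders rather than my actual layout, and as I understand it volblocksize can only be set when the zvol is created:

# e.g. the 16k data disk (scsi1) as a zvol on the host
zfs create -V 16G -o volblocksize=16k rpool/data/vm-101-disk-1
zfs get volblocksize,volsize rpool/data/vm-101-disk-1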

I set this up and felt pretty good about it, and then I realized that the best-performance values in the documentation I read were given as recordsize, assuming the data is stored directly in a dataset.

My VM disks are zvols, which have a volblocksize instead, and that is not at all the same thing.
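
Concretely, the difference I mean is that on a dataset I could do this at any time, whereas the volblocksize above is locked in the moment each zvol is created (dataset name is hypothetical):

# recordsize is just an upper bound on a dataset and can be changed whenever
zfs set recordsize=16k rpool/data/mariadb
zfs get recordsize rpool/data/mariadb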

So, what’s the best practice here?

Do I go ahead with the volblocksizes as indicated above? Or am I headed completely in the wrong direction, since I just realized as I type this that I’ve seen 64k recommended as a general volblocksize for QEMU virtual disks?


16K is still correct for the db data.
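
(That 16K lines up with InnoDB’s default 16KiB page size. Once MariaDB is installed, you can confirm it from inside the VM with something like:)

mariadb -e "SELECT @@innodb_page_size;"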

I’m not sure for the logs, honestly. On a dataset, yes you’d definitely want recordsize=1M. But I suspect that is a bit aggressive for a zvol.

Is Maria using these zvols raw, or are you formatting them ext4 and mounting them as filesystems?

The virtual machine sees all the zvols (virtual SCSI disks) as (currently unformatted) block devices inside the VM. The VirtIO SCSI driver exposes each virtual disk to the VM as an SSD with a logical/physical block size of 512 bytes.

I was going to format them as ext4, since I don’t really need ZFS inside the VM and I’m trying to keep its RAM footprint down on the host. (The virtualized Debian 12 itself is running on an ext4-formatted partition inside the VM.) I have no experience with filesystems other than ext4 and ZFS. Roughly what I have in mind is sketched after the parted output below.

Here’s what things look like inside the VM. (I’d prefer not to have that swap partition, but I can fix that later.)

root@memory-alpha2:~# parted -l
Model: QEMU QEMU HARDDISK (scsi)
Disk /dev/sda: 68.7GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags: 

Number  Start   End     Size    File system     Name  Flags
 1      1049kB  538MB   537MB   fat32                 boot, esp
 2      538MB   67.7GB  67.2GB  ext4
 3      67.7GB  68.7GB  1023MB  linux-swap(v1)        swap


Error: /dev/sdb: unrecognised disk label
Model: QEMU QEMU HARDDISK (scsi)                                          
Disk /dev/sdb: 17.2GB
Sector size (logical/physical): 512B/512B
Partition Table: unknown
Disk Flags: 

Error: /dev/sdc: unrecognised disk label
Model: QEMU QEMU HARDDISK (scsi)                                          
Disk /dev/sdc: 17.2GB
Sector size (logical/physical): 512B/512B
Partition Table: unknown
Disk Flags: 

Error: /dev/sdd: unrecognised disk label
Model: QEMU QEMU HARDDISK (scsi)                                          
Disk /dev/sdd: 2147MB
Sector size (logical/physical): 512B/512B
Partition Table: unknown
Disk Flags:
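
For what it’s worth, this is roughly the formatting I have in mind, assuming /dev/sdb ends up being the 16k data disk. Device names and labels are just my working assumptions; I haven’t run any of this yet:

# one big partition per disk, then ext4 with a label
# (example for the data disk; repeat for sdc/sdd with their own labels)
fdisk /dev/sdb                  # n, accept defaults for a single partition, then w
mkfs.ext4 -L mariadb-data /dev/sdb1
e2label /dev/sdb1               # confirm the label took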

EDIT: Long term, I’ll move the database storage to a dataset on an actual TrueNAS server and mount it in the VM via NFS. At the moment, my shared storage isn’t ready to go, so I’m trying to understand what the recommended setup is supposed to be when everything’s in the VM. I might not have been clear on that earlier. Sorry. :slight_smile:

So, recordsize=1M is a solid recommendation for the db logs if you’re running MySQL directly on a ZFS filesystem. I honestly don’t know when it comes to an ext4 filesystem sitting on a zvol. It should work well for the actual database workload, but unlike a ZFS filesystem with recordsize=1M, an ext4 filesystem on a zvol with volblocksize=1M will be using 1MiB blocks for everything, metadata included; every single inode will eat an entire 1MiB on disk, and I’m just not entirely sure of the implications.

I’d file that under “it might be okay, but I’d definitely trial it very carefully before blindly putting it in production,” personally. If I were putting something into production on zvols without very careful trialing first, I’d just use a 64K zvol for everything except the db back end, and a 16K zvol for the db back end.
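
If you do trial it, one quick sanity check is to watch logical versus actual space on the host-side zvols as data lands on them (dataset names below are placeholders):

# watch for space amplification on the 1M zvols as the VM writes to them
zfs get volblocksize,used,logicalused,compressratio rpool/data/vm-101-disk-2
zfs get volblocksize,used,logicalused,compressratio rpool/data/vm-101-disk-3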


Thank you for explaining it this way. :slight_smile: This is exactly what I needed.

I find myself continuously struggling because so much ZFS documentation implicitly assumes the correct setup (e.g., everything for the database being stored on an actual bare metal ZFS dataset where the recordsizes are independently controllable), or that the reader is a professional ZFS admin who understands the unspoken rules well enough to handle the edge cases themselves.

(And Proxmox’s documentation is aggressively hands-off about the peculiarities of using ZFS with Proxmox while really wanting you to use ZFS, so there’s that.)

I’ve had people on other forums flame me for wasting their time with “premature optimization” when I try to figure out all the unwritten rules of interaction between various data structures and storage schemes like this, which has not been fun. Like I said, I really appreciate that this is a place where these sorts of questions are welcome.

I know the standard answer in a lot of places is “experiment and test,” but good experimentation requires a solid understanding of the thing you’re trying to experiment on, and it’s so easy to get lost in the undocumented magic weeds. I’ll admit without hesitation that I don’t yet have the skills to meaningfully do that kind of testing in an efficient way. Getting some sort of process for that going is actually on my to-do list.

(For example: I’ve now spent 3.5 hours trying to bring up this MariaDB VM. Half an hour of that was finding a current MariaDB tutorial and setting up the VM with a base Debian 12 install, and 3 hours was trying to research how to implement the storage on ZFS-powered Proxmox before I remembered I could ask here. :slight_smile: )

In the narrowest sense, they’re likely correct: very, very few MySQL/MariaDB installations will need this level of tuning, and you didn’t lead with anything that indicates this will be an extremely high load high performance system.

With that said, I don’t know that you aren’t setting this up for an extremely heavy-duty use case, and they were perfectly valid questions. I’d rather do my best to answer them than to say “don’t bother,” let alone leave you feeling flamed. :slight_smile:


> In the narrowest sense, they’re likely correct: very, very few MySQL/MariaDB installations will need this level of tuning, and you didn’t lead with anything that indicates this will be an extremely high load high performance system.

As a home hobbyist user, it seems reasonable to me that you might want to stuff everything into a VM, but I keep having to remind myself that even now ZFS still isn’t really aimed at the home hobbyist, and that it’s a bit unusual even for my eventual plans to use it for my home office production file server and production virtualization machine.

I need to do a better job asking questions about optimization-type topics. Maybe I’m not using the right word? My aim is always not to introduce performance degradation because I don’t know what I’m doing. I don’t need to make it go faster than it otherwise would; I just want to make sure I’m not introducing avoidable slowdowns or otherwise creating problems for myself (like accidentally wasting expensive disk space by using 1M volblocksizes instead of recordsizes). :wink:

> With that said, I don’t know that you aren’t setting this up for an extremely heavy-duty use case, and they were perfectly valid questions. I’d rather do my best to answer them than to say “don’t bother,” let alone leave you feeling flamed. :slight_smile:

I appreciate that. :slight_smile:

As far as heavy-duty use cases, I will endeavor to figure out how to make my Minecraft server VM hit the ZFS storage extra hard. Somehow. Autogeneration of 4K quality world maps for web browser-based viewing? (Honestly, none of my use cases right now really seem like they’d ever push the envelope. My goal in building the servers I’ve got was always more doing several less-intense workloads in parallel, with a focus on stability over speed.)

I think if I could wave a magic wand and do one really nice thing for ZFS newbies, it would be lowering the SEO on all the really, really old “Intro to ZFS” material from 10-15 years ago aimed at corporate IT departments. That got surfaced for me when I started, and I went deep down several rabbit holes I needn’t have bothered with, given the modern defaults, when I set up my first Proxmox server. Like, I spent hours agonizing at some point over whether I really needed to worry about how extended attributes were stored, or ACL types, or dnodesize, or several of those other things the older docs say you must tweak so they don’t set your SSDs on fire.

By contrast, modern TrueNAS just sets reasonable defaults and you go.

Modern SSDs (especially used server SSDs, but also a lot of the nicer prosumer ones) don’t seem nearly so made of Plasticine where ZFS is concerned.

For home/hobbyist use, you almost certainly don’t need to bother tuning for MySQL at all. You can tune for it, obviously, but the odds are very good that you’re not producing enough of a workload to get a seat-of-the-pants difference out of “optimally tuned” versus “I just typed apt install mariadb-server, was that not all I was supposed to do?”

To put things in perspective, I didn’t even bother setting up separate drives with recordsize=16K for the MySQL stuff I set up in my own homelab–there just isn’t enough workload to bother. Now, the clients with ERP setups and hundreds or thousands of concurrent users… those folks get tuning.

In a homelab environment, the major reason to do fine-grained tuning is typically to learn how to do fine-grained tuning in a more forgiving environment, when you plan to take those lessons into more challenging spaces, probably for money, later. And if that’s what you’re doing, you want to take careful notes and make measurements of the impact of all the various tuning optimizations you try!
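
If you do go down the measure-everything path, something like fio gives you a repeatable baseline to run before and after each change. The mount point and job parameters below are only illustrative, not a tuned benchmark, and the run writes test files into the target directory:

# 16k random read/write against the database disk's mount point
fio --name=db-sim --directory=/var/lib/mysql --size=2G \
    --rw=randrw --bs=16k --ioengine=libaio --iodepth=16 \
    --numjobs=4 --runtime=60 --time_based --group_reporting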


I appreciate the perspective on this. I’m definitely guilty of unnecessary tuning, though hopefully less now. My work is indirectly related to a lot of this, even though I don’t touch databases directly (I’m in an IT/Software Engineering adjacent field), so I’m definitely getting value out of learning how to do all this. There’s also a good chance it could become more directly relevant later, so it’s definitely worth it.

While setting up MariaDB, I’ve taken time to make sure I learned how to use fdisk and mkfs and e2label properly, which I didn’t know this morning. I also finally went through a tutorial on how to correctly manage adding disks to fstab, and I just set up public key authentication via SSH for the first time.
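
For my own notes, the pattern I followed was roughly this. The mount point, username, and UUID are placeholders for my setup, so treat them as examples:

# mount by UUID so the entry survives device renames
blkid /dev/sdb1                 # copy the UUID it prints
mkdir -p /srv/mariadb-data
echo 'UUID=<uuid-from-blkid> /srv/mariadb-data ext4 defaults,noatime 0 2' >> /etc/fstab
mount -a                        # tests the fstab entry without a reboot

# first-time SSH public key authentication
ssh-keygen -t ed25519
ssh-copy-id someuser@memory-alpha2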

Ironically enough, the project that finally pushed me into getting a MariaDB server set up is Bookstack, so I can start building documentation for how I set up all my home server/self-hosted stuff (and more important than that, the network that makes the Netflix and the Hulu and the Amazon Prime work for my roommates; they know where I sleep).

That said, I’ve certainly (gladly) come to the conclusion that I don’t have to have the most perfect ZFS deployment ever to enjoy using ZFS.

I’m also sitting here watching an Intro to Syncthing video at 10 pm, so I might just be strange. ¯\_(ツ)_/¯

Thanks for taking the time to keep me from running into too many brick walls. :slight_smile:
