How Does Ceph Compare to ZFS in Terms of Data Integrity?

Easy title, probably long answer.

  1. As Ceph becomes more popular, I would like to know what the ZFS experts think about data integrity. I know that Ceph has replication and erasure coding to protect against data loss from disk failure. I also know that I can use CRUSH rules to define failure domains (OSD, host, rack, room, datacenter, etc.), and I understand that placement groups take a slightly different approach to data-loss risk than pure replicated disks, trading it for faster rebuild times. (There’s a rough sketch of what I mean by a CRUSH rule below, after these questions.)

  2. I would like to hear more details and opinions on all of this from the ZFS community, as well as whether Ceph has a checksum feature like ZFS. Would you use Ceph to store archive data on disk for the long term?

  3. Finally, what is the largest ZFS deployment you have seen? For me (a passionate homelab nerd), I’d say I don’t need Ceph as long as I can keep expanding my main server with additional JBODs, because I don’t need to distribute data geographically, and adding JBODs works fine (as long as I have enough expansion cards). But at what point do you think it would make sense to move to Ceph?
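For anyone who hasn’t stared at a CRUSH map before, this is roughly the kind of rule I mean; the rule name and id are made up, and I’m paraphrasing the documented syntax rather than copying from a real cluster:

```
# hypothetical rule: put each replica on a different host
rule replicated_per_host {
    id 1
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
```

As far as I can tell the CLI equivalent is `ceph osd crush rule create-replicated replicated_per_host default host`, and swapping `host` for `rack` or `datacenter` is what changes the failure domain.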

I don’t have a strong opinion on Ceph beyond “distributed filesystems are incredibly complex, you’d better put your big kid pants on.”

I think I’d start getting potentially interested in Ceph somewhere around the petabyte mark, personally. Sorry I can’t give you more than “potentially.”

The reason I’m both cautious about it and don’t have direct relevant experience with it is partially because performance is also extremely tricky with distributed storage systems. It’s very, very easy to burn the resources of ten or twenty systems to produce worse actual performance than a single system would provide, and you end up with much more complex failure modes that you need to be capable of addressing. It’s a lot!


Thanks for your response. I agree with you: I don’t see much use for Ceph in the homelab as long as it’s possible to keep extending ZFS with JBODs.

The only real reason other than scaling itself is hyper-converged storage for HA clusters like Proxmox.

I think there might be some terminology confusion here.

Generally speaking, “converged” refers to a one-rack system sold pre-populated with both compute (typically hypervisor) and storage nodes, ready to rock from the factory, usually with high-performance specialty interconnects (e.g. Mellanox) provided in that same rack by the one vendor covering both silos.

Hyperconverged refers to a virtualization solution in which both compute and storage are in the same chassis, with the PCIe bus itself serving as the highest-possible-performance interconnect. E.g. a Sanoid server with twelve drives and two CPUs in one 4U chassis: the VMs it runs don’t need any other device at all, since compute, RAM, and storage are all in the same box.


A buddy runs his home setup on k8s, 4-5 nodes I think, with Ceph providing distributed storage across them. Certainly not petabyte scale…

Thanks for the clarification! Always great to learn. As I said, I’m more of an advanced noob than a pro who really needs to get the names of the technologies right. What I meant was HA (and the fact that VMs and storage live on one machine and can easily be replicated node by node is what makes HA possible, as in the case of Proxmox).

My understanding is that Ceph can do many things. One is to scale beyond the JBOD level; another is HA. For example, some people run Ceph on just three 250 GB SSDs, which is really small.
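To put numbers on how small that can be, a sketch of such a pool might look like the commands below; the pool name and PG count are invented, and I’m assuming the default CRUSH rule that spreads replicas across hosts:

```
# hypothetical 3-node homelab pool
ceph osd pool create vm-pool 32          # 32 placement groups, made-up number
ceph osd pool set vm-pool size 3         # three replicas, one per host with the default rule
ceph osd pool set vm-pool min_size 2     # stay writable with one node down
```

That min_size setting is, as I understand it, what gives you the HA behaviour: lose one node and the pool keeps serving I/O.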

Setting aside the massively increased complexity of Ceph vs. ZFS as a distributed system, I’m not sure whether my question about checksums was skipped or answered implicitly in one of the replies. Any thoughts?

I just want to store a lot of data, and I cannot accept bit rot. But if the day comes when I can no longer scale ZFS with JBODs, is Ceph good enough, or is it better to take the pain of running independent nodes that each store a slice of my entire dataset, with me trying to keep track of which data lives where?

I don’t feel comfortable enough in my knowledge of Ceph to give you an authoritative answer about how well it handles bitrot, but my understanding is that it’s quite secure against it.
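From what I’ve read (grain of salt, again): BlueStore checksums every block it writes (crc32c by default) and verifies the checksum on read, and the periodic deep scrubs re-read and compare replicas, which is roughly the job zpool scrub does on the ZFS side. Something like the following should let you inspect and exercise both; the PG id and the ZFS pool name are placeholders:

```
# Ceph side (placeholder PG id)
ceph config get osd bluestore_csum_type    # crc32c by default, as I understand it
ceph pg deep-scrub 2.1f                    # force a deep scrub of one placement group

# ZFS side, for comparison (placeholder pool name)
zfs get checksum tank                      # fletcher4 by default; sha256 etc. optional
zpool scrub tank                           # walk and verify every block
zpool status -v tank                       # watch the scrub and any repaired errors
```

Deep scrubs are scheduled automatically (osd_deep_scrub_interval defaults to a week, I believe), much like people cron their zpool scrubs.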

You definitely need Ceph (or some other form of shared storage) if you want to do HA with Proxmox, because otherwise only your compute stack is HA and your storage is a single point of failure.
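For reference, once Ceph is wired into Proxmox, the RBD entry in /etc/pve/storage.cfg looks something like this; the storage ID, pool name, and monitor addresses below are invented for illustration:

```
rbd: ceph-vm
        content images,rootdir
        pool vm-pool
        krbd 0
        monhost 10.0.0.11 10.0.0.12 10.0.0.13
        username admin
```

Every node sees the same storage, so when HA restarts a VM on another node, its disks are already there.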
