Learning Ceph, but also planning for expansion

I’m looking to learn Ceph in a homelab environment, and I just got my HL15, so I’ve got a clean-slate storage server to work with. I don’t like the limitations ZFS has when it comes to arbitrary expansion; in contrast, I really like the hybrid RAID that Synology provides, which self-balances across drives as you add them, similar to what I’ve learned Ceph does. I’d like to move away from the closed Synology ecosystem but keep that powerful feature.

I’m considering making 3 VMs on a single server, using a server-level failure domain, and having each one manage 5 drives. I realize this eliminates much of the high-availability protection against a server outage, since they share the same motherboard and power supply, but being a homelab, I’m willing to accept that risk. I should still get RAID 5-like safety against hard drive failure if I understand correctly. Best of all, based on how I understand Ceph’s failure domain concept, I could add another physical server down the line, add it to the cluster, let it self-balance, then intentionally kill off one of my VMs, consume the now-unused hard drives with the other 2 VMs, and return to a 3-system cluster, but now with 2 physical servers.
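
For reference, my understanding is that the “server-level failure domain” part is just a replicated CRUSH rule that places each copy under a different host, with each VM registering as its own host. A rough sketch of what I think the commands look like (pool/rule names and PG counts are placeholders, and I haven’t actually run this yet):

```bash
# Rule that puts each replica under a different "host" bucket;
# my 3 VMs would each show up as a host in the CRUSH map.
ceph osd crush rule create-replicated rep_by_host default host

# 3-way replicated pool using that rule (pool name and PG count
# are just placeholders for this sketch).
ceph osd pool create tank 128 128 replicated rep_by_host
ceph osd pool set tank size 3
```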

I think this means I could scale infinitely and gradually achieve true high availability once I reach 3 physical servers. Is this concept sound?

I also know that a per-drive failure domain is an option that works much like traditional RAID, but is it difficult to change the failure domain to server level later on?

Hi @cknolla, what you said is correct. First, to answer your question: yes, you can change the failure domain from OSD to host level on existing data; it just requires a lot of rebalancing and data shuffling to complete.

I am going to suggest an alternative, although it might not be as good: you could simply set up a single Ceph node on the one HL15 for now with an OSD failure domain (a 3-replica pool would allow any 2 of the 15 drives to be lost without issue, or you could use an EC profile). Then, once you are ready, you could grow the cluster by 2 more physical hosts (this is where my plan isn’t as good, because you need to expand by 2 hosts rather than 1). This way you don’t need to worry about VMs or taking a VM down; it would all be at the host level, and at that point you could change your replicated pool from OSD to host failure domain, giving you a full, real 3-node Ceph cluster with real HA.
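
As a rough CLI sketch of what that could look like (pool/rule names are just examples, and that last step will kick off a big rebalance):

```bash
# Day 1: single HL15, replicas spread across OSDs (individual drives).
ceph osd crush rule create-replicated rep_by_osd default osd
ceph osd pool create data 128 128 replicated rep_by_osd
ceph osd pool set data size 3

# (or, for the EC route, an example profile with OSD-level failure domain)
ceph osd erasure-code-profile set ec_k4m2 k=4 m=2 crush-failure-domain=osd

# Later, once the 2 extra physical hosts have joined the cluster:
# point the replicated pool at a host-level rule and let Ceph rebalance.
ceph osd crush rule create-replicated rep_by_host default host
ceph osd pool set data crush_rule rep_by_host
```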

The idea you had will still work, but with the VMs you are adding a lot more overhead and resource usage to the system. Instead of only needing to run 1 OS and a few Ceph services (not counting the per-drive OSD services, which you need either way), you would be tripling the number of services by running VMs on the 1 HL15 host. Not to mention, all of these services like to use up RAM. For an OSD it is recommended to have 4GB of RAM per OSD, plus a mon using another 4GB, and an MGR. The MDS, if you are using a file system, would also likely use 4GB by default, but that would likely be increased depending on how much metadata you have.
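
To put rough numbers on a fully loaded HL15 using those defaults: 15 OSDs x 4GB is about 60GB, plus roughly 4GB for a mon, a couple of GB for the MGR, and another 4GB for an MDS if you run CephFS, so plan on something like 70GB for Ceph alone before the OS or any VM overhead.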

When it comes to Ceph, you need a lot of RAM for it to be happy.

Also, Ceph will not outperform a single-server ZFS system, so if you are looking for performance, maybe Ceph is not the answer.

Thanks for confirming and clarifying, Hutch. Definitely a lot to consider there, but thanks for mentioning the RAM requirements as well; I should be good there as I’ll have 192GB. I also built my machine with a Xeon Gold 6230, so I’ve got quite a bit of compute to work with. That does make the choice difficult, since an OSD failure domain makes the most sense starting out, but the switch to a host domain would be brutal if I max out the drive bays.
I’m not quite as concerned about performance, so I might be willing to sacrifice that in favor of learning Ceph, but I also appreciate you mentioning that it’s a factor to consider.