Ceph/CephFS; How is a R/W request serviced by the cluster?

DigitalGarden · May 9, 2024, 8:15am

I’ve watched many of the 45Drives Ceph videos, but still don’t really understand the data flow when an external client asks to read from or write to the cluster. I realize this may be a bit different based on what type of storage access method is being exposed (CephFS, S3, iSCSI, SMB) and whether replication or erasure coding is being used, but this is the gist of my questions:

To handle a request, does the monitor(?) just point the external client or gateway to some “URL” in the pool, with the client/gateway then connecting more directly to that “URL” to R/W the file/object? Or does the gateway(?) assemble/cache all the data for the read/write from across the cluster and then present it to the client (is data for a single file striped across OSDs?)? I.e., is the process more of an index or a pipe?
If it’s a pipe, doesn’t that introduce a lot of latency, even if a 10G network is being used internal to the cluster, because of all the hops the data has to make?

Any print or video reference explaining this piece would be appreciated.

Hutch-45Drives · May 9, 2024, 6:48pm

Hi @DigitalGarden, this would vary dramatically depending on how you are accessing the data and how you have your cluster configured.

We can break it down into 3 different types: CephFS (filesystem), RBD (block), and S3 (object storage).

Depending on which one you are using, there are different methods to access these services, but once you do connect, Ceph will generally act the same. The client (or the gateway acting on its behalf) first contacts the monitors to get the current cluster map. It then runs the CRUSH algorithm itself to calculate which OSDs the data belongs on, and sends the read or write directly to those OSDs. The monitors are not in the data path.
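
As a rough illustration of the "calculate where the data goes" step, here is a minimal Python sketch that maps an object name to a placement group by hashing the name. It is a toy under stated assumptions: `object_to_pg` is a hypothetical helper and SHA-256 is only a stand-in. Real Ceph uses its own hash function and then runs CRUSH to map the placement group to OSDs.

```python
import hashlib


def object_to_pg(object_name: str, pg_num: int) -> int:
    """Map an object name to a placement group index (toy version).

    The same name always hashes to the same PG, so no lookup table
    is needed to find where an object lives.
    """
    digest = hashlib.sha256(object_name.encode("utf-8")).digest()
    # Use the first 4 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:4], "big") % pg_num


# Hypothetical pool with 128 placement groups.
pg = object_to_pg("my-object", pg_num=128)
print(f"my-object -> PG {pg}")
```

The point of the sketch is that placement is computed from the name rather than looked up, which is why the process behaves more like an index than a pipe.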

In a replicated pool with a host failure domain, CRUSH picks as many OSDs as you have replicas, each on a different host. So a 3-replica pool would pick 1 OSD from each of 3 different hosts and write copies to those 3 OSDs. If something were to happen to one of those OSDs, the cluster would then recover that data using the other 2 copies and recreate a third copy on another OSD in the same host.
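
The "one OSD per host" rule and the rebuild behaviour can be sketched in Python as below. This is not CRUSH itself: it uses a simple highest-score-wins hash as a stand-in, and the `CLUSTER` layout, `place_pg`, and `_score` names are all hypothetical and chosen only for illustration.

```python
import hashlib

# Hypothetical cluster layout: host name -> OSD ids on that host.
CLUSTER = {
    "host-a": [0, 1, 2],
    "host-b": [3, 4, 5],
    "host-c": [6, 7, 8],
    "host-d": [9, 10, 11],
}


def _score(pg_id: int, key: str) -> int:
    """Deterministic pseudo-random score for a (PG, item) pair."""
    digest = hashlib.sha256(f"{pg_id}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place_pg(pg_id, cluster, replicas=3, down_osds=frozenset()):
    """Pick `replicas` OSDs for a PG, at most one per host."""
    # Rank hosts for this PG, then take the best available OSD in each.
    hosts = sorted(cluster, key=lambda h: _score(pg_id, "host:" + h), reverse=True)
    chosen = []
    for host in hosts:
        candidates = [o for o in cluster[host] if o not in down_osds]
        if not candidates:
            continue  # whole host unavailable: fall through to the next host
        chosen.append(max(candidates, key=lambda o: _score(pg_id, f"osd:{o}")))
        if len(chosen) == replicas:
            break
    return chosen


healthy = place_pg(7, CLUSTER)
degraded = place_pg(7, CLUSTER, down_osds={healthy[0]})
print("healthy: ", healthy)
print("degraded:", degraded)
```

In this toy, losing a single OSD moves that copy to another OSD on the same host, while losing every OSD on a host moves the copy to the next host in the ranking.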

Erasure-coded pools are different. The placement is again calculated with CRUSH, but instead of full copies the data is split into chunks. In this instance I will use a 2+1 profile, so the data is broken into 2 data chunks and 1 parity chunk is created. These 3 chunks are then put on 3 different OSDs, 1 in each host. If a host or OSD fails, the missing chunk is recalculated and rebuilt. If it is an OSD, it gets rebuilt on the same host; if it’s a whole host, it will be rebuilt on a different host.
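
To make the 2+1 idea concrete, here is a small Python sketch that splits data into 2 chunks, computes 1 parity chunk with XOR, and rebuilds any single lost chunk from the other two. XOR parity is a simplification chosen for illustration (real Ceph erasure-code plugins use more general codes), and `encode_2_plus_1` and `rebuild` are hypothetical helper names.

```python
def encode_2_plus_1(data: bytes):
    """Split data into 2 equal data chunks plus 1 XOR parity chunk."""
    if len(data) % 2:
        data += b"\x00"  # toy padding so both halves have equal length
    half = len(data) // 2
    d0, d1 = data[:half], data[half:]
    parity = bytes(a ^ b for a, b in zip(d0, d1))
    return [d0, d1, parity]


def rebuild(chunks):
    """Rebuild the single missing chunk (marked None) from the other two."""
    missing = chunks.index(None)
    a, b = (c for c in chunks if c is not None)
    # With XOR parity, any one chunk equals the XOR of the other two.
    restored = bytes(x ^ y for x, y in zip(a, b))
    return chunks[:missing] + [restored] + chunks[missing + 1:]


chunks = encode_2_plus_1(b"hello ceph")
damaged = [chunks[0], None, chunks[2]]  # pretend the OSD holding chunk 1 died
assert rebuild(damaged) == chunks
```

Each of the 3 chunks is half the size of the original data, so a 2+1 pool stores 1.5x the data instead of the 3x a 3-replica pool needs.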

This process is repeated for every request that Ceph sees.

What we can do to speed this up is add journal drives to the OSDs, which takes the RocksDB off of the slower HDD OSDs and puts it on the SSD journal drive.

I’m not going to go into much detail, but our ceph-seminar covers the different connection types and how you can access them. In short:

RBD clients connect directly to the cluster: they get the cluster map from the MONs, and all reads and writes then go straight to the OSDs.

With CephFS, the client again goes to the MONs first, but because it is a filesystem there is metadata that goes with the data. MDS services handle these metadata requests, providing the correct info for each client, while the actual file data is still read from and written to the OSDs directly.
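
On the question of whether a single file is spread across OSDs: a file's data is stored as a series of fixed-size objects, and each object is placed independently. The Python sketch below shows how a byte range of a file maps onto those objects. The 4 MiB object size and the `<inode hex>.<8-digit hex index>` naming are assumptions for illustration (check the file layout on your own cluster), and `objects_for_range` is a hypothetical helper.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumed 4 MiB per object


def objects_for_range(inode: int, offset: int, length: int,
                      object_size: int = OBJECT_SIZE):
    """Return (object_name, offset_in_object, byte_count) pieces for a byte range."""
    pieces = []
    end = offset + length
    while offset < end:
        index = offset // object_size       # which object holds this byte
        within = offset % object_size       # position inside that object
        take = min(object_size - within, end - offset)
        pieces.append((f"{inode:x}.{index:08x}", within, take))
        offset += take
    return pieces


# Hypothetical inode; read 2 MiB starting 3 MiB into the file.
MIB = 1024 * 1024
for piece in objects_for_range(0x10000000000, 3 * MIB, 2 * MIB):
    print(piece)
```

A read that crosses an object boundary, like the one above, touches two objects, and those objects may sit on different OSDs.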

S3 is similar to CephFS in that we have an RGW daemon running that provides the S3 endpoint your clients connect to. RGW decides whether clients are allowed to connect and lets them see what data is in the buckets. Once a client makes a request, RGW reads or writes the objects on the OSDs on the client’s behalf.

RBD will just use ceph packages on your client to mount RBD

CephFS will also use Ceph packages to mount the filesystem natively, or you can have a gateway server which then reshares this filesystem using SMB or NFS.
S3 does not require any additional software as long as your client supports the protocol.

In our deployments, all 3 will use a VIP that will fail over in the event of a failure. You can also configure a DNS entry for this VIP to connect by hostname.

I hope that covers what you want to know. If not, please feel free to post more questions.

DigitalGarden · May 9, 2024, 7:32pm

Thanks. I found the @Cephstorage YouTube channel and this video that seems to help, although I need to study a lot more, and I am sure there are others:

https://www.youtube.com/watch?v=PmLPbrf-x9g

Things like librados, the hashing by object name, and placement groups were pieces I was missing.