Root partition mounts as read-only after failed cockpit-ceph-deploy

So this all started when I was trying to use cockpit-ceph-deploy to create a Ceph cluster out of one HL15 and 3 Raspberry Pis… I know, it’s a weird idea.
I was unable to get ceph-deploy to complete the device-aliases.yml playbook on the Pis even after manually setting up the device aliases according to this github issue, so I decided to give up for now and try running ceph-deploy with only the one HL15 (basically a single-node Ceph cluster, with the option of adding the Raspberry Pis later).
In the middle of all this (or possibly before), I physically moved all my 4TB ZFS hard drives from the 1-1 thru 1-5 slots to the 1-11 thru 1-15 slots, and installed an 8TB drive in slot 1-1.
When I tried the ceph-deploy with only the HL15, I kept getting errors during the core.yml playbook related to ceph_volumes.py, and it showed that the playbook was trying to use the ZFS drives only and not the empty 8TB drive that I put in specifically for Ceph. I tried manually editing the ceph_volumes.py file to do what I wanted (bad idea) and got a different error, the details of which I do not remember. I later discovered that the 8TB drive wasn’t even being recognized by the OS anymore, so I decided to reboot before digging around any further, in case I had tried to hot-swap that drive and forgotten about it…

After THAT reboot, Cockpit did not start up on its own, so I SSH’d in and discovered that the root partition was mounted as read-only, which had prevented all the services from running at startup.
I am able to use mount -o remount,rw / to remount it as read-write, but it goes back to read-only after a reboot.
When I checked dmesg I could not find any obvious errors; the only hint I see is in the boot command line:
[ 0.000000] Command line: BOOT_IMAGE=(hd6,gpt2)/vmlinuz-4.18.0-513.18.1.el8_9.x86_64 root=/dev/mapper/rl-root ro crashkernel=auto resume=/dev/mapper/rl-swap rd.lvm.lv=rl/root rd.lvm.lv=rl/swap rhgb quiet
It says “... root=/dev/mapper/rl-root ro ...” which I am guessing means it is mounting as read-only.
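From what I have read, the ro on the kernel command line is apparently normal for RHEL-family distros, and the root filesystem is supposed to get remounted read-write later in boot by systemd-remount-fs.service based on /etc/fstab, so the next things I am planning to check are roughly:

systemctl status systemd-remount-fs.service   # did the remount step fail?
grep -v '^#' /etc/fstab                       # what options does the / entry have?
findmnt /                                     # how is root actually mounted right now?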

The /home mounts as read-write as it should, but the /boot folder is empty. If I run sudo mount -a then /boot is still empty, but I can use mount /dev/nvme0n1p1 /mnt/boot to mount and access the boot partition just fine.
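In case it helps, the way I have been checking this is to compare what fstab expects against what the partition actually reports (nvme0n1p1 is just where the boot partition happens to be on my system):

blkid /dev/nvme0n1p1    # UUID and filesystem type of the boot partition
grep boot /etc/fstab    # what fstab thinks /boot should be mounted from
findmnt --verify        # sanity-check every fstab entry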

If anyone has advice for how I might figure out the root cause of this, I would appreciate it; I am not sure if it is just a boot config issue or caused by some sort of I/O problem. I have heard that some Linux systems will remount the root filesystem as read-only if there are disk I/O errors during boot, but I cannot find any indication of such errors on my system…
I also tried using fsck and xfs_repair and it still boots with the root partition as read-only.
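For reference, since the root filesystem cannot be repaired while it is mounted, I ran xfs_repair from a rescue environment, roughly like this (rl-root is just the LV name on my Rocky install):

vgchange -ay rl                     # activate the LVM volume group
xfs_repair -n /dev/mapper/rl-root   # dry run first, only report problems
xfs_repair /dev/mapper/rl-root      # then the actual repair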

I will try to upload the full output of dmesg if that would be helpful.

Hi @NerdyGriffin, I would guess that when you ran the ceph core playbook it tried to wipe and use your OS boot drive as an OSD drive, since the aliasing was not set up correctly.

My guess is that your boot drive is now in a broken state because it was wiped/written over by the playbook.
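If you want to confirm, you could check the boot drive for leftover or missing signatures with something like this (assuming the boot drive is the NVMe; adjust the device name to match your system):

lsblk -f                 # list filesystems and labels on every disk
wipefs -n /dev/nvme0n1   # report (without erasing) the signatures found on the drive
blkid                    # compare against what grub and fstab are expecting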

Ouch, that sounds painful to fix.
Is there any way for me to restore the original boot drive image that it shipped with? I got it working somehow a week ago, possibly by editing the grub boot command line (I have forgotten what I did since then). I suspect it is in a broken state as you said because I keep encountering new weird errors and issues…

Also, if I reinstall the OS, is there an easy way for me to backup and restore the samba net registry?
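My rough plan, unless someone knows a better way, is to dump the registry-based share config with net conf before the reinstall and import it again afterwards, something like:

net conf list > samba-registry-backup.conf    # dump the registry shares in smb.conf format
net conf import samba-registry-backup.conf    # after reinstall, load them back into the registry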

I also did not realize that the ceph-deploy playbooks would uninstall the ZFS packages; I think that should be stated explicitly in the README.
I spent a couple days trying to reverse-engineer the ceph-deploy source code just to figure out what exactly it was doing when the playbook says “Removing packages…”

We are working on getting a master image ISO hosted for homelabbers to download.

The reason the ceph playbooks remove ZFS is that if you are building a Ceph cluster there is no need for ZFS to be installed, and we wouldn’t want to have ZFS pools and Ceph OSD drives in the same host: Ceph is all about shared storage, while ZFS is only local storage.


I understand and agree with the reasoning; I just wish it were mentioned in the documentation, since it is a potentially breaking change to the system.

Thank you for the update; I will keep an eye out for that potential future release.

If anyone finds this in the future and is wondering how this turned out:

I ended up cloning the “corrupted” boot drive to a temporary spare, then I added a second NVMe drive using a PCIe riser card and (after many failed attempts) I learned how to install Rocky Linux in a mirrored RAID configuration using LVM on top of an mdadm array. I then reconnected the clone of the original boot drive and manually migrated all of my packages and configs to the new installation. It probably could have been done a lot faster, but I ended up learning a ton about how to install and configure various Linux services that I had never interacted with before, and the resulting system has been even more reliable than before (mostly due to me previously tinkering with things I didn’t understand at the time).
I followed the various guides on the 45Drives Knowledge Base to set up all the features that had originally come pre-installed on the machine, and because I kept a fully bootable clone of the original boot drive, I was able to boot up the original image at any time to compare and contrast anything that I wasn’t sure how to find in the system config files.
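For anyone curious about the mirrored setup, the general shape of it was an mdadm RAID1 array with LVM on top; the device names and sizes below are placeholders rather than exactly what I used:

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nvme0n1p3 /dev/nvme1n1p3
pvcreate /dev/md0                 # make the mirror an LVM physical volume
vgcreate rl /dev/md0              # volume group named to match the Rocky default
lvcreate -L 16G -n swap rl
lvcreate -l 100%FREE -n root rl
mkfs.xfs /dev/rl/root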

All in all, it probably took about 3 months to get everything back to fully functional, but the stuff I learned along the way has made future maintenance and deployments so much faster, easier, and more reliable.
