
Modernize the ceph recipe

David Young, 2020-05-25 15:14:02 +12:00 (committed by GitHub)
parent 3137b1f2ce
commit 81d50951bf
4 changed files with 177 additions and 154 deletions

@@ -2,194 +2,217 @@
While Docker Swarm is great for keeping containers running (_and restarting those that fail_), it does nothing for persistent storage. This means if you actually want your containers to keep any data persistent across restarts (_hint: you do!_), you need to provide shared storage to every docker node.
![Ceph Screenshot](../images/ceph.png)
## Ingredients
!!! summary "Ingredients"
3 x Virtual Machines (configured earlier), each with:
* [X] Support for "modern" versions of Python and LVM
* [X] At least 1GB RAM
* [X] At least 20GB disk space (_but it'll be tight_)
* [X] Connectivity to each other within the same subnet, and on a low-latency link (_i.e., no WAN links_)
* [X] A second disk dedicated to the Ceph OSD
* [X] Each node should have the IP of every other participating node hard-coded in /etc/hosts (*including its own IP*; see the example below)
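As an illustration only (*hypothetical addresses; the hostnames are simply the ones used in the examples later in this recipe*), each node's `/etc/hosts` might look something like this:
```
# /etc/hosts - identical on every node, including an entry for the node itself
192.168.38.101 raphael
192.168.38.102 donatello
192.168.38.103 leonardo
```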
## Preparation
!!! tip "No more [foolish games](https://www.youtube.com/watch?v=UNoouLa7uxA)"
Earlier iterations of this recipe (*based on [Ceph Jewel](https://docs.ceph.com/docs/master/releases/jewel/)*) required significant manual effort to install Ceph in a Docker environment. In the 2+ years since Jewel was released, substantial improvements have been made to the ceph "deploy-in-docker" process, including the [introduction of the cephadm tool](https://ceph.io/ceph-management/introducing-cephadm/). Cephadm now does all the heavy lifting below, for the current version of ceph, codenamed "[Octopus](https://www.youtube.com/watch?v=Gi58pN8W3hY)".
### Pick a master node
One of your nodes will become the cephadm "master" node. Although all nodes will participate in the Ceph cluster, the master node is the one on which we'll bootstrap ceph. It's also the node which will run the Ceph dashboard, and on which future upgrades will be processed. It doesn't matter _which_ node you pick, and the cluster itself will continue to operate in the event of the loss of the master node (although you won't see the dashboard).
### Install cephadm on master node
Run the following on the ==master== node:
```
MYIP=`ip route get 1.1.1.1 | grep -oP 'src \K\S+'`
curl --silent --remote-name --location https://github.com/ceph/ceph/raw/octopus/src/cephadm/cephadm
chmod +x cephadm
mkdir -p /etc/ceph
./cephadm bootstrap --mon-ip $MYIP
```
The process takes about 30 seconds, after which you'll have an MVC (*Minimum Viable Cluster*)[^1], encompassing a single monitor and mgr instance on your chosen node. Here's the complete output from a fresh install:
??? "Example output from a fresh cephadm bootstrap"
```
root@raphael:~# MYIP=`ip route get 1.1.1.1 | grep -oP 'src \K\S+'`
root@raphael:~# curl --silent --remote-name --location https://github.com/ceph/ceph/raw/octopus/src/cephadm/cephadm
root@raphael:~# chmod +x cephadm
root@raphael:~# mkdir -p /etc/ceph
root@raphael:~# ./cephadm bootstrap --mon-ip $MYIP
INFO:cephadm:Verifying podman|docker is present...
INFO:cephadm:Verifying lvm2 is present...
INFO:cephadm:Verifying time synchronization is in place...
INFO:cephadm:Unit systemd-timesyncd.service is enabled and running
INFO:cephadm:Repeating the final host check...
INFO:cephadm:podman|docker (/usr/bin/docker) is present
INFO:cephadm:systemctl is present
INFO:cephadm:lvcreate is present
INFO:cephadm:Unit systemd-timesyncd.service is enabled and running
INFO:cephadm:Host looks OK
INFO:root:Cluster fsid: bf3eff78-9e27-11ea-b40a-525400380101
INFO:cephadm:Verifying IP 192.168.38.101 port 3300 ...
INFO:cephadm:Verifying IP 192.168.38.101 port 6789 ...
INFO:cephadm:Mon IP 192.168.38.101 is in CIDR network 192.168.38.0/24
INFO:cephadm:Pulling latest docker.io/ceph/ceph:v15 container...
INFO:cephadm:Extracting ceph user uid/gid from container image...
INFO:cephadm:Creating initial keys...
INFO:cephadm:Creating initial monmap...
INFO:cephadm:Creating mon...
INFO:cephadm:Waiting for mon to start...
INFO:cephadm:Waiting for mon...
INFO:cephadm:mon is available
INFO:cephadm:Assimilating anything we can from ceph.conf...
INFO:cephadm:Generating new minimal ceph.conf...
INFO:cephadm:Restarting the monitor...
INFO:cephadm:Setting mon public_network...
INFO:cephadm:Creating mgr...
INFO:cephadm:Wrote keyring to /etc/ceph/ceph.client.admin.keyring
INFO:cephadm:Wrote config to /etc/ceph/ceph.conf
INFO:cephadm:Waiting for mgr to start...
INFO:cephadm:Waiting for mgr...
INFO:cephadm:mgr not available, waiting (1/10)...
INFO:cephadm:mgr not available, waiting (2/10)...
INFO:cephadm:mgr not available, waiting (3/10)...
INFO:cephadm:mgr is available
INFO:cephadm:Enabling cephadm module...
INFO:cephadm:Waiting for the mgr to restart...
INFO:cephadm:Waiting for Mgr epoch 5...
INFO:cephadm:Mgr epoch 5 is available
INFO:cephadm:Setting orchestrator backend to cephadm...
INFO:cephadm:Generating ssh key...
INFO:cephadm:Wrote public SSH key to to /etc/ceph/ceph.pub
INFO:cephadm:Adding key to root@localhost's authorized_keys...
INFO:cephadm:Adding host raphael...
INFO:cephadm:Deploying mon service with default placement...
INFO:cephadm:Deploying mgr service with default placement...
INFO:cephadm:Deploying crash service with default placement...
INFO:cephadm:Enabling mgr prometheus module...
INFO:cephadm:Deploying prometheus service with default placement...
INFO:cephadm:Deploying grafana service with default placement...
INFO:cephadm:Deploying node-exporter service with default placement...
INFO:cephadm:Deploying alertmanager service with default placement...
INFO:cephadm:Enabling the dashboard module...
INFO:cephadm:Waiting for the mgr to restart...
INFO:cephadm:Waiting for Mgr epoch 13...
INFO:cephadm:Mgr epoch 13 is available
INFO:cephadm:Generating a dashboard self-signed certificate...
INFO:cephadm:Creating initial admin user...
INFO:cephadm:Fetching dashboard port number...
INFO:cephadm:Ceph Dashboard is now available at:
URL: https://raphael:8443/
User: admin
Password: mid28k0yg5

INFO:cephadm:You can access the Ceph CLI with:
sudo ./cephadm shell --fsid bf3eff78-9e27-11ea-b40a-525400380101 -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring
INFO:cephadm:Please consider enabling telemetry to help improve Ceph:
ceph telemetry on
For more information see:
https://docs.ceph.com/docs/master/mgr/telemetry/
INFO:cephadm:Bootstrap complete.
root@raphael:~#
```
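Before adding further nodes, it's worth a quick sanity-check that the bootstrap produced a healthy (if tiny) cluster. A minimal sketch, assuming you're still on the ==master== node and haven't yet installed the native CLI tools (*see "Sprinkle with tools", below*):
```
# Run a one-off ceph command inside the cephadm shell container
sudo ./cephadm shell -- ceph -s

# Expect to see 1 mon and 1 mgr; the cluster will typically report
# HEALTH_WARN until OSDs are added in the following steps
```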
### Prepare other nodes
It's now necessary to transfer the following files to your ==other== nodes, so that cephadm can add them to your cluster, and so that they'll be able to mount the cephfs when we're done (*see the example below for one way to copy them*):
Path on master | Path on non-master
--------------- | -----
`/etc/ceph/ceph.conf` | `/etc/ceph/ceph.conf`
`/etc/ceph/ceph.client.admin.keyring` | `/etc/ceph/ceph.client.admin.keyring`
`/etc/ceph/ceph.pub` | `/root/.ssh/authorized_keys` (append to anything existing)
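One way to copy these (a sketch only, assuming you can already SSH to the other nodes as root, and using the example node names `donatello` and `leonardo`) is from the ==master== node:
```
# Run on the master node, once per remaining node
for NODE in donatello leonardo; do
  ssh root@$NODE "mkdir -p /etc/ceph"
  scp /etc/ceph/ceph.conf /etc/ceph/ceph.client.admin.keyring root@$NODE:/etc/ceph/
  # Append cephadm's public key, so the orchestrator can manage the node over SSH
  ssh root@$NODE "mkdir -p /root/.ssh && cat >> /root/.ssh/authorized_keys" < /etc/ceph/ceph.pub
done
```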
Back on the ==master== node, run `ceph orch host add <node-name>` once for each other node you want to join to the cluster. You can validate the results by running `ceph orch host ls`.
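With the example node names used above, that might look like this (*prefix the commands with `sudo ./cephadm shell -- ` if you haven't yet installed the native CLI tools described under "Serving", below*):
```
# Add each remaining node to the cluster (cephadm reaches them over SSH, as root)
ceph orch host add donatello
ceph orch host add leonardo

# Confirm that all three hosts are now known to the orchestrator
ceph orch host ls
```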
!!! question "Should we be concerned about granting cephadm root access over SSH?"
Not really. Docker is inherently insecure at the host level anyway (*think what would happen if you launched a global-mode stack with a malicious container image which mounted `/root/.ssh`*), so worrying about cephadm seems a little barn-door-after-horses-bolted. If you take host-level security seriously, consider switching to [Kubernetes](/kubernetes/start/) :)
### Add OSDs
Now for the biggest improvement since the days of ceph-deploy and manually preparing disks: on the ==master== node, run `ceph orch apply osd --all-available-devices`. This will identify any unloved (*unpartitioned, unmounted*) disks attached to each participating node, and configure these disks as OSDs.
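In practice, that looks something like the sketch below; the follow-up commands are optional, but useful to confirm the disks were actually consumed:
```
# Turn every eligible (blank, unpartitioned, unmounted) disk on every host into an OSD
ceph orch apply osd --all-available-devices

# Optionally, watch the devices and resulting OSDs appear (this can take a minute or two)
ceph orch device ls
ceph osd tree
```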
### Setup CephFS
On the ==master== node, create a cephfs volume in your cluster by running `ceph fs volume create data`. Ceph will handle the orchestration itself, creating the necessary pool, mds daemon, etc.
You can watch the progress by running `ceph fs ls` (to confirm the filesystem is configured), and `ceph -s` to wait for `HEALTH_OK`.
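Put together, the CephFS setup on the ==master== node is just a few commands (the volume name `data` matches the mount used below):
```
# Create a CephFS volume named "data"; pools and MDS daemons are orchestrated automatically
ceph fs volume create data

# Confirm the filesystem exists, then wait for the cluster to settle back to HEALTH_OK
ceph fs ls
ceph -s
```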
### Mount CephFS volume
On ==every== node, create a mountpoint for the data by running ```mkdir /var/data```, add an entry to fstab to ensure the volume is auto-mounted on boot, and ensure the volume is actually _mounted_ if there's a network / boot delay getting access to the ceph volume:
```
mkdir /var/data
MYNODES="<node1>,<node2>,<node3>" # Add your own nodes here, comma-delimited
echo -e "
# Mount cephfs volume \n
$MYNODES:/ /var/data ceph name=admin,noatime,_netdev 0 0" >> /etc/fstab
mount -a
```
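A quick way to confirm the volume is actually mounted on each node (not part of the original recipe, just a sanity check):
```
# Confirm /var/data is a cephfs mount, and check the available space
mount | grep /var/data
df -h /var/data
```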
## Serving
### Sprinkle with tools
Although it's possible to use `cephadm shell` to exec into a container with the necessary ceph tools, it's more convenient to use the native CLI tools. To this end, run the following on each node; it will add the appropriate apt repository and install the latest ceph CLI tools:
```
curl -L https://download.ceph.com/keys/release.asc | sudo apt-key add -
cephadm add-repo --release octopus
cephadm install ceph-common
```
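Once `ceph-common` is installed, you can drive the cluster directly from any node, without the `cephadm shell` wrapper; for example:
```
# Native CLI, using /etc/ceph/ceph.conf and the admin keyring copied earlier
ceph -s
ceph orch ps
```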
### Drool over dashboard
Ceph now includes a comprehensive dashboard, provided by the mgr daemon. The dashboard will be accessible at https://[IP of your ceph master node]:8443, but you'll need to run `ceph dashboard ac-user-create <username> <password> administrator` first, to create an administrator account:
```
root@raphael:~# ceph dashboard ac-user-create batman supermansucks administrator
{"username": "batman", "password": "$2b$12$3HkjY85mav.dq3HHAZiWP.KkMiuoV2TURZFH.6WFfo/BPZCT/0gr.", "roles": ["administrator"], "name": null, "email": null, "lastUpdate": 1590372281, "enabled": true, "pwdExpirationDate": null, "pwdUpdateRequired": false}
root@raphael:~#
```
## Summary
What have we achieved?
!!! summary "Summary"
Created:
* [X] Persistent storage available to every node
* [X] Resiliency in the event of the failure of a single node
* [X] Beautiful dashboard
## The easy, 5-minute install
I share (_with [sponsors][github_sponsor] and [patrons][patreon]_) a private "_premix_" GitHub repository, which includes an ansible playbook for deploying the entire Geek's Cookbook stack, automatically. This means that members can create the entire environment with just a ```git pull``` and an ```ansible-playbook deploy.yml``` 👍
Here's a screencast of the playbook in action. I sped up the boring parts; it actually takes ==5 min== (*you can tell by the timestamps on the prompt*):
![Screencast of ceph install via ansible](https://static.funkypenguin.co.nz/ceph_install_via_ansible_playbook.gif)
[patreon]: https://www.patreon.com/bePatron?u=6982506
[github_sponsor]: https://github.com/sponsors/funkypenguin
## Chef's Notes 📓
[^1]: Minimum Viable Cluster acronym copyright, trademark, and whatever else, to Funky Penguin for 1,000,000 years.

manuscript/images/ceph.png: new binary file (151 KiB), content not shown

@@ -4,22 +4,22 @@
# Copy the contents of "manuscript" to a new "publish" folder
mkdir -p publish
mkdir -p publish/overrides
cp -pr {manuscript,overrides} publish/
cp mkdocs.yml publish/

# # Append a common footer to all recipes/swarm docs
# for i in `find publish/manuscript/ -name "*.md" | grep -v index.md`
# do
# # Does this recipe already have a "tip your waiter" section?
# grep -q "Tip your waiter" $i
# if [ $? -eq 1 ]
# then
# echo -e "\n" >> $i
# cat scripts/recipe-footer.md >> $i
# else
# echo "WARNING - hard-coded footer exists in $i"
# fi
# done

# Now build the docs for publishing
mkdocs build -f publish/mkdocs.yml