
Added recipe for ceph

This commit is contained in:
David Young
2017-07-28 10:25:38 +12:00
parent a09d73c35b
commit 22fbf25cfb
7 changed files with 277 additions and 3 deletions


@@ -2,6 +2,10 @@
## Structure
1. "Recipies" are sorted by degree of geekiness required to complete them. Relatively straightforward projects are "beginner", more complex projects are "intermediate", and the really fun ones are "advanced".
2. Each recipe contains enough detail in a single page to take a project to completion.
1. "Recipies" generally follow on from each other. I.e., if a particular recipe requires a mail server, that mail server would have been described in an earlier recipe.
2. Each recipe contains enough detail in a single page to take a project from start to completion.
3. When there are optional add-ons/integrations possible for a project (i.e., the addition of "smart LED bulbs" to Home Assistant), this will be reflected either as a brief "Chef's note" after the recipe, or, if they're substantial enough, as a sub-page of the main project.
## Conventions
1. When creating swarm networks, we always explicitly set the subnet in the overlay network, to avoid potential conflicts (which docker won't prevent, but which will generate errors - see https://github.com/moby/moby/issues/26912). An example follows below.
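By way of illustration (the network name and subnet below are placeholders only, not values taken from any particular recipe), an overlay network with an explicit subnet might be created like so:
```
docker network create \
  --driver overlay \
  --subnet 172.16.200.0/24 \
  my_network
```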


@@ -24,6 +24,20 @@ This means that:
* Services are defined using docker-compose v3 YAML syntax
* Services are portable, meaning a particular stack could be shut down and moved to a new provider with minimal effort.
## Security
Under this design, the only inbound connections we're permitting to our docker swarm are:
### Network Flows
* HTTP (TCP 80) : Redirects to https
* HTTPS (TCP 443) : Serves individual docker containers via SSL-encrypted reverse proxy
### Authentication
* Where the proxied application provides a trusted level of authentication, or where the application requires public exposure,
## High availability
### Normal function


@@ -0,0 +1,179 @@
# Introduction
While Docker Swarm is great for keeping containers running (_and restarting those that fail_), it does nothing for persistent storage. This means if you actually want your containers to keep any data persistent across restarts (_hint: you do!_), you need to provide shared storage to every docker node.
## Design
### Why not GlusterFS?
I originally provided shared storage to my nodes using GlusterFS (see the next recipe for details), but found it difficult to deal with because:
1. GlusterFS requires (n) "bricks", where (n) **has** to be a multiple of your replica count. I.e., if you want 2 copies of everything on shared storage (the minimum to provide redundancy), you **must** have either 2, 4, 6 (etc.) bricks. The HA swarm design calls for a minimum of 3 nodes, and so under GlusterFS, my third node can't participate in shared storage at all, unless I start doubling up on bricks-per-node (which then impacts redundancy)
2. GlusterFS turns out to be a giant PITA when you want to restore a failed node. There are [at least 14 steps to follow](https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3/html/Administration_Guide/sect-Replacing_Hosts.html) to replace a brick.
3. I'm pretty sure I messed up the 14-step process above anyway. My replaced brick synced with my "original" brick, but produced errors when querying status via the CLI, and hogged 100% of 1 CPU on the replaced node. Inexperienced with GlusterFS, and unable to diagnose the fault, I switched to a Ceph cluster instead.
### Why Ceph?
1. I'm more familiar with Ceph - I use it in the OpenStack designs I manage
2. Replacing a failed node is **easy**, provided you can put up with the I/O load of rebalancing OSDs after the replacement.
3. CentOS Atomic includes the ceph client in the OS, so while the Ceph OSD/Mon/MDS run in containers, I can keep an eye on (and later, automatically monitor) the status of Ceph from the base OS.
## Ingredients
!!! summary "Ingredients"
3 x Virtual Machines (configured earlier), each with:
* [X] CentOS/Fedora Atomic
* [X] At least 1GB RAM
* [X] At least 20GB disk space (_but it'll be tight_)
* [X] Connectivity to each other within the same subnet, and on a low-latency link (_i.e., no WAN links_)
* [ ] A second disk dedicated to the Ceph OSD
## Preparation
### SELinux
Since our Ceph components will be containerized, we need to ensure the SELinux context on the base OS's ceph files is set correctly:
```
chcon -Rt svirt_sandbox_file_t /etc/ceph
chcon -Rt svirt_sandbox_file_t /var/lib/ceph
```
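If you'd like to confirm the new context was applied, listing the directories with their SELinux labels should show ```svirt_sandbox_file_t``` on both paths:
```
ls -dZ /etc/ceph /var/lib/ceph
```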
### Setup Monitors
Pick a node, and run the following to stand up the first Ceph mon. Be sure to replace the values for **MON_IP** and **CEPH_PUBLIC_NETWORK** with those specific to your deployment:
```
docker run -d --net=host \
--restart always \
-v /etc/ceph:/etc/ceph \
-v /var/lib/ceph/:/var/lib/ceph/ \
-e MON_IP=192.168.31.11 \
-e CEPH_PUBLIC_NETWORK=192.168.31.0/24 \
--name="ceph-mon" \
ceph/daemon mon
```
Now **copy** the contents of /etc/ceph on this first node to the remaining nodes, and **then** run the docker command above on each remaining node. You'll end up with a cluster of 3 monitors (an odd number is required for quorum, just as with Docker Swarm), and no OSDs (yet).
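Any file-copy mechanism will do for distributing /etc/ceph; a minimal sketch using scp (the ds2/ds3 hostnames below are placeholders for your own remaining nodes) might look like:
```
scp -r /etc/ceph/* root@ds2:/etc/ceph/
scp -r /etc/ceph/* root@ds3:/etc/ceph/
```
Once the mon container is running on all three nodes, running ```ceph -s``` from the base OS (which, as noted above, includes the ceph client) should show all 3 monitors in quorum.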
### Setup OSDs
Since we currently have an OSD-less, mon-only cluster, prepare for OSD creation by dumping the auth credentials for the OSDs into the appropriate location on the base OS:
```
ceph auth get client.bootstrap-osd -o \
/var/lib/ceph/bootstrap-osd/ceph.keyring
```
On each node, you need a dedicated disk for the OSD. In the example below, I used _/dev/vdd_ (the entire disk, no partitions) for the OSD.
Run the following command on every node:
```
docker run -d --net=host \
--privileged=true \
--pid=host \
-v /etc/ceph:/etc/ceph \
-v /var/lib/ceph/:/var/lib/ceph/ \
-v /dev/:/dev/ \
-e OSD_DEVICE=/dev/vdd \
-e OSD_TYPE=disk \
--name="ceph-osd" \
--restart=always \
ceph/daemon osd
```
Watch the output by running ```docker logs ceph-osd -f```, and confirm success.
!!! note "Zapping the device"
The Ceph OSD container will refuse to destroy a partition containing existing data, so it may be necessary to "zap" the target disk, using:
```
docker run -d --privileged=true \
-v /dev/:/dev/ \
-e OSD_DEVICE=/dev/vdd \
ceph/daemon zap_device
```
### Setup MDSs
In order to mount our ceph pools as filesystems, we'll need Ceph MDS(s). Run the following on each node:
```
docker run -d --net=host \
--name ceph-mds \
--restart always \
-v /var/lib/ceph/:/var/lib/ceph/ \
-v /etc/ceph:/etc/ceph \
-e CEPHFS_CREATE=1 \
-e CEPHFS_DATA_POOL_PG=256 \
-e CEPHFS_METADATA_POOL_PG=256 \
ceph/daemon mds
```
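If you want to confirm the MDS is up and that the cephfs pools were created (the pool names below assume the defaults created by ```CEPHFS_CREATE=1```), the following commands, run from the base OS, should report an active MDS and a filesystem using cephfs_data and cephfs_metadata:
```
ceph mds stat
ceph fs ls
```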
### Apply tweaks
The ceph container seems to configure a pool default of 3 replicas (3 copies of each block are retained), which is one too many for our cluster (we are only protecting against the failure of a single node).
Run the following on any node to reduce the size of the pools to 2 replicas:
```
ceph osd pool set cephfs_data size 2
ceph osd pool set cephfs_metadata size 2
```
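You can confirm the change by querying the pool size; expect to see ```size: 2``` reported for each pool:
```
ceph osd pool get cephfs_data size
ceph osd pool get cephfs_metadata size
```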
Disabled "scrubbing" (which can be IO-intensive, and is unnecessary on a VM) with:
```
ceph osd set noscrub
ceph osd set nodeep-scrub
```
### Create credentials for swarm
In order to mount the ceph volume onto our base host, we need to provide cephx authentication credentials.
On **one** node, create a client for the docker swarm:
```
ceph auth get-or-create client.dockerswarm osd \
'allow rw' mon 'allow r' mds 'allow' > /etc/ceph/keyring.dockerswarm
```
Grab the secret associated with the new user (you'll need this for the /etc/fstab entry below) by running:
```
ceph-authtool /etc/ceph/keyring.dockerswarm -p -n client.dockerswarm
```
### Mount MDS volume
On each node, create a mountpoint for the data by running ```mkdir /var/data```, add an entry to fstab to ensure the volume is auto-mounted on boot, and ensure the volume actually gets _mounted_ even if there's a network/boot delay in reaching the ceph cluster:
```
mkdir /var/data
MYHOST=`hostname -s`
echo -e "
# Mount cephfs volume \n
$MYHOST:6789:/ /var/data/ ceph \
name=dockerswarm\
,secret=<YOUR SECRET HERE>\
,noatime,_netdev,context=system_u:object_r:svirt_sandbox_file_t:s0\
0 2" >> /etc/fstab
mount -a
```
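To confirm the cephfs volume is mounted (and will re-mount after a reboot), check the mount table and available space:
```
mount | grep /var/data
df -h /var/data
```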
## Serving
After completing the above, you should have:
```
[X] Persistent storage available to every node
[X] Resiliency in the event of the failure of a single node
```
## Chef's Notes
Future enhancements to this recipe include:
1. Rather than pasting a secret key into /etc/fstab (which feels wrong), I'd prefer to be able to set "secretfile" in /etc/fstab (which just points ceph.mount to a file containing the secret), but under the current CentOS Atomic, we're stuck with "secret", per https://bugzilla.redhat.com/show_bug.cgi?id=1030402


@@ -2,6 +2,12 @@
While Docker Swarm is great for keeping containers running (_and restarting those that fail_), it does nothing for persistent storage. This means if you actually want your containers to keep any data persistent across restarts (_hint: you do!_), you need to provide shared storage to every docker node.
## Design
### Why GlusterFS?
This GlusterFS recipe was my original design for shared storage, but I [found it to be flawed](ha-docker-swarm/shared-storage-ceph/#why-not-glusterfs), and I replaced it with a [design which employs Ceph instead](ha-docker-swarm/shared-storage-ceph/#why-ceph). This recipe remains as an alternative to the Ceph design, if you happen to prefer GlusterFS.
## Ingredients
!!! summary "Ingredients"


@@ -0,0 +1,3 @@
mkdir {maildata,mailstate,config}

examples/ceph.sh Normal file

@@ -0,0 +1,67 @@
sudo chcon -Rt svirt_sandbox_file_t /etc/ceph
sudo chcon -Rt svirt_sandbox_file_t /var/lib/ceph
docker run -d --net=host \
--privileged=true \
--pid=host \
-v /etc/ceph:/etc/ceph \
-v /var/lib/ceph/:/var/lib/ceph/ \
-v /dev/:/dev/ \
-e OSD_DEVICE=/dev/vdd \
-e OSD_TYPE=disk \
--name="ceph-osd" \
--restart=always \
ceph/daemon osd
docker run -d --net=host \
--restart always \
-v /etc/ceph:/etc/ceph \
-v /var/lib/ceph/:/var/lib/ceph/ \
-e MON_IP=192.168.31.11 \
-e CEPH_PUBLIC_NETWORK=192.168.31.0/24 \
--name="ceph-mon" \
ceph/daemon mon
# On other nodes
ceph auth get client.bootstrap-osd -o /var/lib/ceph/bootstrap-osd/ceph.keyring
docker run -d --net=host \
--name ceph-mds \
--restart always \
-v /var/lib/ceph/:/var/lib/ceph/ \
-v /etc/ceph:/etc/ceph \
-e CEPHFS_CREATE=0 \
ceph/daemon mds
ceph auth get-or-create client.dockerswarm osd 'allow rw' mon 'allow r' mds 'allow' > /etc/ceph/keyring.dockerswarm
ceph-authtool /etc/ceph/keyring.dockerswarm -p -n client.dockerswarm
# Note that the current design seems to provide 3 replicas, which is probably overkill:
#   [root@ds3 traefik]# ceph osd pool get cephfs_data size
#   size: 3
#   [root@ds3 traefik]#
# So I set it to 2:
#   [root@ds3 traefik]# ceph osd pool set cephfs_data size 2
#   set pool 1 size to 2
#   [root@ds3 traefik]# ceph osd pool get cephfs_data size
#   size: 2
#   [root@ds3 traefik]#
# Would like to be able to set secretfile in /etc/fstab, but for now it looks like we're stuck with --secret, per https://bugzilla.redhat.com/show_bug.cgi?id=1030402
# Euught. ceph writes are slow (surprise!)
# I disabled scrubbing with:
ceph osd set noscrub
ceph osd set nodeep-scrub


@@ -20,7 +20,8 @@ pages:
- HA Docker Swarm:
- Design: ha-docker-swarm/design.md
- VMs: ha-docker-swarm/vms.md
- Shared Storage: ha-docker-swarm/shared-storage.md
- Shared Storage (Ceph): ha-docker-swarm/shared-storage-ceph.md
- Shared Storage (GlusterFS): ha-docker-swarm/shared-storage-gluster.md
- Keepalived: ha-docker-swarm/keepalived.md
- Docker Swarm Mode: ha-docker-swarm/docker-swarm-mode.md
- Traefik: ha-docker-swarm/traefik.md