mirror of
https://github.com/funkypenguin/geek-cookbook/
synced 2025-12-12 17:26:19 +00:00
Added recipe for ceph
This commit is contained in:
@@ -2,6 +2,10 @@
## Structure

1. "Recipes" are sorted by the degree of geekiness required to complete them. Relatively straightforward projects are "beginner", more complex projects are "intermediate", and the really fun ones are "advanced".
2. Recipes generally follow on from each other. I.e., if a particular recipe requires a mail server, that mail server will have been described in an earlier recipe.
3. Each recipe contains enough detail in a single page to take a project from start to completion.
4. When there are optional add-ons/integrations possible for a project (i.e., the addition of "smart LED bulbs" to Home Assistant), these will be reflected either as a brief "Chef's note" after the recipe, or, if they're substantial enough, as a sub-page of the main project.

## Conventions

1. When creating swarm networks, we always explicitly set the subnet of the overlay network, to avoid potential conflicts (which docker won't prevent, but which will generate errors - see https://github.com/moby/moby/issues/26912).
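As a sketch of this convention, an overlay network could be pre-created with an explicitly-set subnet (the network name "traefik_public" and the subnet below are illustrative values, not taken from a particular recipe):

```
# Create an overlay network with an explicit subnet, so docker can't
# silently allocate a range that conflicts with another network
docker network create \
  --driver=overlay \
  --subnet=172.16.200.0/24 \
  traefik_public
```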
@@ -24,6 +24,20 @@ This means that:
* Services are defined using docker-compose v3 YAML syntax
* Services are portable, meaning a particular stack could be shut down and moved to a new provider with minimal effort.

## Security

Under this design, the only inbound connections we're permitting to our docker swarm are:

### Network Flows

* HTTP (TCP 80) : Redirects to https
* HTTPS (TCP 443) : Serves individual docker containers via SSL-encrypted reverse proxy

### Authentication

* Where the proxied application provides a trusted level of authentication, or where the application requires public exposure,

## High availability

### Normal function
docs/ha-docker-swarm/shared-storage-ceph.md (new file, 179 lines)
@@ -0,0 +1,179 @@
# Introduction

While Docker Swarm is great for keeping containers running (_and restarting those that fail_), it does nothing for persistent storage. This means if you actually want your containers to keep any data persistent across restarts (_hint: you do!_), you need to provide shared storage to every docker node.

## Design

### Why not GlusterFS?

I originally provided shared storage to my nodes using GlusterFS (see the next recipe for details), but found it difficult to deal with because:

1. GlusterFS requires (n) "bricks", where (n) **has** to be a multiple of your replica count. I.e., if you want 2 copies of everything on shared storage (the minimum to provide redundancy), you **must** have either 2, 4, 6 (etc.) bricks. The HA swarm design calls for a minimum of 3 nodes, and so under GlusterFS, my third node can't participate in shared storage at all, unless I start doubling up on bricks-per-node (which then impacts redundancy).
2. GlusterFS turns out to be a giant PITA when you want to restore a failed node. There are at [least 14 steps to follow](https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3/html/Administration_Guide/sect-Replacing_Hosts.html) to replace a brick.
3. I'm pretty sure I messed up the 14-step process above anyway. My replaced brick synced with my "original" brick, but produced errors when querying status via the CLI, and hogged 100% of 1 CPU on the replaced node. Inexperienced with GlusterFS, and unable to diagnose the fault, I switched to a Ceph cluster instead.
### Why Ceph?

1. I'm more familiar with Ceph - I use it in the OpenStack designs I manage.
2. Replacing a failed node is **easy**, provided you can put up with the I/O load of rebalancing OSDs after the replacement.
3. CentOS Atomic includes the ceph client in the OS, so while the Ceph OSD/Mon/MDS daemons run in containers, I can keep an eye on (and later, automatically monitor) the status of Ceph from the base OS.

## Ingredients

!!! summary "Ingredients"
    3 x Virtual Machines (configured earlier), each with:

    * [X] CentOS/Fedora Atomic
    * [X] At least 1GB RAM
    * [X] At least 20GB disk space (_but it'll be tight_)
    * [X] Connectivity to each other within the same subnet, and on a low-latency link (_i.e., no WAN links_)
    * [ ] A second disk dedicated to the Ceph OSD
## Preparation

### SELinux

Since our Ceph components will be containerized, we need to ensure the SELinux context on the base OS's ceph files is set correctly:

```
chcon -Rt svirt_sandbox_file_t /etc/ceph
chcon -Rt svirt_sandbox_file_t /var/lib/ceph
```
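To confirm the contexts were applied, you can (optionally) list them; `svirt_sandbox_file_t` should appear in the output:

```
# Display the SELinux context on the ceph directories
ls -dZ /etc/ceph /var/lib/ceph
```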
### Setup Monitors

Pick a node, and run the following to stand up the first Ceph mon. Be sure to replace the values for **MON_IP** and **CEPH_PUBLIC_NETWORK** with those specific to your deployment:

```
docker run -d --net=host \
--restart always \
-v /etc/ceph:/etc/ceph \
-v /var/lib/ceph/:/var/lib/ceph/ \
-e MON_IP=192.168.31.11 \
-e CEPH_PUBLIC_NETWORK=192.168.31.0/24 \
--name="ceph-mon" \
ceph/daemon mon
```

Now **copy** the contents of /etc/ceph on this first node to the remaining nodes, and **then** run the docker command above on each remaining node. You'll end up with a cluster of 3 monitors (an odd number is required for quorum, same as Docker Swarm), and no OSDs (yet).
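How you copy the config is up to you; a minimal sketch using scp (assuming root SSH access between nodes, with hypothetical hostnames ds2 and ds3) might be:

```
# Push the first mon's config and keys to the remaining nodes
# (hostnames ds2/ds3 are illustrative)
for node in ds2 ds3; do
  scp -r /etc/ceph/ root@${node}:/etc/
done
```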
### Setup OSDs

Since we currently have an OSD-less, mon-only cluster, prepare for OSD creation by dumping the auth credentials for the OSDs into the appropriate location on the base OS:

```
ceph auth get client.bootstrap-osd -o \
/var/lib/ceph/bootstrap-osd/ceph.keyring
```

On each node, you need a dedicated disk for the OSD. In the example below, I used _/dev/vdd_ (the entire disk, no partitions) for the OSD.

Run the following command on every node:
```
docker run -d --net=host \
--privileged=true \
--pid=host \
-v /etc/ceph:/etc/ceph \
-v /var/lib/ceph/:/var/lib/ceph/ \
-v /dev/:/dev/ \
-e OSD_DEVICE=/dev/vdd \
-e OSD_TYPE=disk \
--name="ceph-osd" \
--restart=always \
ceph/daemon osd
```

Watch the output by running `docker logs ceph-osd -f`, and confirm success.

!!! note "Zapping the device"
    The Ceph OSD container will refuse to destroy a partition containing existing data, so it may be necessary to "zap" the target disk, using:

    ```
    docker run -d --privileged=true \
    -v /dev/:/dev/ \
    -e OSD_DEVICE=/dev/vdd \
    ceph/daemon zap_device
    ```
### Setup MDSs

In order to mount our ceph pools as filesystems, we'll need Ceph MDS(s). Run the following on each node:

```
docker run -d --net=host \
--name ceph-mds \
--restart always \
-v /var/lib/ceph/:/var/lib/ceph/ \
-v /etc/ceph:/etc/ceph \
-e CEPHFS_CREATE=1 \
-e CEPHFS_DATA_POOL_PG=256 \
-e CEPHFS_METADATA_POOL_PG=256 \
ceph/daemon mds
```
### Apply tweaks

The ceph container seems to configure a pool default of 3 replicas (3 copies of each block are retained), which is one too many for our cluster (we are only protecting against the failure of a single node).

Run the following on any node to reduce the size of the pools to 2 replicas:

```
ceph osd pool set cephfs_data size 2
ceph osd pool set cephfs_metadata size 2
```

Disable "scrubbing" (which can be IO-intensive, and is unnecessary on a VM) with:

```
ceph osd set noscrub
ceph osd set nodeep-scrub
```
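To confirm the change took effect, query the pool size back; on this design you'd expect `size: 2` for each pool:

```
# Verify the replica count on both cephfs pools
ceph osd pool get cephfs_data size
ceph osd pool get cephfs_metadata size
```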
### Create credentials for swarm

In order to mount the ceph volume onto our base hosts, we need to provide cephx authentication credentials.

On **one** node, create a client for the docker swarm:

```
ceph auth get-or-create client.dockerswarm osd \
'allow rw' mon 'allow r' mds 'allow' > /etc/ceph/keyring.dockerswarm
```

Grab the secret associated with the new user (you'll need this for the /etc/fstab entry below) by running:

```
ceph-authtool /etc/ceph/keyring.dockerswarm -p -n client.dockerswarm
```
### Mount MDS volume

On each node, create a mountpoint for the data by running `mkdir /var/data`, add an entry to fstab to ensure the volume is auto-mounted on boot, and ensure the volume is actually _mounted_ if there's a network / boot delay getting access to the ceph volume:

```
mkdir /var/data

MYHOST=`hostname -s`
echo -e "
# Mount cephfs volume \n
$MYHOST:6789:/ /var/data/ ceph \
name=dockerswarm\
,secret=<YOUR SECRET HERE>\
,noatime,_netdev,context=system_u:object_r:svirt_sandbox_file_t:s0\
0 2" >> /etc/fstab
mount -a
```
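To confirm the volume actually mounted, a quick (illustrative) check is:

```
# The cephfs mount should appear in both outputs
mount | grep /var/data
df -h /var/data
```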
## Serving

After completing the above, you should have:

```
[X] Persistent storage available to every node
[X] Resiliency in the event of the failure of a single node
```
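A simple way to smoke-test the shared storage (a sketch, using a hypothetical test file) is to write a file on one node and read it back from another:

```
# On the first node:
echo "hello from node 1" > /var/data/test.txt

# On any other node, the same content should be visible:
cat /var/data/test.txt
rm /var/data/test.txt
```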
## Chef's Notes

Future enhancements to this recipe include:

1. Rather than pasting a secret key into /etc/fstab (which feels wrong), I'd prefer to be able to set "secretfile" in /etc/fstab (which just points ceph.mount to a file containing the secret), but under the current CentOS Atomic, we're stuck with "secret", per https://bugzilla.redhat.com/show_bug.cgi?id=1030402
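For illustration, the preferred (but currently unusable) fstab entry would use "secretfile" instead of an inline secret, something like this sketch (the keyfile path is hypothetical):

```
# Sketch only - secretfile is not usable under current CentOS Atomic (see bug above)
$MYHOST:6789:/ /var/data/ ceph name=dockerswarm,secretfile=/etc/ceph/dockerswarm.secret,noatime,_netdev 0 2
```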
@@ -2,6 +2,12 @@
While Docker Swarm is great for keeping containers running (_and restarting those that fail_), it does nothing for persistent storage. This means if you actually want your containers to keep any data persistent across restarts (_hint: you do!_), you need to provide shared storage to every docker node.

## Design

### Why GlusterFS?

This GlusterFS recipe was my original design for shared storage, but I [found it to be flawed](ha-docker-swarm/shared-storage-ceph/#why-not-glusterfs), and I replaced it with a [design which employs Ceph instead](ha-docker-swarm/shared-storage-ceph/#why-ceph). This recipe is an alternative to the Ceph design, if you happen to prefer GlusterFS.

## Ingredients

!!! summary "Ingredients"
docs/recipies/docker-mailserver.md (new file, 3 lines)
@@ -0,0 +1,3 @@
mkdir {maildata,mailstate,config}
examples/ceph.sh (new file, 67 lines)
@@ -0,0 +1,67 @@
sudo chcon -Rt svirt_sandbox_file_t /etc/ceph
sudo chcon -Rt svirt_sandbox_file_t /var/lib/ceph

docker run -d --net=host \
--privileged=true \
--pid=host \
-v /etc/ceph:/etc/ceph \
-v /var/lib/ceph/:/var/lib/ceph/ \
-v /dev/:/dev/ \
-e OSD_DEVICE=/dev/vdd \
-e OSD_TYPE=disk \
--name="ceph-osd" \
--restart=always \
ceph/daemon osd

docker run -d --net=host \
--restart always \
-v /etc/ceph:/etc/ceph \
-v /var/lib/ceph/:/var/lib/ceph/ \
-e MON_IP=192.168.31.11 \
-e CEPH_PUBLIC_NETWORK=192.168.31.0/24 \
--name="ceph-mon" \
ceph/daemon mon

On other nodes:

ceph auth get client.bootstrap-osd -o /var/lib/ceph/bootstrap-osd/ceph.keyring

docker run -d --net=host \
--name ceph-mds \
--restart always \
-v /var/lib/ceph/:/var/lib/ceph/ \
-v /etc/ceph:/etc/ceph \
-e CEPHFS_CREATE=0 \
ceph/daemon mds

ceph auth get-or-create client.dockerswarm osd 'allow rw' mon 'allow r' mds 'allow' > /etc/ceph/keyring.dockerswarm

ceph-authtool /etc/ceph/keyring.dockerswarm -p -n client.dockerswarm

Note that the current design seems to provide 3 replicas, which is probably overkill:

[root@ds3 traefik]# ceph osd pool get cephfs_data size
size: 3
[root@ds3 traefik]#

So I set it to 2:

[root@ds3 traefik]# ceph osd pool set cephfs_data size 2
set pool 1 size to 2
[root@ds3 traefik]# ceph osd pool get cephfs_data size
size: 2
[root@ds3 traefik]#

Would like to be able to set secretfile in /etc/fstab, but for now it looks like we're stuck with --secret, per https://bugzilla.redhat.com/show_bug.cgi?id=1030402

Euught. Ceph writes are slow (surprise!)

I disabled scrubbing with:

ceph osd set noscrub
ceph osd set nodeep-scrub
@@ -20,7 +20,8 @@ pages:
- HA Docker Swarm:
  - Design: ha-docker-swarm/design.md
  - VMs: ha-docker-swarm/vms.md
  - Shared Storage: ha-docker-swarm/shared-storage.md
  - Shared Storage (Ceph): ha-docker-swarm/shared-storage-ceph.md
  - Shared Storage (GlusterFS): ha-docker-swarm/shared-storage-gluster.md
  - Keepalived: ha-docker-swarm/keepalived.md
  - Docker Swarm Mode: ha-docker-swarm/docker-swarm-mode.md
  - Traefik: ha-docker-swarm/traefik.md