# Shared Storage (Ceph)
While Docker Swarm is great for keeping containers running (_and restarting those that fail_), it does nothing for persistent storage. This means if you actually want your containers to keep any data persistent across restarts (_hint: you do!_), you need to provide shared storage to every docker node.
## Design
### Why not GlusterFS?
I originally provided shared storage to my nodes using GlusterFS (see the next recipe for details), but found it difficult to deal with because:
1. GlusterFS requires (n) "bricks", where (n) **has** to be a multiple of your replica count. I.e., if you want 2 copies of everything on shared storage (the minimum to provide redundancy), you **must** have either 2, 4, 6 (etc.) bricks. The HA swarm design calls for a minimum of 3 nodes, so under GlusterFS my third node can't participate in shared storage at all, unless I start doubling up on bricks-per-node (which then impacts redundancy).
2. GlusterFS turns out to be a giant PITA when you want to restore a failed node. There are [at least 14 steps to follow](https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3/html/Administration_Guide/sect-Replacing_Hosts.html) to replace a brick.
3. I'm pretty sure I messed up the 14-step process above anyway. My replaced brick synced with my "original" brick, but produced errors when I queried its status via the CLI, and hogged 100% of one CPU on the replaced node. Inexperienced with GlusterFS, and unable to diagnose the fault, I switched to a Ceph cluster instead.
### Why Ceph?
1. I'm more familiar with Ceph, which I use in the OpenStack designs I manage.
2. Replacing a failed node is **easy**, provided you can put up with the I/O load of rebalancing OSDs after the replacement.
3. CentOS Atomic includes the ceph client in the OS, so while the Ceph OSD/Mon/MDS daemons run in containers, I can keep an eye on (and later, automatically monitor) the status of Ceph from the base OS, as shown below.
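For example, since the ceph CLI lives on the base OS, a quick health check from any node might look like this (standard ceph commands; output will vary with your deployment):

```
# Summarised cluster state: mons, OSDs, pool usage
ceph -s

# One-line health flag, handy for later scripting/monitoring
ceph health
```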
## Ingredients
!!! summary "Ingredients"
    3 x Virtual Machines (configured earlier), each with:

    * [X] CentOS/Fedora Atomic
    * [X] At least 1GB RAM
    * [X] At least 20GB disk space (_but it'll be tight_)
    * [X] Connectivity to each other within the same subnet, and on a low-latency link (_i.e., no WAN links_)
    * [ ] A second disk dedicated to the Ceph OSD
## Preparation
### SELinux
Since our Ceph components will be containerized, we need to ensure the SELinux context on the base OS's ceph files is set correctly:
```
# Create the ceph state directory on the base OS
mkdir /var/lib/ceph

# Label both the config and state dirs so containers may access them
chcon -Rt svirt_sandbox_file_t /etc/ceph
chcon -Rt svirt_sandbox_file_t /var/lib/ceph
```
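To confirm the labels were applied, you can check the SELinux context directly (a quick sanity check; `ls -Z` output format varies by release):

```
ls -dZ /etc/ceph /var/lib/ceph
```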
### Setup Monitors
Pick a node, and run the following to stand up the first Ceph mon. Be sure to replace the values for **MON_IP** and **CEPH_PUBLIC_NETWORK** with those specific to your deployment:
```
docker run -d --net=host \
--restart always \
-v /etc/ceph:/etc/ceph \
-v /var/lib/ceph/:/var/lib/ceph/ \
-e MON_IP=192.168.31.11 \
-e CEPH_PUBLIC_NETWORK=192.168.31.0/24 \
--name="ceph-mon" \
ceph/daemon mon
```
Now **copy** the contents of /etc/ceph on this first node to the remaining nodes (see the example below), and **then** run the docker command above (_customizing MON_IP as you go_) on each remaining node. You'll end up with a cluster of 3 monitors (an odd number is required for quorum, just like Docker Swarm), and no OSDs (yet).
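A minimal sketch of that copy, assuming the remaining nodes are reachable as `node2` and `node3` (hypothetical hostnames, substitute your own):

```
# Run from the first node: replicates the cluster config and keys
scp -r /etc/ceph/* node2:/etc/ceph/
scp -r /etc/ceph/* node3:/etc/ceph/
```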
### Setup OSDs
Since we currently have an OSD-less, mon-only cluster, prepare for OSD creation by dumping the auth credentials for the OSDs into the appropriate location on the base OS:
```
ceph auth get client.bootstrap-osd -o \
/var/lib/ceph/bootstrap-osd/ceph.keyring
```
On each node, you need a dedicated disk for the OSD. In the example below, I used _/dev/vdd_ (the entire disk, no partitions) for the OSD.
Run the following command on every node:
```
docker run -d --net=host \
--privileged=true \
--pid=host \
-v /etc/ceph:/etc/ceph \
-v /var/lib/ceph/:/var/lib/ceph/ \
-v /dev/:/dev/ \
-e OSD_DEVICE=/dev/vdd \
-e OSD_TYPE=disk \
--name="ceph-osd" \
--restart=always \
ceph/daemon osd
```
Watch the output by running `docker logs ceph-osd -f`, and confirm success.
!!! note "Zapping the device"
    The Ceph OSD container will refuse to destroy a partition containing existing data, so it may be necessary to "zap" the target disk (_/dev/vdd in our example_) first, using:

    ```
    docker run -d --privileged=true \
    -v /dev/:/dev/ \
    -e OSD_DEVICE=/dev/vdd \
    ceph/daemon zap_device
    ```
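Once the OSD container is running on all three nodes, a quick check from any node should show 3 OSDs up and in (standard ceph commands; IDs and hostnames will differ in your deployment):

```
# Tree of hosts and their OSDs, with up/down status
ceph osd tree

# Cluster summary; look for "3 osds: 3 up, 3 in"
ceph -s
```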
### Setup MDSs
In order to mount our ceph pools as filesystems, we'll need Ceph MDS(s). Run the following on each node:
```
docker run -d --net=host \
--name ceph-mds \
--restart always \
-v /var/lib/ceph/:/var/lib/ceph/ \
-v /etc/ceph:/etc/ceph \
-e CEPHFS_CREATE=1 \
-e CEPHFS_DATA_POOL_PG=256 \
-e CEPHFS_METADATA_POOL_PG=256 \
ceph/daemon mds
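Since CEPHFS_CREATE=1 is set, the first MDS container should create the data/metadata pools and the cephfs filesystem itself. You can verify this from any node (standard ceph commands, assuming the default pool names above):

```
# Confirm the filesystem exists, backed by cephfs_data/cephfs_metadata
ceph fs ls

# Confirm one MDS is active (the others will be standby)
ceph mds stat
```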
```
### Apply tweaks
The ceph container seems to configure a pool default of 3 replicas (i.e., 3 copies of each object are retained), which is one more than our cluster needs (we're only protecting against the failure of a single node).
Run the following on any node to reduce the pools to 2 replicas:
```
ceph osd pool set cephfs_data size 2
ceph osd pool set cephfs_metadata size 2
```
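Confirm the new replica count with:

```
ceph osd pool get cephfs_data size
ceph osd pool get cephfs_metadata size
```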
Disable "scrubbing" (which can be I/O-intensive, and is unnecessary on a VM) with:
```
ceph osd set noscrub
ceph osd set nodeep-scrub
```
### Create credentials for swarm
In order to mount the ceph volume onto our base host, we need to provide cephx authentication credentials.
On **one** node, create a client for the docker swarm:
```
ceph auth get-or-create client.dockerswarm osd \
'allow rw' mon 'allow r' mds 'allow' > /etc/ceph/keyring.dockerswarm
```
Grab the secret associated with the new user (you'll need this for the /etc/fstab entry below) by running:
```
ceph-authtool /etc/ceph/keyring.dockerswarm -p -n client.dockerswarm
```
### Mount MDS volume
On each node, create a mountpoint for the data by running `mkdir /var/data`, add an entry to /etc/fstab to ensure the volume is auto-mounted on boot, and ensure the volume is actually _mounted_ if there's a network / boot delay getting access to the ceph volume:
```
mkdir /var/data
MYHOST=`hostname -s`
echo -e "
# Mount cephfs volume \n
$MYHOST:6789:/ /var/data/ ceph \
name=dockerswarm\
,secret=<YOUR SECRET HERE>\
,noatime,_netdev,context=system_u:object_r:svirt_sandbox_file_t:s0\
0 2" >> /etc/fstab
mount -a
```
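If the mount succeeded, the cephfs volume should now appear at /var/data:

```
# Confirm cephfs is mounted where the swarm expects it
df -h /var/data
mount | grep /var/data
```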
### Install docker-volume plugin
Upstream bug for docker-latest reported at https://bugs.centos.org/view.php?id=13609, and the related alpine fault at https://github.com/gliderlabs/docker-alpine/issues/317.
## Serving
After completing the above, you should have:
* [X] Persistent storage available to every node
* [X] Resiliency in the event of the failure of a single node
## Chef's Notes
Future enhancements to this recipe include:
1. Rather than pasting a secret key into /etc/fstab (which feels wrong), I'd prefer to set "secretfile" in /etc/fstab (which just points ceph.mount at a file containing the secret), but under the current CentOS Atomic, we're stuck with "secret", per https://bugzilla.redhat.com/show_bug.cgi?id=1030402 (a sketch of the preferred approach follows below).
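For illustration, the preferred approach would look something like this (a sketch only; `secretfile=` is a standard mount.ceph option, but see the bug above for why it doesn't yet work on Atomic):

```
# Keep the secret out of fstab, readable only by root
echo "<YOUR SECRET HERE>" > /etc/ceph/dockerswarm.secret
chmod 600 /etc/ceph/dockerswarm.secret

# The fstab entry would then reference the file rather than embedding the secret:
# $MYHOST:6789:/ /var/data/ ceph name=dockerswarm,secretfile=/etc/ceph/dockerswarm.secret,noatime,_netdev 0 2
```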