From 22fbf25cfbee7f881c85d75a1761e41776a06979 Mon Sep 17 00:00:00 2001
From: David Young
Date: Fri, 28 Jul 2017 10:25:38 +1200
Subject: [PATCH] Added recipe for ceph

---
 docs/README.md | 8 +-
 docs/ha-docker-swarm/design.md | 14 ++
 docs/ha-docker-swarm/shared-storage-ceph.md | 179 ++++++++++++++++++
 ...d-storage.md => shared-storage-gluster.md} | 6 +
 docs/recipies/docker-mailserver.md | 3 +
 examples/ceph.sh | 67 +++++++
 mkdocs.yml | 3 +-
 7 files changed, 277 insertions(+), 3 deletions(-)
 create mode 100644 docs/ha-docker-swarm/shared-storage-ceph.md
 rename docs/ha-docker-swarm/{shared-storage.md => shared-storage-gluster.md} (94%)
 create mode 100644 docs/recipies/docker-mailserver.md
 create mode 100644 examples/ceph.sh

diff --git a/docs/README.md b/docs/README.md
index 7a06b4e..2d12d8f 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -2,6 +2,10 @@
 ## Structure
 
-1. "Recipies" are sorted by degree of geekiness required to complete them. Relatively straightforward projects are "beginner", more complex projects are "intermediate", and the really fun ones are "advanced".
-2. Each recipe contains enough detail in a single page to take a project to completion.
+1. "Recipies" generally follow on from each other. I.e., if a particular recipe requires a mail server, that mail server would have been described in an earlier recipe.
+2. Each recipe contains enough detail in a single page to take a project from start to completion.
 3. When there are optional add-ons/integrations possible to a project (i.e., the addition of "smart LED bulbs" to Home Assistant), this will be reflected either as a brief "Chef's note" after the recipe, or if they're substantial enough, as a sub-page of the main project
+
+## Conventions
+
+1. When creating swarm networks, we always explicitly set the subnet in the overlay network, to avoid potential conflicts (which docker won't prevent, but which will generate errors) (https://github.com/moby/moby/issues/26912)
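As an illustration of the convention above (the network name and subnet here are hypothetical placeholders - substitute whatever suits your environment), an overlay network gets created with an explicit subnet like so:

```
# Run on a swarm manager node; pick a subnet that doesn't collide with anything else
docker network create \
  --driver overlay \
  --subnet 172.16.200.0/24 \
  traefik_public
```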
diff --git a/docs/ha-docker-swarm/design.md b/docs/ha-docker-swarm/design.md
index 101373e..00196c1 100644
--- a/docs/ha-docker-swarm/design.md
+++ b/docs/ha-docker-swarm/design.md
@@ -24,6 +24,20 @@ This means that:
 * Services are defined using docker-compose v3 YAML syntax
 * Services are portable, meaning a particular stack could be shut down and moved to a new provider with minimal effort.
 
+## Security
+
+Under this design, the only inbound connections we're permitting to our docker swarm are:
+
+### Network Flows
+
+* HTTP (TCP 80) : Redirects to https
+* HTTPS (TCP 443) : Serves individual docker containers via SSL-encrypted reverse proxy
+
+### Authentication
+
+* Where the proxied application provides a trusted level of authentication, or where the application requires public exposure,
+
+
 ## High availability
 
 ### Normal function
diff --git a/docs/ha-docker-swarm/shared-storage-ceph.md b/docs/ha-docker-swarm/shared-storage-ceph.md
new file mode 100644
index 0000000..192a90a
--- /dev/null
+++ b/docs/ha-docker-swarm/shared-storage-ceph.md
@@ -0,0 +1,179 @@
# Introduction

While Docker Swarm is great for keeping containers running (_and restarting those that fail_), it does nothing for persistent storage. This means if you actually want your containers to keep any data persistent across restarts (_hint: you do!_), you need to provide shared storage to every docker node.

## Design

### Why not GlusterFS?

I originally provided shared storage to my nodes using GlusterFS (see the next recipe for details), but found it difficult to deal with because:

1. GlusterFS requires (n) "bricks", where (n) **has** to be a multiple of your replica count. I.e., if you want 2 copies of everything on shared storage (the minimum to provide redundancy), you **must** have either 2, 4, 6 (etc.) bricks. The HA swarm design calls for a minimum of 3 nodes, and so under GlusterFS, my third node can't participate in shared storage at all, unless I start doubling up on bricks-per-node (which then impacts redundancy).
2. GlusterFS turns out to be a giant PITA when you want to restore a failed node. There are [at least 14 steps to follow](https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3/html/Administration_Guide/sect-Replacing_Hosts.html) to replace a brick.
3. I'm pretty sure I messed up the 14-step process above anyway. My replaced brick synced with my "original" brick, but produced errors when querying status via the CLI, and hogged 100% of 1 CPU on the replaced node. Inexperienced with GlusterFS, and unable to diagnose the fault, I switched to a Ceph cluster instead.

### Why Ceph?

1. I'm more familiar with Ceph - I use it in the OpenStack designs I manage.
2. Replacing a failed node is **easy**, provided you can put up with the I/O load of rebalancing OSDs after the replacement.
3. CentOS Atomic includes the ceph client in the OS, so while the Ceph OSD/Mon/MDS daemons run in containers, I can keep an eye on (and later, automatically monitor) the status of Ceph from the base OS.

## Ingredients

!!! summary "Ingredients"
    3 x Virtual Machines (configured earlier), each with:

    * [X] CentOS/Fedora Atomic
    * [X] At least 1GB RAM
    * [X] At least 20GB disk space (_but it'll be tight_)
    * [X] Connectivity to each other within the same subnet, and on a low-latency link (_i.e., no WAN links_)
    * [ ] A second disk dedicated to the Ceph OSD

## Preparation

### SELinux

Since our Ceph components will be containerized, we need to ensure the SELinux context on the base OS's ceph files is set correctly:

```
chcon -Rt svirt_sandbox_file_t /etc/ceph
chcon -Rt svirt_sandbox_file_t /var/lib/ceph
```

### Setup Monitors

Pick a node, and run the following to stand up the first Ceph mon. Be sure to replace the values of **MON_IP** and **CEPH_PUBLIC_NETWORK** with those specific to your deployment:

```
docker run -d --net=host \
--restart always \
-v /etc/ceph:/etc/ceph \
-v /var/lib/ceph/:/var/lib/ceph/ \
-e MON_IP=192.168.31.11 \
-e CEPH_PUBLIC_NETWORK=192.168.31.0/24 \
--name="ceph-mon" \
ceph/daemon mon
```

Now **copy** the contents of /etc/ceph on this first node to the remaining nodes, and **then** run the docker command above on each remaining node. You'll end up with a cluster with 3 monitors (an odd number is required for quorum, just as with Docker Swarm), and no OSDs (yet).
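Any copy mechanism will do. Assuming root SSH access between the nodes (the hostnames below are examples only), scp is the simplest:

```
scp -pr /etc/ceph/* root@ds2:/etc/ceph/
scp -pr /etc/ceph/* root@ds3:/etc/ceph/
```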
### Setup OSDs

Since we currently have an OSD-less, mon-only cluster, prepare for OSD creation by dumping the auth credentials for the OSDs into the appropriate location on the base OS:

```
ceph auth get client.bootstrap-osd -o \
/var/lib/ceph/bootstrap-osd/ceph.keyring
```

On each node, you need a dedicated disk for the OSD. In the example below, I used _/dev/vdd_ (the entire disk, no partitions) for the OSD.

Run the following command on every node:

```
docker run -d --net=host \
--privileged=true \
--pid=host \
-v /etc/ceph:/etc/ceph \
-v /var/lib/ceph/:/var/lib/ceph/ \
-v /dev/:/dev/ \
-e OSD_DEVICE=/dev/vdd \
-e OSD_TYPE=disk \
--name="ceph-osd" \
--restart=always \
ceph/daemon osd
```

Watch the output by running ```docker logs ceph-osd -f```, and confirm success.

!!! note "Zapping the device"
    The Ceph OSD container will refuse to destroy a partition containing existing data, so it may be necessary to "zap" the target disk, using:
    ```
    docker run -d --privileged=true \
    -v /dev/:/dev/ \
    -e OSD_DEVICE=/dev/vdd \
    ceph/daemon zap_device
    ```

### Setup MDSs

In order to mount our ceph pools as filesystems, we'll need Ceph MDSs. Run the following on each node:

```
docker run -d --net=host \
--name ceph-mds \
--restart always \
-v /var/lib/ceph/:/var/lib/ceph/ \
-v /etc/ceph:/etc/ceph \
-e CEPHFS_CREATE=1 \
-e CEPHFS_DATA_POOL_PG=256 \
-e CEPHFS_METADATA_POOL_PG=256 \
ceph/daemon mds
```

### Apply tweaks

The ceph container seems to configure a pool default of 3 replicas (3 copies of each block are retained), which is one too many for our cluster (we are only protecting against the failure of a single node).

Run the following on any node to reduce the size of the pools to 2 replicas:

```
ceph osd pool set cephfs_data size 2
ceph osd pool set cephfs_metadata size 2
```

Disable "scrubbing" (which can be IO-intensive, and is unnecessary on a VM) with:

```
ceph osd set noscrub
ceph osd set nodeep-scrub
```

### Create credentials for swarm

In order to mount the ceph volume onto our base host, we need to provide cephx authentication credentials.

On **one** node, create a client for the docker swarm:

```
ceph auth get-or-create client.dockerswarm osd \
'allow rw' mon 'allow r' mds 'allow' > /etc/ceph/keyring.dockerswarm
```

Grab the secret associated with the new user (you'll need this for the /etc/fstab entry below) by running:

```
ceph-authtool /etc/ceph/keyring.dockerswarm -p -n client.dockerswarm
```

### Mount MDS volume

On each node, create a mountpoint for the data by running ```mkdir /var/data```, add an entry to fstab to ensure the volume is auto-mounted on boot, and ensure the volume is actually _mounted_ in case a network / boot delay prevented access to the ceph cluster:

```
mkdir /var/data

MYHOST=`hostname -s`
echo -e "
# Mount cephfs volume \n
$MYHOST:6789:/ /var/data/ ceph \
name=dockerswarm\
,secret=\
,noatime,_netdev,context=system_u:object_r:svirt_sandbox_file_t:s0 \
0 2" >> /etc/fstab
mount -a
```
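As a quick sanity check (these commands aren't part of the recipe proper, just a way to verify the result), confirm that cephfs is mounted and that the cluster reports a healthy state:

```
df -h /var/data
ceph -s
```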
## Serving

After completing the above, you should have:

```
[X] Persistent storage available to every node
[X] Resiliency in the event of the failure of a single node
```

## Chef's Notes

Future enhancements to this recipe include:

1. Rather than pasting a secret key into /etc/fstab (which feels wrong), I'd prefer to be able to set "secretfile" in /etc/fstab (which just points ceph.mount to a file containing the secret), but under the current CentOS Atomic, we're stuck with "secret", per https://bugzilla.redhat.com/show_bug.cgi?id=1030402
diff --git a/docs/ha-docker-swarm/shared-storage.md b/docs/ha-docker-swarm/shared-storage-gluster.md
similarity index 94%
rename from docs/ha-docker-swarm/shared-storage.md
rename to docs/ha-docker-swarm/shared-storage-gluster.md
index db215bf..a11753e 100644
--- a/docs/ha-docker-swarm/shared-storage.md
+++ b/docs/ha-docker-swarm/shared-storage-gluster.md
@@ -2,6 +2,12 @@
 
 While Docker Swarm is great for keeping containers running (_and restarting those that fail_), it does nothing for persistent storage. This means if you actually want your containers to keep any data persistent across restarts (_hint: you do!_), you need to provide shared storage to every docker node.
 
+## Design
+
+### Why GlusterFS?
+
+This GlusterFS recipe was my original design for shared storage, but I [found it to be flawed](ha-docker-swarm/shared-storage-ceph/#why-not-glusterfs), and I replaced it with a [design which employs Ceph instead](ha-docker-swarm/shared-storage-ceph/#why-ceph). This recipe is an alternative to the Ceph design, if you happen to prefer GlusterFS.
+
 ## Ingredients
 
 !!! summary "Ingredients"
diff --git a/docs/recipies/docker-mailserver.md b/docs/recipies/docker-mailserver.md
new file mode 100644
index 0000000..a00f0cf
--- /dev/null
+++ b/docs/recipies/docker-mailserver.md
@@ -0,0 +1,3 @@
 mkdir {maildata,mailstate,config}

diff --git a/examples/ceph.sh b/examples/ceph.sh
new file mode 100644
index 0000000..1078d55
--- /dev/null
+++ b/examples/ceph.sh
@@ -0,0 +1,67 @@
sudo chcon -Rt svirt_sandbox_file_t /etc/ceph
sudo chcon -Rt svirt_sandbox_file_t /var/lib/ceph

docker run -d --net=host \
--privileged=true \
--pid=host \
-v /etc/ceph:/etc/ceph \
-v /var/lib/ceph/:/var/lib/ceph/ \
-v /dev/:/dev/ \
-e OSD_DEVICE=/dev/vdd \
-e OSD_TYPE=disk \
--name="ceph-osd" \
--restart=always \
ceph/daemon osd


docker run -d --net=host \
--restart always \
-v /etc/ceph:/etc/ceph \
-v /var/lib/ceph/:/var/lib/ceph/ \
-e MON_IP=192.168.31.11 \
-e CEPH_PUBLIC_NETWORK=192.168.31.0/24 \
--name="ceph-mon" \
ceph/daemon mon

On other nodes:

ceph auth get client.bootstrap-osd -o /var/lib/ceph/bootstrap-osd/ceph.keyring


docker run -d --net=host \
--name ceph-mds \
--restart always \
-v /var/lib/ceph/:/var/lib/ceph/ \
-v /etc/ceph:/etc/ceph \
-e CEPHFS_CREATE=0 \
ceph/daemon mds


ceph auth get-or-create client.dockerswarm osd 'allow rw' mon 'allow r' mds 'allow' > /etc/ceph/keyring.dockerswarm

ceph-authtool /etc/ceph/keyring.dockerswarm -p -n client.dockerswarm

Note that the current design seems to provide 3 replicas, which is probably overkill:

[root@ds3 traefik]# ceph osd pool get cephfs_data size
size: 3
[root@ds3 traefik]#


So I set it to 2:

[root@ds3 traefik]# ceph osd pool set cephfs_data size 2
set pool 1 size to 2
[root@ds3 traefik]# ceph osd pool get cephfs_data size
size: 2
[root@ds3 traefik]#

Would like to be able to set secretfile in /etc/fstab, but for now it looks like we're stuck with --secret, per https://bugzilla.redhat.com/show_bug.cgi?id=1030402

Euught. Ceph writes are slow (surprise!)
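A rough way to quantify this might be a direct-I/O dd against the cephfs mount (test file path and size are arbitrary):

dd if=/dev/zero of=/var/data/dd-test bs=1M count=256 oflag=direct
rm /var/data/dd-test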
I disabled scrubbing with:

ceph osd set noscrub
ceph osd set nodeep-scrub
diff --git a/mkdocs.yml b/mkdocs.yml
index 2685e87..d09bd78 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -20,7 +20,8 @@ pages:
 - HA Docker Swarm:
   - Design: ha-docker-swarm/design.md
   - VMs: ha-docker-swarm/vms.md
-  - Shared Storage: ha-docker-swarm/shared-storage.md
+  - Shared Storage (Ceph): ha-docker-swarm/shared-storage-ceph.md
+  - Shared Storage (GlusterFS): ha-docker-swarm/shared-storage-gluster.md
   - Keepalived: ha-docker-swarm/keepalived.md
   - Docker Swarm Mode: ha-docker-swarm/docker-swarm-mode.md
   - Traefik: ha-docker-swarm/traefik.md