From 22fbf25cfbee7f881c85d75a1761e41776a06979 Mon Sep 17 00:00:00 2001
From: David Young
Date: Fri, 28 Jul 2017 10:25:38 +1200
Subject: [PATCH] Added recipe for ceph

---
 docs/README.md | 8 +-
 docs/ha-docker-swarm/design.md | 14 ++
 docs/ha-docker-swarm/shared-storage-ceph.md | 179 ++++++++++++++++++
 ...d-storage.md => shared-storage-gluster.md} | 6 +
 docs/recipies/docker-mailserver.md | 3 +
 examples/ceph.sh | 67 +++++++
 mkdocs.yml | 3 +-
 7 files changed, 277 insertions(+), 3 deletions(-)
 create mode 100644 docs/ha-docker-swarm/shared-storage-ceph.md
 rename docs/ha-docker-swarm/{shared-storage.md => shared-storage-gluster.md} (94%)
 create mode 100644 docs/recipies/docker-mailserver.md
 create mode 100644 examples/ceph.sh

diff --git a/docs/README.md b/docs/README.md
index 7a06b4e..2d12d8f 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -2,6 +2,10 @@
 ## Structure
 
-1. "Recipies" are sorted by degree of geekiness required to complete them. Relatively straightforward projects are "beginner", more complex projects are "intermediate", and the really fun ones are "advanced".
-2. Each recipe contains enough detail in a single page to take a project to completion.
+1. "Recipies" generally follow on from each other. I.e., if a particular recipe requires a mail server, that mail server would have been described in an earlier recipe.
+2. Each recipe contains enough detail in a single page to take a project from start to completion.
 3. When there are optional add-ons/integrations possible to a project (i.e., the addition of "smart LED bulbs" to Home Assistant), this will be reflected either as a brief "Chef's note" after the recipe, or if they're substantial enough, as a sub-page of the main project
+
+## Conventions
+
+1. When creating swarm networks, we always explicitly set the subnet in the overlay network, to avoid potential conflicts (which docker won't prevent, but which will generate errors) (https://github.com/moby/moby/issues/26912)
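As an illustration of the convention above (the network name and subnet here are hypothetical placeholders - substitute whatever suits your environment), an overlay network gets created with an explicit subnet like so:

```
# Run on a swarm manager node; pick a subnet that doesn't collide with anything else
docker network create \
  --driver overlay \
  --subnet 172.16.200.0/24 \
  traefik_public
```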
diff --git a/docs/ha-docker-swarm/design.md b/docs/ha-docker-swarm/design.md
index 101373e..00196c1 100644
--- a/docs/ha-docker-swarm/design.md
+++ b/docs/ha-docker-swarm/design.md
@@ -24,6 +24,20 @@ This means that:
 * Services are defined using docker-compose v3 YAML syntax
 * Services are portable, meaning a particular stack could be shut down and moved to a new provider with minimal effort.
 
+## Security
+
+Under this design, the only inbound connections we're permitting to our docker swarm are:
+
+### Network Flows
+
+* HTTP (TCP 80) : Redirects to https
+* HTTPS (TCP 443) : Serves individual docker containers via SSL-encrypted reverse proxy
+
+### Authentication
+
+* Where the proxied application provides a trusted level of authentication, or where the application requires public exposure,
+
+
 ## High availability
 
 ### Normal function
diff --git a/docs/ha-docker-swarm/shared-storage-ceph.md b/docs/ha-docker-swarm/shared-storage-ceph.md
new file mode 100644
index 0000000..192a90a
--- /dev/null
+++ b/docs/ha-docker-swarm/shared-storage-ceph.md
@@ -0,0 +1,179 @@
# Introduction

While Docker Swarm is great for keeping containers running (_and restarting those that fail_), it does nothing for persistent storage. This means if you actually want your containers to keep any data persistent across restarts (_hint: you do!_), you need to provide shared storage to every docker node.

## Design

### Why not GlusterFS?

I originally provided shared storage to my nodes using GlusterFS (see the next recipe for details), but found it difficult to deal with because:

1. GlusterFS requires (n) "bricks", where (n) **has** to be a multiple of your replica count. I.e., if you want 2 copies of everything on shared storage (the minimum to provide redundancy), you **must** have either 2, 4, 6 (etc.) bricks. The HA swarm design calls for a minimum of 3 nodes, and so under GlusterFS, my third node can't participate in shared storage at all, unless I start doubling up on bricks-per-node (which then impacts redundancy).
2. GlusterFS turns out to be a giant PITA when you want to restore a failed node. There are [at least 14 steps to follow](https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3/html/Administration_Guide/sect-Replacing_Hosts.html) to replace a brick.
3. I'm pretty sure I messed up the 14-step process above anyway. My replaced brick synced with my "original" brick, but produced errors when querying status via the CLI, and hogged 100% of 1 CPU on the replaced node. Inexperienced with GlusterFS, and unable to diagnose the fault, I switched to a Ceph cluster instead.

### Why Ceph?

1. I'm more familiar with Ceph - I use it in the OpenStack designs I manage.
2. Replacing a failed node is **easy**, provided you can put up with the I/O load of rebalancing OSDs after the replacement.
3. CentOS Atomic includes the ceph client in the OS, so while the Ceph OSD/Mon/MDS daemons run in containers, I can keep an eye on (and later, automatically monitor) the status of Ceph from the base OS.

## Ingredients

!!! summary "Ingredients"
    3 x Virtual Machines (configured earlier), each with:

    * [X] CentOS/Fedora Atomic
    * [X] At least 1GB RAM
    * [X] At least 20GB disk space (_but it'll be tight_)
    * [X] Connectivity to each other within the same subnet, and on a low-latency link (_i.e., no WAN links_)
    * [ ] A second disk dedicated to the Ceph OSD

## Preparation

### SELinux

Since our Ceph components will be containerized, we need to ensure the SELinux context on the base OS's ceph files is set correctly:

```
chcon -Rt svirt_sandbox_file_t /etc/ceph
chcon -Rt svirt_sandbox_file_t /var/lib/ceph
```

### Setup Monitors

Pick a node, and run the following to stand up the first Ceph mon. Be sure to replace the values of **MON_IP** and **CEPH_PUBLIC_NETWORK** with those specific to your deployment:

```
docker run -d --net=host \
--restart always \
-v /etc/ceph:/etc/ceph \
-v /var/lib/ceph/:/var/lib/ceph/ \
-e MON_IP=192.168.31.11 \
-e CEPH_PUBLIC_NETWORK=192.168.31.0/24 \
--name="ceph-mon" \
ceph/daemon mon
```

Now **copy** the contents of /etc/ceph on this first node to the remaining nodes, and **then** run the docker command above on each remaining node. You'll end up with a cluster with 3 monitors (an odd number is required for quorum, just as with Docker Swarm), and no OSDs (yet).
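Any copy mechanism will do. Assuming root SSH access between the nodes (the hostnames below are examples only), scp is the simplest:

```
scp -pr /etc/ceph/* root@ds2:/etc/ceph/
scp -pr /etc/ceph/* root@ds3:/etc/ceph/
```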
### Setup OSDs

Since we currently have an OSD-less, mon-only cluster, prepare for OSD creation by dumping the auth credentials for the OSDs into the appropriate location on the base OS:

```
ceph auth get client.bootstrap-osd -o \
/var/lib/ceph/bootstrap-osd/ceph.keyring
```

On each node, you need a dedicated disk for the OSD. In the example below, I used _/dev/vdd_ (the entire disk, no partitions) for the OSD.

Run the following command on every node:

```
docker run -d --net=host \
--privileged=true \
--pid=host \
-v /etc/ceph:/etc/ceph \
-v /var/lib/ceph/:/var/lib/ceph/ \
-v /dev/:/dev/ \
-e OSD_DEVICE=/dev/vdd \
-e OSD_TYPE=disk \
--name="ceph-osd" \
--restart=always \
ceph/daemon osd
```

Watch the output by running ```docker logs ceph-osd -f```, and confirm success.

!!! note "Zapping the device"
    The Ceph OSD container will refuse to destroy a partition containing existing data, so it may be necessary to "zap" the target disk, using:
    ```
    docker run -d --privileged=true \
    -v /dev/:/dev/ \
    -e OSD_DEVICE=/dev/vdd \
    ceph/daemon zap_device
    ```

### Setup MDSs

In order to mount our ceph pools as filesystems, we'll need Ceph MDSs. Run the following on each node:

```
docker run -d --net=host \
--name ceph-mds \
--restart always \
-v /var/lib/ceph/:/var/lib/ceph/ \
-v /etc/ceph:/etc/ceph \
-e CEPHFS_CREATE=1 \
-e CEPHFS_DATA_POOL_PG=256 \
-e CEPHFS_METADATA_POOL_PG=256 \
ceph/daemon mds
```

### Apply tweaks

The ceph container seems to configure a pool default of 3 replicas (3 copies of each block are retained), which is one too many for our cluster (we are only protecting against the failure of a single node).

Run the following on any node to reduce the size of the pools to 2 replicas:

```
ceph osd pool set cephfs_data size 2
ceph osd pool set cephfs_metadata size 2
```

Disable "scrubbing" (which can be IO-intensive, and is unnecessary on a VM) with:

```
ceph osd set noscrub
ceph osd set nodeep-scrub
```

### Create credentials for swarm

In order to mount the ceph volume onto our base host, we need to provide cephx authentication credentials.

On **one** node, create a client for the docker swarm:

```
ceph auth get-or-create client.dockerswarm osd \
'allow rw' mon 'allow r' mds 'allow' > /etc/ceph/keyring.dockerswarm
```

Grab the secret associated with the new user (you'll need this for the /etc/fstab entry below) by running:

```
ceph-authtool /etc/ceph/keyring.dockerswarm -p -n client.dockerswarm
```

### Mount MDS volume

On each node, create a mountpoint for the data by running ```mkdir /var/data```, add an entry to fstab to ensure the volume is auto-mounted on boot, and ensure the volume is actually _mounted_ in case a network / boot delay prevented access to the ceph cluster:

```
mkdir /var/data

MYHOST=`hostname -s`
echo -e "
# Mount cephfs volume \n
$MYHOST:6789:/ /var/data/ ceph \
name=dockerswarm\
,secret=\
,noatime,_netdev,context=system_u:object_r:svirt_sandbox_file_t:s0 \
0 2" >> /etc/fstab
mount -a
```
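As a quick sanity check (these commands aren't part of the recipe proper, just a way to verify the result), confirm that cephfs is mounted and that the cluster reports a healthy state:

```
df -h /var/data
ceph -s
```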
## Serving

After completing the above, you should have:

```
[X] Persistent storage available to every node
[X] Resiliency in the event of the failure of a single node
```

## Chef's Notes

Future enhancements to this recipe include:

1. Rather than pasting a secret key into /etc/fstab (which feels wrong), I'd prefer to be able to set "secretfile" in /etc/fstab (which just points ceph.mount to a file containing the secret), but under the current CentOS Atomic, we're stuck with "secret", per https://bugzilla.redhat.com/show_bug.cgi?id=1030402
diff --git a/docs/ha-docker-swarm/shared-storage.md b/docs/ha-docker-swarm/shared-storage-gluster.md
similarity index 94%
rename from docs/ha-docker-swarm/shared-storage.md
rename to docs/ha-docker-swarm/shared-storage-gluster.md
index db215bf..a11753e 100644
--- a/docs/ha-docker-swarm/shared-storage.md
+++ b/docs/ha-docker-swarm/shared-storage-gluster.md
@@ -2,6 +2,12 @@
 
 While Docker Swarm is great for keeping containers running (_and restarting those that fail_), it does nothing for persistent storage. This means if you actually want your containers to keep any data persistent across restarts (_hint: you do!_), you need to provide shared storage to every docker node.
 
+## Design
+
+### Why GlusterFS?
+
+This GlusterFS recipe was my original design for shared storage, but I [found it to be flawed](ha-docker-swarm/shared-storage-ceph/#why-not-glusterfs), and I replaced it with a [design which employs Ceph instead](ha-docker-swarm/shared-storage-ceph/#why-ceph). This recipe is an alternative to the Ceph design, if you happen to prefer GlusterFS.
+
 ## Ingredients
 
 !!! summary "Ingredients"
diff --git a/docs/recipies/docker-mailserver.md b/docs/recipies/docker-mailserver.md
new file mode 100644
index 0000000..a00f0cf
--- /dev/null
+++ b/docs/recipies/docker-mailserver.md
@@ -0,0 +1,3 @@
 mkdir {maildata,mailstate,config}

diff --git a/examples/ceph.sh b/examples/ceph.sh
new file mode 100644
index 0000000..1078d55
--- /dev/null
+++ b/examples/ceph.sh
@@ -0,0 +1,67 @@
sudo chcon -Rt svirt_sandbox_file_t /etc/ceph
sudo chcon -Rt svirt_sandbox_file_t /var/lib/ceph

docker run -d --net=host \
--privileged=true \
--pid=host \
-v /etc/ceph:/etc/ceph \
-v /var/lib/ceph/:/var/lib/ceph/ \
-v /dev/:/dev/ \
-e OSD_DEVICE=/dev/vdd \
-e OSD_TYPE=disk \
--name="ceph-osd" \
--restart=always \
ceph/daemon osd


docker run -d --net=host \
--restart always \
-v /etc/ceph:/etc/ceph \
-v /var/lib/ceph/:/var/lib/ceph/ \
-e MON_IP=192.168.31.11 \
-e CEPH_PUBLIC_NETWORK=192.168.31.0/24 \
--name="ceph-mon" \
ceph/daemon mon

On other nodes:

ceph auth get client.bootstrap-osd -o /var/lib/ceph/bootstrap-osd/ceph.keyring


docker run -d --net=host \
--name ceph-mds \
--restart always \
-v /var/lib/ceph/:/var/lib/ceph/ \
-v /etc/ceph:/etc/ceph \
-e CEPHFS_CREATE=0 \
ceph/daemon mds


ceph auth get-or-create client.dockerswarm osd 'allow rw' mon 'allow r' mds 'allow' > /etc/ceph/keyring.dockerswarm

ceph-authtool /etc/ceph/keyring.dockerswarm -p -n client.dockerswarm

Note that the current design seems to provide 3 replicas, which is probably overkill:

[root@ds3 traefik]# ceph osd pool get cephfs_data size
size: 3
[root@ds3 traefik]#


So I set it to 2:

[root@ds3 traefik]# ceph osd pool set cephfs_data size 2
set pool 1 size to 2
[root@ds3 traefik]# ceph osd pool get cephfs_data size
size: 2
[root@ds3 traefik]#

Would like to be able to set secretfile in /etc/fstab, but for now it looks like we're stuck with --secret, per https://bugzilla.redhat.com/show_bug.cgi?id=1030402

Euught. Ceph writes are slow (surprise!)
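A rough way to quantify this might be a direct-I/O dd against the cephfs mount (test file path and size are arbitrary):

dd if=/dev/zero of=/var/data/dd-test bs=1M count=256 oflag=direct
rm /var/data/dd-test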
I disabled scrubbing with:

ceph osd set noscrub
ceph osd set nodeep-scrub
diff --git a/mkdocs.yml b/mkdocs.yml
index 2685e87..d09bd78 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -20,7 +20,8 @@ pages:
 - HA Docker Swarm:
   - Design: ha-docker-swarm/design.md
   - VMs: ha-docker-swarm/vms.md
-  - Shared Storage: ha-docker-swarm/shared-storage.md
+  - Shared Storage (Ceph): ha-docker-swarm/shared-storage-ceph.md
+  - Shared Storage (GlusterFS): ha-docker-swarm/shared-storage-gluster.md
   - Keepalived: ha-docker-swarm/keepalived.md
   - Docker Swarm Mode: ha-docker-swarm/docker-swarm-mode.md
   - Traefik: ha-docker-swarm/traefik.md