mirror of https://github.com/funkypenguin/geek-cookbook/ (synced 2025-12-13 01:36:23 +00:00)

Commit: Updated doc structure (#9)
manuscript/ha-docker-swarm/design.md
# Design
|
||||
|
||||
In the design described below, the "private cloud" platform is:
|
||||
|
||||
* **Highly-available** (_can tolerate the failure of a single component_)
|
||||
* **Scalable** (_can add resource or capacity as required_)
|
||||
* **Portable** (_run it on your garage server today, run it in AWS tomorrow_)
|
||||
* **Secure** (_access protected with LetsEncrypt certificates_)
|
||||
* **Automated** (_requires minimal care and feeding_)
|
||||
|
||||
## Design Decisions
|
||||
|
||||
**Where possible, services will be highly available.**
|
||||
|
||||
This means that:
|
||||
|
||||
* At least 3 docker swarm manager nodes are required, to tolerate the failure of a single node.
* GlusterFS is employed for the shared filesystem, because it too can be made tolerant of a single failure.
|
||||
|
||||
**Where multiple solutions to a requirement exist, preference will be given to the most portable solution.**
|
||||
|
||||
This means that:
|
||||
|
||||
* Services are defined using docker-compose v3 YAML syntax
|
||||
* Services are portable, meaning a particular stack could be shut down and moved to a new provider with minimal effort.
|
||||
|
||||
## Security
|
||||
|
||||
Under this design, the only inbound connections we're permitting to our docker swarm are:
|
||||
|
||||
### Network Flows
|
||||
|
||||
* HTTP (TCP 80) : Redirects to https
|
||||
* HTTPS (TCP 443) : Serves individual docker containers via SSL-encrypted reverse proxy
|
||||
|
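As a sketch only (_assuming the iptables-based firewalling used in the VM recipe later in this section_), the corresponding inbound ruleset might look something like this - adjust interfaces, source ranges, and management access to suit your environment:

```
# Allow return traffic for established connections
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
# Management access (SSH)
-A INPUT -p tcp --dport 22 -j ACCEPT
# The only inbound flows permitted to the swarm
-A INPUT -p tcp --dport 80 -j ACCEPT
-A INPUT -p tcp --dport 443 -j ACCEPT
# Everything else is dropped
-A INPUT -j DROP
```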
||||
### Authentication
|
||||
|
||||
* Where the proxied application provides a trusted level of authentication, or where the application requires public exposure, the application is exposed directly via the SSL-terminated reverse proxy.
|
||||
|
||||
|
||||
## High availability
|
||||
|
||||
### Normal function
|
||||
|
||||
Assuming 3 nodes, under normal circumstances the following is illustrated:
|
||||
|
||||
* All 3 nodes provide shared storage via GlusterFS, which is provided by a docker container on each node (_i.e., not running in swarm mode_).
* All 3 nodes participate in the Docker Swarm as managers.
* The various containers belonging to the application "stacks" deployed within Docker Swarm are automatically distributed amongst the swarm nodes.
* Persistent storage for the containers is provided via a GlusterFS mount.
* The **traefik** service (in swarm mode) receives incoming requests (on HTTP and HTTPS), and forwards them to individual containers. Traefik knows the container names because it's able to access the docker socket.
* All 3 nodes run keepalived, at different priorities. Since traefik is running as a swarm service and listening on TCP 80/443, requests made to the keepalived VIP and arriving at **any** of the swarm nodes will be forwarded to the traefik container (no matter which node it's on), and then on to the target backend.
|
||||
|
||||

|
||||
|
||||
### Node failure
|
||||
|
||||
In the case of a failure (or scheduled maintenance) of one of the nodes, the following is illustrated:
|
||||
|
||||
* The failed node no longer participates in GlusterFS, but the remaining nodes provide enough fault-tolerance for the cluster to operate.
|
||||
* The remaining two nodes in Docker Swarm achieve a quorum and agree that the failed node is to be removed.
|
||||
* The (possibly new) leader manager node reschedules the containers known to be running on the failed node, onto other nodes.
|
||||
* The **traefik** service is either restarted or unaffected, and as the backend containers stop/start and change IP, traefik is aware and updates accordingly.
|
||||
* The keepalived VIP continues to function on the remaining nodes, and docker swarm continues to forward any traffic received on TCP 80/443 to the appropriate node.
|
||||
|
||||

|
||||
|
||||
### Node restore
|
||||
|
||||
When the failed (or upgraded) host is restored to service, the following is illustrated:
|
||||
|
||||
* GlusterFS regains full redundancy
|
||||
* Docker Swarm managers become aware of the recovered node, and will use it for scheduling **new** containers
|
||||
* Existing containers which were migrated off the node are not migrated back
|
||||
* Keepalived VIP regains full redundancy
|
||||
|
||||
|
||||

|
||||
|
||||
### Total cluster failure
|
||||
|
||||
A day after writing this, my environment suffered a fault whereby all 3 VMs were unexpectedly and simultaneously powered off.
|
||||
|
||||
Upon restore, docker failed to start on one of the VMs due to a local disk space issue[^1]. However, the other two VMs started, established the swarm, mounted their shared storage, and started up all the containers (services) which were managed by the swarm.
|
||||
|
||||
In summary, although I suffered an **unplanned power outage to all of my infrastructure**, followed by a **failure of a third of my hosts**... ==all my platforms are 100% available with **absolutely no manual intervention**==.
|
||||
|
||||
[^1]: Since there's no impact to availability, I can fix (or just reinstall) the failed node whenever convenient.
|
||||
manuscript/ha-docker-swarm/docker-swarm-mode.md
# Docker Swarm Mode
|
||||
|
||||
For truly highly-available services with Docker containers, we need an orchestration system. Docker Swarm (as of version 1.13) is the simplest way to achieve redundancy, such that a single docker host can be turned off without interrupting any of our services.
|
||||
|
||||
## Ingredients
|
||||
|
||||
* 3 x CentOS Atomic hosts (bare-metal or VMs). A reasonable minimum would be:
|
||||
* 1 x vCPU
|
||||
* 1GB RAM
|
||||
* 10GB HDD
|
||||
* Hosts must be within the same subnet, and connected on a low-latency link (i.e., no WAN links)
|
||||
|
||||
## Preparation
|
||||
|
||||
### Release the swarm!
|
||||
|
||||
Now, to launch my swarm:
|
||||
|
||||
```docker swarm init```
|
||||
|
||||
Yeah, that was it. Now I have a 1-node swarm.
|
||||
|
||||
```
|
||||
[root@ds1 ~]# docker swarm init
|
||||
Swarm initialized: current node (b54vls3wf8xztwfz79nlkivt8) is now a manager.
|
||||
|
||||
To add a worker to this swarm, run the following command:
|
||||
|
||||
docker swarm join \
|
||||
--token SWMTKN-1-2orjbzjzjvm1bbo736xxmxzwaf4rffxwi0tu3zopal4xk4mja0-bsud7xnvhv4cicwi7l6c9s6l0 \
|
||||
202.170.164.47:2377
|
||||
|
||||
To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
|
||||
|
||||
[root@ds1 ~]#
|
||||
```
|
||||
|
||||
Run ```docker node ls``` to confirm that I have a 1-node swarm:
|
||||
|
||||
```
|
||||
[root@ds1 ~]# docker node ls
|
||||
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
|
||||
b54vls3wf8xztwfz79nlkivt8 * ds1.funkypenguin.co.nz Ready Active Leader
|
||||
[root@ds1 ~]#
|
||||
```
|
||||
|
||||
Note that when I ran ```docker swarm init``` above, the CLI output gave me a command to run to join further nodes to my swarm. This would join the nodes as __workers__ (as opposed to __managers__). Workers can easily be promoted to managers (and demoted again), but since we know that we want our other two nodes to be managers too, it's simpler just to add them to the swarm as managers immediately.
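For reference, promotion and demotion are each a single command, run from any existing manager (_the hostname below is just the example node used earlier_):

```
# Promote a worker to a manager...
docker node promote ds2.funkypenguin.co.nz

# ...or demote a manager back to a worker
docker node demote ds2.funkypenguin.co.nz
```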
|
||||
|
||||
On the first swarm node, generate the necessary token to join another manager by running ```docker swarm join-token manager```:
|
||||
|
||||
```
|
||||
[root@ds1 ~]# docker swarm join-token manager
|
||||
To add a manager to this swarm, run the following command:
|
||||
|
||||
docker swarm join \
|
||||
--token SWMTKN-1-2orjbzjzjvm1bbo736xxmxzwaf4rffxwi0tu3zopal4xk4mja0-cfm24bq2zvfkcwujwlp5zqxta \
|
||||
202.170.164.47:2377
|
||||
|
||||
[root@ds1 ~]#
|
||||
```
|
||||
|
||||
Run the command provided on your second node to join it to the swarm as a manager. After adding the second node, the output of ```docker node ls``` (on either host) should reflect two nodes:
|
||||
|
||||
|
||||
````
|
||||
[root@ds2 davidy]# docker node ls
|
||||
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
|
||||
b54vls3wf8xztwfz79nlkivt8 ds1.funkypenguin.co.nz Ready Active Leader
|
||||
xmw49jt5a1j87a6ihul76gbgy * ds2.funkypenguin.co.nz Ready Active Reachable
|
||||
[root@ds2 davidy]#
|
||||
````
|
||||
|
||||
Repeat the process to add your third node. **You need a new token for the third node, don't re-use the manager token you generated for the second node**.
|
||||
|
||||
!!! warning "Seriously. Don't use a token more than once, else it's swarm-rebuilding time."
|
||||
|
||||
Finally, ```docker node ls``` should reflect that you have 3 reachable manager nodes, one of whom is the "Leader":
|
||||
|
||||
```
|
||||
[root@ds3 ~]# docker node ls
|
||||
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
|
||||
36b4twca7i3hkb7qr77i0pr9i ds1.openstack.dev.safenz.net Ready Active Reachable
|
||||
l14rfzazbmibh1p9wcoivkv1s * ds3.openstack.dev.safenz.net Ready Active Reachable
|
||||
tfsgxmu7q23nuo51wwa4ycpsj ds2.openstack.dev.safenz.net Ready Active Leader
|
||||
[root@ds3 ~]#
|
||||
```
|
||||
|
||||
### Create registry mirror
|
||||
|
||||
Although we now have shared storage for our persistent container data, our docker nodes don't share any other docker data, such as container images. This results in an inefficiency - every node which participates in the swarm will, at some point, need the docker image for every container deployed in the swarm.
|
||||
|
||||
When dealing with large containers (looking at you, GitLab!), this can result in several gigabytes of wasted bandwidth per node, and long delays when restarting containers on an alternate node. (_It also wastes disk space on each node, but we'll get to that in the next section._)

The solution is to run an official Docker registry container as a ["pull-through" cache, or "registry mirror"](https://docs.docker.com/registry/recipes/mirror/). By using our persistent storage for the registry cache, we ensure we keep a single copy of every image we've pulled at least once. After the first pull, any subsequent pulls from our nodes will use the cached version from our registry mirror. As a result, services are available more quickly when restarting container nodes, and we can be more aggressive about cleaning up unused containers and images on our nodes (_more on this later_).
|
||||
|
||||
The registry mirror runs as a swarm stack, using a simple docker-compose.yml. Customize __your mirror FQDN__ below, so that Traefik will generate the appropriate LetsEncrypt certificates for it, and make it available via HTTPS.
|
||||
|
||||
```
|
||||
version: "3"
|
||||
|
||||
services:
|
||||
|
||||
registry-mirror:
|
||||
image: registry:2
|
||||
networks:
|
||||
- traefik
|
||||
deploy:
|
||||
labels:
|
||||
- traefik.frontend.rule=Host:<your mirror FQDN>
|
||||
- traefik.docker.network=traefik
|
||||
- traefik.port=5000
|
||||
ports:
|
||||
- 5000:5000
|
||||
volumes:
|
||||
- /var/data/registry/registry-mirror-data:/var/lib/registry
|
||||
- /var/data/registry/registry-mirror-config.yml:/etc/docker/registry/config.yml
|
||||
|
||||
networks:
|
||||
traefik:
|
||||
external: true
|
||||
```
|
||||
|
||||
!!! note "Unencrypted registry"
|
||||
We create this registry without consideration for SSL, which will fail if we attempt to use the registry directly. However, we're going to use the HTTPS-proxied version via Traefik, leveraging Traefik to manage the LetsEncrypt certificates required.
|
||||
|
||||
|
||||
Create /var/data/registry/registry-mirror-config.yml (_the config file mounted into the container above_) as follows:
|
||||
```
|
||||
version: 0.1
|
||||
log:
|
||||
fields:
|
||||
service: registry
|
||||
storage:
|
||||
cache:
|
||||
blobdescriptor: inmemory
|
||||
filesystem:
|
||||
rootdirectory: /var/lib/registry
|
||||
delete:
|
||||
enabled: true
|
||||
http:
|
||||
addr: :5000
|
||||
headers:
|
||||
X-Content-Type-Options: [nosniff]
|
||||
health:
|
||||
storagedriver:
|
||||
enabled: true
|
||||
interval: 10s
|
||||
threshold: 3
|
||||
proxy:
|
||||
remoteurl: https://registry-1.docker.io
|
||||
```
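With both files in place, deploy the stack. This is a sketch which assumes you've saved the docker-compose.yml above as /var/data/registry/registry-mirror.yml - adjust the path and stack name to suit:

```
docker stack deploy registry-mirror -c /var/data/registry/registry-mirror.yml

# Confirm the service is running
docker stack ps registry-mirror
```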
|
||||
|
||||
### Enable registry mirror and experimental features
|
||||
|
||||
To tell docker to use the registry mirror, and in order to be able to watch the logs of any service from any manager node (_an experimental feature in the current Atomic docker build_), edit **/etc/docker-latest/daemon.json** on each node, and change from:
|
||||
|
||||
```
|
||||
{
|
||||
"log-driver": "journald",
|
||||
"signature-verification": false
|
||||
}
|
||||
```
|
||||
|
||||
To:
|
||||
|
||||
```
|
||||
{
|
||||
"log-driver": "journald",
|
||||
"signature-verification": false,
|
||||
"experimental": true,
|
||||
"registry-mirrors": ["https://<your registry mirror FQDN>"]
|
||||
}
|
||||
```
|
||||
|
||||
!!! tip ""
|
||||
Note the extra comma required after "false" above
|
||||
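Docker only reads daemon.json at startup, so after editing it, restart docker on each node and confirm the mirror was picked up (_a sketch, assuming the docker-latest service used in this design_):

```
systemctl restart docker-latest

# "Registry Mirrors" should now list your mirror FQDN
docker info | grep -A1 "Registry Mirrors"
```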
|
||||
### Setup automated cleanup
|
||||
|
||||
Unused images and exited containers will otherwise accumulate on every node. For now, run the following cleanup container on each node. (_Ideally this would be a docker-compose.yml swarm stack, excluding trusted images like glusterfs and traefik - a sketch of such a stack follows the command below._)
|
||||
```
|
||||
docker run -d \
|
||||
-v /var/run/docker.sock:/var/run/docker.sock:rw \
|
||||
-v /var/lib/docker:/var/lib/docker:rw \
|
||||
meltwater/docker-cleanup:latest
|
||||
```
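A sketch of what that docker-compose.yml might eventually look like, deployed globally as a swarm stack. (_This assumes the meltwater/docker-cleanup image's KEEP_IMAGES variable for excluding trusted images - check the image's documentation before relying on it._)

```
version: "3"

services:
  docker-cleanup:
    image: meltwater/docker-cleanup:latest
    environment:
      # Assumed exclusion list - images we never want cleaned up
      - KEEP_IMAGES=gluster,traefik,registry
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:rw
      - /var/lib/docker:/var/lib/docker:rw
    deploy:
      # One cleanup container per node
      mode: global
```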
|
||||
|
||||
### Tweaks
|
||||
|
||||
Add some handy bash auto-completion for docker. Without this, you'll get annoyed that you can't autocomplete ```docker stack deploy <blah> -c <blah.yml>``` commands.
|
||||
|
||||
```
|
||||
cd /etc/bash_completion.d/
|
||||
curl -O https://raw.githubusercontent.com/docker/cli/b75596e1e4d5295ac69b9934d1bd8aff691a0de8/contrib/completion/bash/docker
|
||||
```
|
||||
|
||||
Install some useful bash aliases on each host
|
||||
```
|
||||
cd ~
|
||||
curl -O https://gitlab.funkypenguin.co.nz/funkypenguin/geeks-cookbook-recipies/raw/master/bash/gcb-aliases.sh
|
||||
echo 'source ~/gcb-aliases.sh' >> ~/.bash_profile
|
||||
```
manuscript/ha-docker-swarm/keepalived.md
# Keepalived
|
||||
|
||||
While having a self-healing, scalable docker swarm is great for availability and scalability, none of that is any good if nobody can connect to your cluster.
|
||||
|
||||
In order to provide seamless external access to clustered resources, regardless of which node they're on and tolerant of node failure, you need to present a single IP to the world for external access.
|
||||
|
||||
Normally this is done using an HA load-balancer, but since Docker Swarm already provides load-balancing capabilities (the routing mesh), all we need for seamless HA is a virtual IP which can be provided by more than one docker node.
|
||||
|
||||
This is accomplished with the use of keepalived on at least two nodes.
|
||||
|
||||
## Ingredients
|
||||
|
||||
```
|
||||
Already deployed:
|
||||
[X] At least 2 x CentOS/Fedora Atomic VMs
|
||||
[X] low-latency link (i.e., no WAN links)
|
||||
|
||||
New:
|
||||
[ ] 3 x IPv4 addresses (one for each node and one for the virtual IP)
|
||||
```
|
||||
|
||||
## Preparation
|
||||
|
||||
### Enable IPVS module
|
||||
|
||||
On all nodes which will participate in keepalived, we need the "ip_vs" kernel module, in order to permit services to bind to non-local interface addresses.

Set this up on both the primary and secondary nodes by running:
|
||||
|
||||
```
|
||||
echo "modprobe ip_vs" >> /etc/rc.local
|
||||
modprobe ip_vs
|
||||
```
|
||||
|
||||
### Setup nodes
|
||||
|
||||
Assuming your IPs are as follows:
|
||||
|
||||
* 192.168.4.1 : Primary
|
||||
* 192.168.4.2 : Secondary
|
||||
* 192.168.4.3 : Virtual
|
||||
|
||||
Run the following on the primary
|
||||
```
|
||||
docker run -d --name keepalived --restart=always \
|
||||
--cap-add=NET_ADMIN --net=host \
|
||||
-e KEEPALIVED_UNICAST_PEERS="#PYTHON2BASH:['192.168.4.1', '192.168.4.2']" \
|
||||
-e KEEPALIVED_VIRTUAL_IPS=192.168.4.3 \
|
||||
-e KEEPALIVED_PRIORITY=200 \
|
||||
osixia/keepalived:1.3.5
|
||||
```
|
||||
|
||||
And on the secondary:
|
||||
```
|
||||
docker run -d --name keepalived --restart=always \
|
||||
--cap-add=NET_ADMIN --net=host \
|
||||
-e KEEPALIVED_UNICAST_PEERS="#PYTHON2BASH:['192.168.4.1', '192.168.4.2']" \
|
||||
-e KEEPALIVED_VIRTUAL_IPS=192.168.4.3 \
|
||||
-e KEEPALIVED_PRIORITY=100 \
|
||||
osixia/keepalived:1.3.5
|
||||
```
|
||||
|
||||
## Serving
|
||||
|
||||
That's it. Each node will talk to the other via unicast (no need to un-firewall multicast addresses), and the node with the highest priority gets to be the master. When ingress traffic arrives on the master node via the VIP, docker's routing mesh will deliver it to the appropriate docker node.
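To confirm which node currently holds the VIP, check the interfaces on each node - only the current master should show the virtual address. (_A sketch, using the example addressing above._)

```
# Only the keepalived master will show the VIP
ip addr | grep 192.168.4.3

# The keepalived container logs also show MASTER/BACKUP state transitions
docker logs keepalived
```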
|
||||
|
||||
## Chef's notes
|
||||
|
||||
1. Some hosting platforms (OpenStack, for one) won't allow you to simply "claim" a virtual IP. Each node is only able to receive traffic targeted to its unique IP. In this case, keepalived is not the right solution, and a platform-specific load-balancing solution should be used. In OpenStack, this is Neutron's "Load Balancer As A Service" (LBaaS) component. AWS and Azure would likely include similar protections.
|
||||
2. More than 2 nodes can participate in keepalived. Simply ensure that each node has the appropriate priority set, and the node with the highest priority will become the master.
|
||||
manuscript/ha-docker-swarm/maintenance.md
# Introduction
|
||||
|
||||
## Adding a host
|
||||
|
||||
## Adding storage
|
||||
|
||||
To expand a gluster volume, add another brick by running:

```
gluster volume add-brick VOLNAME NEW_BRICK
```

Example:

```
# gluster volume add-brick test-volume server4:/exp4
Add Brick successful
```
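For distributed (or distributed-replicated) volumes, it may also be necessary to rebalance data across the bricks after adding one - a sketch:

```
gluster volume rebalance test-volume start

# Check progress with:
gluster volume rebalance test-volume status
```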
|
||||
|
||||
## Replacing a failed host

This follows https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3/html/Administration_Guide/sect-Replacing_Hosts.html

Check peer status from the surviving node - the failed peer shows as disconnected:

```
[root@glusterfs-server /]# gluster peer status
Number of Peers: 1

Hostname: ds1
Uuid: db9c80da-11e4-461d-8ea5-66dd12ca897c
State: Peer in Cluster (Disconnected)
[root@glusterfs-server /]#
```

Grab the UUID from the output above.

On the replacement host, edit /var/lib/glusterd/glusterd.info, and change the UUID to that of the failed peer, i.e. from:

```
UUID=aee45c2c-aa19-4d29-bc94-4833f2b22863
```

to:

```
UUID=db9c80da-11e4-461d-8ea5-66dd12ca897c
```

Get my peer's UUID (ds2 in my case):

```
[root@glusterfs-server /]# gluster system:: uuid get
UUID: 38ca4e8b-8ef5-4165-9f41-5c8b3f0103cc
[root@glusterfs-server /]#
```

Then create a peer file for it (```vi /var/lib/glusterd/peers/38ca4e8b-8ef5-4165-9f41-5c8b3f0103cc```), containing:

```
UUID=38ca4e8b-8ef5-4165-9f41-5c8b3f0103cc
state=3
hostname=ds3
```

Get the volume info:

```
[root@glusterfs-server /]# gluster volume info

Volume Name: gv0
Type: Replicate
Volume ID: 84e1169c-41dc-467a-9ae1-a474efaf789f
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: ds1:/var/no-direct-write-here/brick1/gv0
Brick2: ds3:/var/no-direct-write-here/brick1/gv0
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
[root@glusterfs-server /]#
```

Retrieve the extended attributes (including the volume-id) from the brick on the surviving node:

```
[root@glusterfs-server /]# getfattr -d -m. -ehex /var/no-direct-write-here/brick1/gv0/
getfattr: Removing leading '/' from absolute path names
# file: var/no-direct-write-here/brick1/gv0/
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.volume-id=0x84e1169c41dc467a9ae1a474efaf789f

[root@glusterfs-server /]#
```

Finally, apply the volume-id to the replacement brick:

```
setfattr -n trusted.glusterfs.volume-id -v 0x84e1169c41dc467a9ae1a474efaf789f /var/no-direct-write-here/brick1/gv0
```
|
||||
manuscript/ha-docker-swarm/shared-storage-ceph.md
# Shared Storage (Ceph)
|
||||
|
||||
While Docker Swarm is great for keeping containers running (_and restarting those that fail_), it does nothing for persistent storage. This means if you actually want your containers to keep any data persistent across restarts (_hint: you do!_), you need to provide shared storage to every docker node.
|
||||
|
||||
## Design
|
||||
|
||||
### Why not GlusterFS?
|
||||
I originally provided shared storage to my nodes using GlusterFS (see the next recipe for details), but found it difficult to deal with because:
|
||||
|
||||
1. GlusterFS requires (n) "bricks", where (n) **has** to be a multiple of your replica count. I.e., if you want 2 copies of everything on shared storage (the minimum to provide redundancy), you **must** have either 2, 4, 6 (etc.) bricks. The HA swarm design calls for a minimum of 3 nodes, and so under GlusterFS, my third node can't participate in shared storage at all, unless I start doubling up on bricks-per-node (which then impacts redundancy)
|
||||
2. GlusterFS turns out to be a giant PITA when you want to restore a failed node. There are at [least 14 steps to follow](https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3/html/Administration_Guide/sect-Replacing_Hosts.html) to replace a brick.
|
||||
3. I'm pretty sure I messed up the 14-step process above anyway. My replaced brick synced with my "original" brick, but produced errors when querying status via the CLI, and hogged 100% of 1 CPU on the replaced node. Inexperienced with GlusterFS, and unable to diagnose the fault, I switched to a Ceph cluster instead.
|
||||
|
||||
### Why Ceph?
|
||||
|
||||
1. I'm more familiar with Ceph - I use it in the OpenStack designs I manage
|
||||
2. Replacing a failed node is **easy**, provided you can put up with the I/O load of rebalancing OSDs after the replacement.
|
||||
3. CentOS Atomic includes the ceph client in the OS, so while the Ceph OSD/Mon/MDS daemons run in containers, I can keep an eye on (and later, automatically monitor) the status of Ceph from the base OS.
|
||||
|
||||
## Ingredients
|
||||
|
||||
!!! summary "Ingredients"
|
||||
3 x Virtual Machines (configured earlier), each with:
|
||||
|
||||
* [X] CentOS/Fedora Atomic
|
||||
* [X] At least 1GB RAM
|
||||
* [X] At least 20GB disk space (_but it'll be tight_)
|
||||
* [X] Connectivity to each other within the same subnet, and on a low-latency link (_i.e., no WAN links_)
|
||||
* [ ] A second disk dedicated to the Ceph OSD
|
||||
|
||||
## Preparation
|
||||
|
||||
### SELinux
|
||||
|
||||
Since our Ceph components will be containerized, we need to ensure the SELinux context on the base OS's ceph files is set correctly:
|
||||
|
||||
```
|
||||
chcon -Rt svirt_sandbox_file_t /etc/ceph
|
||||
chcon -Rt svirt_sandbox_file_t /var/lib/ceph
|
||||
```
|
||||
### Setup Monitors
|
||||
|
||||
Pick a node, and run the following to stand up the first Ceph mon. Be sure to replace the values for **MON_IP** and **CEPH_PUBLIC_NETWORK** to those specific to your deployment:
|
||||
|
||||
```
|
||||
docker run -d --net=host \
|
||||
--restart always \
|
||||
-v /etc/ceph:/etc/ceph \
|
||||
-v /var/lib/ceph/:/var/lib/ceph/ \
|
||||
-e MON_IP=192.168.31.11 \
|
||||
-e CEPH_PUBLIC_NETWORK=192.168.31.0/24 \
|
||||
--name="ceph-mon" \
|
||||
ceph/daemon mon
|
||||
```
|
||||
|
||||
Now **copy** the contents of /etc/ceph on this first node to the remaining nodes, and **then** run the docker command above (_customizing MON_IP as you go_) on each remaining node. You'll end up with a cluster with 3 monitors (odd number is required for quorum, same as Docker Swarm), and no OSDs (yet)
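To confirm the three monitors have formed a quorum, check the cluster status from any node - a sketch (_since Atomic includes the ceph client, this can be run on the base OS, or alternatively inside the mon container_):

```
# From the base OS...
ceph -s

# ...or from within the mon container
docker exec ceph-mon ceph -s
```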
|
||||
|
||||
|
||||
### Setup OSDs
|
||||
|
||||
Since we currently have an OSD-less, mon-only cluster, prepare for OSD creation by dumping the auth credentials for the OSDs into the appropriate location on the base OS:
|
||||
|
||||
```
|
||||
ceph auth get client.bootstrap-osd -o \
|
||||
/var/lib/ceph/bootstrap-osd/ceph.keyring
|
||||
```
|
||||
|
||||
On each node, you need a dedicated disk for the OSD. In the example below, I used _/dev/vdd_ (the entire disk, no partitions) for the OSD.
|
||||
|
||||
Run the following command on every node:
|
||||
|
||||
```
|
||||
docker run -d --net=host \
|
||||
--privileged=true \
|
||||
--pid=host \
|
||||
-v /etc/ceph:/etc/ceph \
|
||||
-v /var/lib/ceph/:/var/lib/ceph/ \
|
||||
-v /dev/:/dev/ \
|
||||
-e OSD_DEVICE=/dev/vdd \
|
||||
-e OSD_TYPE=disk \
|
||||
--name="ceph-osd" \
|
||||
--restart=always \
|
||||
ceph/daemon osd
|
||||
```
|
||||
|
||||
Watch the output by running ```docker logs ceph-osd -f```, and confirm success.
|
||||
|
||||
!!! note "Zapping the device"
|
||||
The Ceph OSD container will refuse to destroy a partition containing existing data, so it may be necessary to "zap" the target disk, using:
|
||||
```
|
||||
docker run -d --privileged=true \
|
||||
-v /dev/:/dev/ \
|
||||
-e OSD_DEVICE=/dev/sdd \
|
||||
ceph/daemon zap_device
|
||||
```
|
||||
|
||||
### Setup MDSs
|
||||
|
||||
In order to mount our ceph pools as filesystems, we'll need Ceph MDS(s). Run the following on each node:
|
||||
|
||||
```
|
||||
docker run -d --net=host \
|
||||
--name ceph-mds \
|
||||
--restart always \
|
||||
-v /var/lib/ceph/:/var/lib/ceph/ \
|
||||
-v /etc/ceph:/etc/ceph \
|
||||
-e CEPHFS_CREATE=1 \
|
||||
-e CEPHFS_DATA_POOL_PG=256 \
|
||||
-e CEPHFS_METADATA_POOL_PG=256 \
|
||||
ceph/daemon mds
|
||||
```
|
||||
### Apply tweaks
|
||||
|
||||
The ceph container seems to configure a pool default of 3 replicas (3 copies of each block are retained), which is one too many for our cluster (we are only protecting against the failure of a single node).
|
||||
|
||||
Run the following on any node to reduce the size of the pool to 2 replicas:
|
||||
|
||||
```
|
||||
ceph osd pool set cephfs_data size 2
|
||||
ceph osd pool set cephfs_metadata size 2
|
||||
```
|
||||
|
||||
Disabled "scrubbing" (which can be IO-intensive, and is unnecessary on a VM) with:
|
||||
|
||||
```
|
||||
ceph osd set noscrub
|
||||
ceph osd set nodeep-scrub
|
||||
```
|
||||
|
||||
|
||||
### Create credentials for swarm
|
||||
|
||||
In order to mount the ceph volume onto our base host, we need to provide cephx authentication credentials.
|
||||
|
||||
On **one** node, create a client for the docker swarm:
|
||||
|
||||
```
|
||||
ceph auth get-or-create client.dockerswarm osd \
|
||||
'allow rw' mon 'allow r' mds 'allow' > /etc/ceph/keyring.dockerswarm
|
||||
```
|
||||
|
||||
Grab the secret associated with the new user (you'll need this for the /etc/fstab entry below) by running:
|
||||
|
||||
```
|
||||
ceph-authtool /etc/ceph/keyring.dockerswarm -p -n client.dockerswarm
|
||||
```
|
||||
|
||||
### Mount MDS volume
|
||||
|
||||
On each node, create a mountpoint for the data by running ```mkdir /var/data```, add an entry to fstab to ensure the volume is auto-mounted on boot, and ensure the volume is actually _mounted_ if there's a network / boot delay getting access to the ceph volume:
|
||||
|
||||
```
|
||||
mkdir /var/data
|
||||
|
||||
MYHOST=`hostname -s`
|
||||
echo -e "
|
||||
# Mount cephfs volume \n
|
||||
$MYHOST:6789:/ /var/data/ ceph \
|
||||
name=dockerswarm\
|
||||
,secret=<YOUR SECRET HERE>\
|
||||
,noatime,_netdev,context=system_u:object_r:svirt_sandbox_file_t:s0\
|
||||
0 2" >> /etc/fstab
|
||||
mount -a
|
||||
```
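To confirm the cephfs volume actually mounted, something like the following should show /var/data as a ceph mount:

```
df -h /var/data
mount | grep /var/data
```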
|
||||
### Install docker-volume plugin
|
||||
|
||||
An upstream bug for docker-latest is reported at https://bugs.centos.org/view.php?id=13609, and the related alpine fault at https://github.com/gliderlabs/docker-alpine/issues/317
|
||||
|
||||
|
||||
## Serving
|
||||
|
||||
After completing the above, you should have:
|
||||
|
||||
```
|
||||
[X] Persistent storage available to every node
|
||||
[X] Resiliency in the event of the failure of a single node
|
||||
```
|
||||
|
||||
## Chef's Notes
|
||||
|
||||
Future enhancements to this recipe include:
|
||||
|
||||
1. Rather than pasting a secret key into /etc/fstab (which feels wrong), I'd prefer to be able to set "secretfile" in /etc/fstab (which just points ceph.mount to a file containing the secret), but under the current CentOS Atomic, we're stuck with "secret", per https://bugzilla.redhat.com/show_bug.cgi?id=1030402
|
||||
manuscript/ha-docker-swarm/shared-storage-gluster.md
# Shared Storage (GlusterFS)
|
||||
|
||||
While Docker Swarm is great for keeping containers running (_and restarting those that fail_), it does nothing for persistent storage. This means if you actually want your containers to keep any data persistent across restarts (_hint: you do!_), you need to provide shared storage to every docker node.
|
||||
|
||||
## Design
|
||||
|
||||
### Why GlusterFS?
|
||||
|
||||
This GlusterFS recipe was my original design for shared storage, but I [found it to be flawed](ha-docker-swarm/shared-storage-ceph/#why-not-glusterfs), and I replaced it with a [design which employs Ceph instead](ha-docker-swarm/shared-storage-ceph/#why-ceph). This recipe is an alternative to the Ceph design, if you happen to prefer GlusterFS.
|
||||
|
||||
## Ingredients
|
||||
|
||||
!!! summary "Ingredients"
|
||||
3 x Virtual Machines (configured earlier), each with:
|
||||
|
||||
* [X] CentOS/Fedora Atomic
|
||||
* [X] At least 1GB RAM
|
||||
* [X] At least 20GB disk space (_but it'll be tight_)
|
||||
* [X] Connectivity to each other within the same subnet, and on a low-latency link (_i.e., no WAN links_)
|
||||
* [ ] A second disk, or adequate space on the primary disk for a dedicated data partition
|
||||
|
||||
## Preparation
|
||||
|
||||
### Create Gluster "bricks"
|
||||
|
||||
To build our Gluster volume, we need 2 out of the 3 VMs to provide one "brick". The bricks will be used to create the replicated volume. Assuming a replica count of 2 (_i.e., 2 copies of the data are kept in gluster_), our total number of bricks must be divisible by our replica count. (_I.e., you can't have 3 bricks if you want 2 replicas. You can have 4 though - We have to have minimum 3 swarm manager nodes for fault-tolerance, but only 2 of those nodes need to run as gluster servers._)
|
||||
|
||||
On each host, run a variation of the following to create your bricks, adjusted for the path to your disk.
|
||||
|
||||
!!! note "The example below assumes /dev/vdb is dedicated to the gluster volume"
|
||||
```
|
||||
(
|
||||
echo o # Create a new empty DOS partition table
|
||||
echo n # Add a new partition
|
||||
echo p # Primary partition
|
||||
echo 1 # Partition number
|
||||
echo # First sector (Accept default: 1)
|
||||
echo # Last sector (Accept default: varies)
|
||||
echo w # Write changes
|
||||
) | sudo fdisk /dev/vdb
|
||||
|
||||
mkfs.xfs -i size=512 /dev/vdb1
|
||||
mkdir -p /var/no-direct-write-here/brick1
|
||||
echo '' >> /etc/fstab
|
||||
echo '# Mount /dev/vdb1 so that it can be used as a glusterfs volume' >> /etc/fstab
|
||||
echo '/dev/vdb1 /var/no-direct-write-here/brick1 xfs defaults 1 2' >> /etc/fstab
|
||||
mount -a && mount
|
||||
```
|
||||
|
||||
!!! warning "Don't provision all your LVM space"
|
||||
Atomic uses LVM to store docker data, and **automatically grows** Docker's volumes as required. If you commit all your free LVM space to your brick, you'll quickly find (as I did) that docker will start to fail with error messages about insufficient space. If you're going to slice off a portion of your LVM space in /dev/atomicos, make sure you leave enough space for Docker storage, where "enough" depends on how much you plan to pull images, make volumes, etc. I ate through 20GB very quickly doing development, so I ended up provisioning 50GB for atomic alone, with a separate volume for the brick.
|
||||
|
||||
### Create glusterfs container
|
||||
|
||||
Atomic doesn't include the Gluster server components. This means we'll have to run glusterd from within a container, with privileged access to the host. Although convoluted, I've come to prefer this design since it once again makes the OS "disposable", moving all the config into containers and code.
|
||||
|
||||
Run the following on each host:
|
||||
````
|
||||
docker run \
|
||||
-h glusterfs-server \
|
||||
-v /etc/glusterfs:/etc/glusterfs:z \
|
||||
-v /var/lib/glusterd:/var/lib/glusterd:z \
|
||||
-v /var/log/glusterfs:/var/log/glusterfs:z \
|
||||
-v /sys/fs/cgroup:/sys/fs/cgroup:ro \
|
||||
-v /var/no-direct-write-here/brick1:/var/no-direct-write-here/brick1 \
|
||||
-d --privileged=true --net=host \
|
||||
--restart=always \
|
||||
--name="glusterfs-server" \
|
||||
gluster/gluster-centos
|
||||
````
|
||||
### Create trusted pool
|
||||
|
||||
On a single node (doesn't matter which), run ```docker exec -it glusterfs-server bash``` to launch a shell inside the container.
|
||||
|
||||
From the node, run
|
||||
```gluster peer probe <other host>```
|
||||
|
||||
Example output:
|
||||
```
|
||||
[root@glusterfs-server /]# gluster peer probe ds1
|
||||
peer probe: success.
|
||||
[root@glusterfs-server /]#
|
||||
```
|
||||
|
||||
Run ```gluster peer status``` on both nodes to confirm that they're properly connected to each other:
|
||||
|
||||
Example output:
|
||||
```
|
||||
[root@glusterfs-server /]# gluster peer status
|
||||
Number of Peers: 1
|
||||
|
||||
Hostname: ds3
|
||||
Uuid: 3e115ba9-6a4f-48dd-87d7-e843170ff499
|
||||
State: Peer in Cluster (Connected)
|
||||
[root@glusterfs-server /]#
|
||||
```
|
||||
|
||||
### Create gluster volume
|
||||
|
||||
Now we create a *replicated volume* out of our individual "bricks".
|
||||
|
||||
Create the gluster volume by running
|
||||
```
|
||||
gluster volume create gv0 replica 2 \
|
||||
server1:/var/no-direct-write-here/brick1 \
|
||||
server2:/var/no-direct-write-here/brick1
|
||||
```
|
||||
|
||||
Example output:
|
||||
```
|
||||
[root@glusterfs-server /]# gluster volume create gv0 replica 2 ds1:/var/no-direct-write-here/brick1/gv0 ds3:/var/no-direct-write-here/brick1/gv0
|
||||
volume create: gv0: success: please start the volume to access data
|
||||
[root@glusterfs-server /]#
|
||||
```
|
||||
|
||||
Start the volume by running ```gluster volume start gv0```
|
||||
|
||||
```
|
||||
[root@glusterfs-server /]# gluster volume start gv0
|
||||
volume start: gv0: success
|
||||
[root@glusterfs-server /]#
|
||||
```
|
||||
|
||||
The volume is only present on the host you're shelled into though. To add the other hosts to the volume, run ```gluster peer probe <servername>```. Don't probe a host from itself.
|
||||
|
||||
From one other host, run ```docker exec -it glusterfs-server bash``` to shell into the gluster-server container, and run ```gluster peer probe <original server name>``` to update the name of the host which started the volume.
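At this point, a quick sanity-check from within any glusterfs-server container should show the volume started, with both bricks online:

```
gluster volume status gv0
gluster volume info gv0
```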
|
||||
|
||||
### Mount gluster volume
|
||||
|
||||
On the host (i.e., outside of the container - type ```exit``` if you're still shelled in), create a mountpoint for the data, by running ```mkdir /var/data```, add an entry to fstab to ensure the volume is auto-mounted on boot, and ensure the volume is actually _mounted_ if there's a network / boot delay getting access to the gluster volume:
|
||||
|
||||
```
|
||||
mkdir /var/data
|
||||
MYHOST=`hostname -s`
|
||||
echo '' >> /etc/fstab
|
||||
echo '# Mount glusterfs volume' >> /etc/fstab
|
||||
echo "$MYHOST:/gv0 /var/data glusterfs defaults,_netdev,context="system_u:object_r:svirt_sandbox_file_t:s0" 0 0" >> /etc/fstab
|
||||
mount -a
|
||||
```
|
||||
|
||||
For some reason, my nodes won't auto-mount this volume on boot. I even tried the trickery below, but they stubbornly refuse to automount.
|
||||
```
|
||||
echo -e "\n\n# Give GlusterFS 10s to start before \
|
||||
mounting\nsleep 10s && mount -a" >> /etc/rc.local
|
||||
systemctl enable rc-local.service
|
||||
```
|
||||
|
||||
For non-gluster nodes, you'll need to replace $MYHOST above with the name of one of the gluster hosts (I haven't worked out how to make this fully HA yet)
|
||||
|
||||
## Serving
|
||||
|
||||
After completing the above, you should have:
|
||||
|
||||
```
|
||||
[X] Persistent storage available to every node
|
||||
[X] Resiliency in the event of the failure of a single (gluster) node
|
||||
```
|
||||
|
||||
## Chef's Notes
|
||||
|
||||
Future enhancements to this recipe include:
|
||||
|
||||
1. Migration of shared storage from GlusterFS to Ceph ([#2](https://gitlab.funkypenguin.co.nz/funkypenguin/geeks-cookbook/issues/2))
2. Correct the fact that volumes don't automount on boot ([#3](https://gitlab.funkypenguin.co.nz/funkypenguin/geeks-cookbook/issues/3))
|
||||
manuscript/ha-docker-swarm/traefik.md
# Traefik
|
||||
|
||||
The platforms we plan to run on our cloud are generally web-based, and each listening on their own unique TCP port. When a container in a swarm exposes a port, then connecting to **any** swarm member on that port will result in your request being forwarded to the appropriate host running the container. (_Docker calls this the swarm "[routing mesh](https://docs.docker.com/engine/swarm/ingress/)"_)
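To see the routing mesh in action, you can publish a throwaway test service on a port, and query it via any node. (_A sketch - the whoami image is just an example, substitute any small web container, and your own hostnames._)

```
# Create a single-replica test service, published on port 8000 of every node
docker service create --name whoami --publish 8000:80 emilevauge/whoami

# Requests to ANY swarm node on port 8000 reach the container, wherever it runs
curl http://ds1.funkypenguin.co.nz:8000
curl http://ds2.funkypenguin.co.nz:8000

# Clean up
docker service rm whoami
```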
|
||||
|
||||
So we get a rudimentary load balancer built into swarm. We could stop there, just exposing a series of ports on our hosts, and making them HA using keepalived.
|
||||
|
||||
There are some gaps to this approach though:
|
||||
|
||||
- No consideration is given to HTTPS. Implementation would have to be done manually, per-container.
|
||||
- No mechanism is provided for authentication outside of that which the container itself provides. We may not **want** to expose every interface on every container to the world, especially if we are playing with tools or containers whose quality and origin are unknown.
|
||||
|
||||
To deal with these gaps, we need a front-end load-balancer, and in this design, that role is provided by [Traefik](https://traefik.io/).
|
||||
|
||||
## Ingredients
|
||||
|
||||
## Preparation
|
||||
|
||||
### Prepare the host
|
||||
|
||||
The traefik container is aware of the __other__ docker containers in the swarm, because it has access to the docker socket at **/var/run/docker.sock**. This allows traefik to dynamically configure itself based on the labels found on containers in the swarm, which is hugely useful. To make this functionality work on our SELinux-enabled Atomic hosts, we need to add custom SELinux policy.
|
||||
|
||||
Run the following to build and activate policy to permit containers to access docker.sock:
|
||||
|
||||
```
|
||||
mkdir ~/dockersock
|
||||
cd ~/dockersock
|
||||
curl -O https://raw.githubusercontent.com/dpw/\
|
||||
selinux-dockersock/master/Makefile
|
||||
curl -O https://raw.githubusercontent.com/dpw/\
|
||||
selinux-dockersock/master/dockersock.te
|
||||
make && semodule -i dockersock.pp
|
||||
```
|
||||
|
||||
### Prepare traefik.toml
|
||||
|
||||
While it's possible to configure traefik via docker command arguments, I prefer to create a config file (traefik.toml). This allows me to change traefik's behaviour by simply changing the file, and keeps my docker config simple.
|
||||
|
||||
Create /var/data/traefik/traefik.toml as follows:
|
||||
|
||||
```
|
||||
checkNewVersion = true
|
||||
defaultEntryPoints = ["http", "https"]
|
||||
|
||||
# This section enable LetsEncrypt automatic certificate generation / renewal
|
||||
[acme]
|
||||
email = "<your LetsEncrypt email address>"
|
||||
storage = "acme.json" # or "traefik/acme/account" if using KV store
|
||||
entryPoint = "https"
|
||||
acmeLogging = true
|
||||
onDemand = true
|
||||
OnHostRule = true
|
||||
|
||||
[[acme.domains]]
|
||||
main = "<your primary domain>"
|
||||
|
||||
# Redirect all HTTP to HTTPS (why wouldn't you?)
|
||||
[entryPoints]
|
||||
[entryPoints.http]
|
||||
address = ":80"
|
||||
[entryPoints.http.redirect]
|
||||
entryPoint = "https"
|
||||
[entryPoints.https]
|
||||
address = ":443"
|
||||
[entryPoints.https.tls]
|
||||
|
||||
[web]
|
||||
address = ":8080"
|
||||
watch = true
|
||||
|
||||
[docker]
|
||||
endpoint = "tcp://127.0.0.1:2375"
|
||||
domain = "<your primary domain>"
|
||||
watch = true
|
||||
swarmmode = true
|
||||
```
|
||||
|
||||
### Prepare the docker service config
|
||||
|
||||
Create /var/data/traefik/docker-compose.yml as follows:
|
||||
|
||||
```
|
||||
version: "3.2"
|
||||
|
||||
services:
|
||||
traefik:
|
||||
image: traefik
|
||||
command: --web --docker --docker.swarmmode --docker.watch --docker.domain=funkypenguin.co.nz --logLevel=DEBUG
|
||||
ports:
|
||||
- target: 80
|
||||
published: 80
|
||||
protocol: tcp
|
||||
mode: host
|
||||
- target: 443
|
||||
published: 443
|
||||
protocol: tcp
|
||||
mode: host
|
||||
- target: 8080
|
||||
published: 8080
|
||||
protocol: tcp
|
||||
volumes:
|
||||
- /var/run/docker.sock:/var/run/docker.sock:ro
|
||||
- /var/data/traefik/traefik.toml:/traefik.toml:ro
|
||||
- /var/data/traefik/acme.json:/acme.json
|
||||
labels:
|
||||
- "traefik.enable=false"
|
||||
networks:
|
||||
- public
|
||||
deploy:
|
||||
mode: global
|
||||
placement:
|
||||
constraints: [node.role == manager]
|
||||
restart_policy:
|
||||
condition: on-failure
|
||||
|
||||
networks:
|
||||
public:
|
||||
driver: overlay
|
||||
ipam:
|
||||
driver: default
|
||||
config:
|
||||
- subnet: 10.1.0.0/24
|
||||
```
|
||||
|
||||
Docker won't start an image with a bind-mount to a non-existent file, so prepare acme.json by running ```touch /var/data/traefik/acme.json```.
|
||||
|
||||
### Launch
|
||||
|
||||
Deploy traefik with ```docker stack deploy traefik -c /var/data/traefik/docker-compose.yml```
|
||||
|
||||
Confirm traefik is running with ```docker stack ps traefik```
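Since we enabled docker's experimental features earlier, you can also watch traefik's logs from any manager node while testing. (_A sketch - the service name assumes the stack name "traefik" used above._)

```
docker service logs traefik_traefik
```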
|
||||
|
||||
## Serving
|
||||
|
||||
You now have:
|
||||
|
||||
1. Frontend proxy which will dynamically configure itself for new backend containers
|
||||
2. Automatic SSL support for all proxied resources
|
||||
|
||||
|
||||
## Chef's Notes
|
||||
|
||||
Additional features I'd like to see in this recipe are:
|
||||
|
||||
1. Include documentation of an oauth2_proxy container for protecting individual backends
2. Make the Traefik web UI available via HTTPS, protected with oauth_proxy
3. Update the design once docker swarm can avoid NAT on routing-mesh-delivered traffic
|
||||
manuscript/ha-docker-swarm/vms.md
# Virtual Machines
|
||||
|
||||
Let's start building our cloud with virtual machines. You could use bare-metal machines as well, the configuration would be the same. Given that most readers (myself included) will be using virtual infrastructure, from now on I'll be referring strictly to VMs.
|
||||
|
||||
I chose the "[Atomic](https://www.projectatomic.io/)" CentOS/Fedora image for the VM layer because:
|
||||
|
||||
1. I want less responsibility for maintaining the system, including ensuring regular software updates and reboots. Atomic's immutable nature means the OS is largely read-only, and updates are "atomic" (haha) procedures which can easily be rolled back if required.
2. For someone used to administering servers individually, Atomic is a PITA. You have to employ [tricky][atomic-trick2] [tricks][atomic-trick1] to get it to install in a non-cloud environment. It's not designed for tweaking or customizing beyond what cloud-config is capable of. For my purposes, this is good, because it forces me to change my thinking - to consider every daemon as a container, and every config as code, to be checked in and version-controlled. Atomic forces this thinking on you.
|
||||
3. I want the design to be as "portable" as possible. While I run it on VPSs now, I may want to migrate it to a "cloud" provider in the future, and I'll want the most portable, reproducible design.
|
||||
|
||||
[atomic-trick1]:https://spinningmatt.wordpress.com/2014/01/08/a-recipe-for-starting-cloud-images-with-virt-install/
|
||||
[atomic-trick2]:http://blog.oddbit.com/2015/03/10/booting-cloud-images-with-libvirt/
|
||||
|
||||
## Ingredients
|
||||
|
||||
!!! summary "Ingredients"
|
||||
3 x Virtual Machines, each with:
|
||||
|
||||
* [ ] CentOS/Fedora Atomic
|
||||
* [ ] At least 1GB RAM
|
||||
* [ ] At least 20GB disk space (_but it'll be tight_)
|
||||
* [ ] Connectivity to each other within the same subnet, and on a low-latency link (_i.e., no WAN links_)
|
||||
|
||||
|
||||
## Preparation
|
||||
|
||||
### Install Virtual machines
|
||||
|
||||
1. Install / launch virtual machines.
|
||||
2. The default username on CentOS Atomic is "centos", and you'll need to have supplied your SSH key during the build process.
|
||||
|
||||
!!! tip
|
||||
If you're not using a platform with cloud-init support (i.e., you're building a VM manually, not provisioning it through a cloud provider), you'll need to refer to [trick #1][atomic-trick1] and [#2][atomic-trick2] for a means to override the automated setup, apply a manual password to the CentOS account, and enable SSH password logins.
|
||||
|
||||
|
||||
### Prefer docker-latest
|
||||
|
||||
Run the following on each node to replace the default docker 1.12 with docker 1.13 (_which we need for swarm mode_):
|
||||
```
|
||||
systemctl disable docker --now
|
||||
systemctl enable docker-latest --now
|
||||
sed -i '/DOCKERBINARY/s/^#//g' /etc/sysconfig/docker
|
||||
```
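To confirm each node is now running docker-latest, something like this should do it:

```
# Should now report a 1.13.x server version
docker version
```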
|
||||
|
||||
|
||||
### Upgrade Atomic
|
||||
|
||||
Finally, apply any Atomic host updates, and reboot, by running: ```atomic host upgrade && systemctl reboot```.
|
||||
|
||||
|
||||
### Permit connectivity between VMs
|
||||
|
||||
By default, Atomic only permits incoming SSH. We'll want to allow all traffic between our nodes, so add something like this to /etc/sysconfig/iptables:
|
||||
|
||||
```
|
||||
# Allow all inter-node communication
|
||||
-A INPUT -s 192.168.31.0/24 -j ACCEPT
|
||||
```
|
||||
|
||||
And restart iptables with ```systemctl restart iptables```
|
||||
|
||||
### Enable host resolution
|
||||
|
||||
Depending on your hosting environment, you may have DNS automatically setup for your VMs. If not, it's useful to set up static entries in /etc/hosts for the nodes. For example, I setup the following:
|
||||
|
||||
```
|
||||
192.168.31.11 ds1 ds1.funkypenguin.co.nz
|
||||
192.168.31.12 ds2 ds2.funkypenguin.co.nz
|
||||
192.168.31.13 ds3 ds3.funkypenguin.co.nz
|
||||
```
|
||||
|
||||
|
||||
## Serving
|
||||
|
||||
After completing the above, you should have:
|
||||
|
||||
```
|
||||
[X] 3 x fresh atomic instances, at the latest releases,
|
||||
running Docker v1.13 (docker-latest)
|
||||
```
|
||||