Updated details on VM builds

2025-12-13 01:36:23 +00:00 · 2017-07-16 22:59:04 +12:00
parent 5548cccb24
commit d9897c5807
7 changed files with 93 additions and 22 deletions
--- a/docs/ha-docker-swarm/design.md
+++ b/docs/ha-docker-swarm/design.md
@@ -1,7 +1,64 @@
-The "private cloud" platform is:
+# Introduction

-* **Highly-available** (can tolerate the failure of a single component)
-* **Scalable** (can add resource or capacity as required)
-* **Portable** (run it on your garage server today, run it in AWS tomorrow)
-* **Secure** (access protected with LetsEncrypt certificates)
-* **Automated** (requires minimal care and feeding)
+In the design described below, the "private cloud" platform is:
+
+* **Highly-available** (_can tolerate the failure of a single component_)
+* **Scalable** (_can add resource or capacity as required_)
+* **Portable** (_run it on your garage server today, run it in AWS tomorrow_)
+* **Secure** (_access protected with LetsEncrypt certificates_)
+* **Automated** (_requires minimal care and feeding_)
+
+## Design Decisions
+
+**Where possible, services will be highly available.**
+
+This means that:
+
+* At least 3 docker swarm manager nodes are required, to provide fault-tolerance of a single failure.
+* GlusterFS is employed for share filesystem, because it too can be made tolerant of a single failure.
+
+**Where multiple solutions to a requirement exist, preference will be given to the most portable solution.**
+
+This means that:
+
+* Services are defined using docker-compose v3 YAML syntax
+* Services are portable, meaning a particular stack could be shut down and moved to a new provider with minimal effort.
+
+## High availability
+
+### Normal function
+
+Assuming 3 nodes, under normal circumstances the following is illustrated:
+
+* All 3 nodes provide shared storage via GlusterFS, which is provided by a docker container on each node. (i.e., not running in swarm mode)
+* All 3 nodes participate in the Docker Swarm as managers.
+* The various containers belonging to the application "stacks" deployed within Docker Swarm are automatically distributed amongst the swarm nodes.
+* Persistent storage for the containers is provide via GlusterFS mount.
+* The **traefik** service (in swarm mode) receives incoming requests (on http and https), and forwards them to individual containers. Traefik knows the containers names because it's able to access the docker socket.
+* All 3 nodes run keepalived, at different priorities. Since traefik is running as a swarm service and listening on TCP 80/443, requests made to the keepalived VIP and arriving at **any** of the swarm nodes will be forwarded to the traefik container (no matter which node it's on), and then onto the target backend.
+
+![HA function](images/docker-swarm-ha-function.png)
+
+### Node failure
+
+In the case of a failure (or scheduled maintenance) of one of the nodes, the following is illustrated:
+
+* The failed node no longer participates in GlusterFS, but the remaining nodes provide enough fault-tolerance for the cluster to operate.
+* The remaining two nodes in Docker Swarm achieve a quorum and agree that the failed node is to be removed.
+* The (possibly new) leader manager node reschedules the containers known to be running on the failed node, onto other nodes.
+* The **traefik** service is either restarted or unaffected, and as the backend containers stop/start and change IP, traefik is aware and updates accordingly.
+* The keepalived VIP continues to function on the remaining nodes, and docker swarm continues to forward any traffic received on TCP 80/443 to the appropriate node.
+
+![HA function](images/docker-swarm-node-failure.png)
+
+### Node restore
+
+When the failed (or upgraded) host is restored to service, the following is illustrated:
+
+* GlusterFS regains full redundancy
+* Docker Swarm managers become aware of the recovered node, and will use it for scheduling **new** containers
+* Existing containers which were migrated off the node are not migrated backend
+* Keepalived VIP regains full redundancy
+
+
+![HA function](images/docker-swarm-node-restore.png)