From 7377f522b0159302f50fd22c937e6f1444f7cbbb Mon Sep 17 00:00:00 2001
From: David Young
Date: Mon, 17 Jul 2017 10:21:08 +1200
Subject: [PATCH] Updated design after suffering power failure

---
 docs/ha-docker-swarm/design.md | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/docs/ha-docker-swarm/design.md b/docs/ha-docker-swarm/design.md
index 306ab40..101373e 100644
--- a/docs/ha-docker-swarm/design.md
+++ b/docs/ha-docker-swarm/design.md
@@ -62,3 +62,13 @@ When the failed (or upgraded) host is restored to service, the following is illu
 
 ![HA function](images/docker-swarm-node-restore.png)
 
+
+### Total cluster failure
+
+A day after writing this, my environment suffered a fault whereby all three VMs were unexpectedly and simultaneously powered off.
+
+Upon restore, docker failed to start on one of the VMs due to a local disk space issue[^1]. However, the other two VMs started, established the swarm, mounted their shared storage, and started up all the containers (services) which were managed by the swarm.
+
+In summary, although I suffered an **unplanned power outage to all of my infrastructure**, followed by a **failure of a third of my hosts**... ==all my platforms are 100% available with **absolutely no manual intervention**==.
+
+[^1]: Since there's no impact to availability, I can fix (or just reinstall) the failed node whenever convenient.