
Let's build your awesome selfhosted platform together!

Welcome, fellow geek πŸ‘‹ If you're impatient, just start here πŸ‘‡

What to expect

The "Geek Cookbook" is a collection of how-to guides for establishing your own container-based awesome selfhosted platform, using either Docker Swarm or Kubernetes.

Running such a platform enables you to run selfhosted services such as the AutoPirate (Radarr, Sonarr, NZBGet and friends) stack, Plex, NextCloud etc., and includes supporting elements such as an SSL-terminating reverse proxy (Traefik), LetsEncrypt certificates, shared storage, and optional OAuth/OIDC authentication.

Recent updates and additions are posted on the CHANGELOG, and there's a friendly community of like-minded geeks in the Discord server.

How will this benefit me?

You're probably already familiar with concepts such as virtual machines, Docker containers, LetsEncrypt SSL certificates, databases, and command-line interfaces.

You've probably played with self-hosting some mainstream apps yourself, like Plex, NextCloud, Wordpress or Ghost.

So if you're familiar enough with the concepts above, and you've done self-hosting before, why would you read any further?

  1. You want to upskill. You want to work with container orchestration, Prometheus and Grafana, and Kubernetes.
  2. You want to play. You want a safe sandbox to test new tools, keeping the ones you want and tossing the ones you don't.
  3. You want reliability. Once you go from playing with a tool to actually using it, you want it to be available when you need it. Having to "quickly ssh into the basement server and restart plex" doesn't cut it when you finally convince your wife to sit down with you to watch sci-fi πŸ€–

Testimonials

...how useful the recipes are for people just getting started with containers...

"One of the surprising realizations from following Funky Penguins cookbooks for so long is how useful the recipes are for people just getting started with containers and how it gives them real, interesting usecases to attach to their learning" - DevOps Daniel (@DanielSHouston)

He unblocked me on all the technical hurdles to launching my SaaS in GKE!

By the time I had enlisted Funky Penguin's help, I'd architected myself into a bit of a nightmare with Kubernetes. I knew what I wanted to achieve, but I'd made a mess of it. Funky Penguin (David) was able to jump right in and offer a vital second-think on everything I'd done, pointing out where things could be simplified and streamlined, and better alternatives.

He unblocked me on all the technical hurdles to launching my SaaS in GKE!

With him delivering the container/Kubernetes architecture and helm CI/CD workflow, I was freed up to focus on coding and design, which fast-tracked me to launching on time. And now I have a simple deployment process that is easy for me to execute and maintain as a solo founder.

I have no hesitation in recommending him for your project, and I'll certainly be calling on him again in the future.

-- John McDowall, Founder, kiso.io

Who made this?

πŸ‘‹ Hi, I'm David

I’ve spent 20+ years working with technology. I’m a solution architect, with a broad range of experience and skills. I'm a full-time AWS Certified Solution Architect (Professional), a CNCF-Certified Kubernetes Administrator, Application Developer and Security Specialist.

What do you want from me?

I want your support, either in the financial sense, or as a member of our friendly geek community (or both!)

Get in touch πŸ’¬

Sponsor me ❀️

The best way to support this work is to become a GitHub Sponsor / Patreon patron. You get:

... and I get some pocket money every month to buy wine, cheese, and cryptocurrency! 🍷 💰

Impulsively click here (NOW quick do it!) to sponsor me via GitHub, or patronize me via Patreon!

Work with me 🀝

Need some Cloud / Microservices / DevOps / Infrastructure design work done? This stuff is my bread and butter! 🍞 🍴 Get in touch, and let's talk!

Buy me a coffee β˜•οΈ

A sponsorship is too much commitment, and a book is TL;DR? Hit me up with a one-time caffeine shot!

Sponsored Projects

I'm supported and motivated by GitHub Sponsors and Patreon patrons who have generously sponsored me.

I regularly donate to / sponsor the following projects. Join me in supporting these geeks, and encouraging them to continue building the ingredients for your favourite recipes!

  • Komga: GitHub Sponsors
  • Material for MkDocs: GitHub Sponsors
  • Calibre: Credit Card / Patreon / Liberapay
  • LinuxServer.io: PayPal
  • WidgetBot's Discord Widget: Patreon
  • Carl-bot: Patreon


Docker Swarm


Preparation

Highly Available Docker Swarm Design

In the design described below, our "private cloud" platform is:

  • Highly-available (can tolerate the failure of a single component)
  • Scalable (can add resource or capacity as required)
  • Portable (run it on your garage server today, run it in AWS tomorrow)
  • Secure (access protected with LetsEncrypt certificates and optional OIDC with 2FA)
  • Automated (requires minimal care and feeding)

Design Decisions

Where possible, services will be highly available.

This means that:

  • At least 3 docker swarm manager nodes are required, to provide fault-tolerance of a single failure.
  • Ceph is employed for shared storage, because it too can be made tolerant of a single failure.

Note

An exception to the 3-nodes decision is running a single-node configuration. If you only have one node, then obviously your swarm is only as resilient as that node. It's still a perfectly valid swarm configuration, ideal for starting your self-hosting journey. In fact, under the single-node configuration, you don't need ceph either, and you can simply use the local volume on your host for storage. You'll be able to migrate to ceph/more nodes if/when you expand.
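To make the three-manager decision concrete, here's a rough sketch of how such a swarm is formed (the 192.168.31.x addresses are borrowed from the Nodes recipe later in this book; your own addresses and join tokens will differ):

# On the first node, initialise the swarm, advertising its own address:
docker swarm init --advertise-addr 192.168.31.11

# Print the join command (including token) for adding further *manager* nodes:
docker swarm join-token manager

# Run the printed "docker swarm join --token ..." command on the other two nodes,
# then confirm all three show up as managers:
docker node ls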

Where multiple solutions to a requirement exist, preference will be given to the most portable solution.

This means that:

  • Services are defined using docker-compose v3 YAML syntax
  • Services are portable, meaning a particular stack could be shut down and moved to a new provider with minimal effort.
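As a tiny illustration of the docker-compose v3 syntax mentioned above, the sketch below defines and deploys a single-service stack (the "whoami" service and filename are purely examples, not part of any recipe):

# Write a minimal v3 stack definition...
cat > whoami.yml <<'EOF'
version: "3.2"
services:
  whoami:
    image: traefik/whoami
    deploy:
      replicas: 2
EOF

# ...and deploy it to the swarm as a named stack, then check its tasks:
docker stack deploy -c whoami.yml whoami
docker stack ps whoami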

Security

Under this design, the only inbound connections we're permitting to our docker swarm in a minimal configuration (you may add custom services later, like UniFi Controller) are:

Network Flows
  • HTTP (TCP 80) : Redirects to https
  • HTTPS (TCP 443) : Serves individual docker containers via SSL-encrypted reverse proxy
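In iptables terms (the same format used for the inter-node rules in the Nodes recipe), the inbound posture sketched above would look something like the rules below; treat this as an illustration rather than a complete firewall policy:

# Allow replies to established sessions, then only HTTP/HTTPS from the outside world
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A INPUT -p tcp --dport 80 -j ACCEPT
-A INPUT -p tcp --dport 443 -j ACCEPT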
Authentication
  • Where the hosted application provides a trusted level of authentication (e.g. NextCloud), or where the application requires public exposure (e.g. Privatebin), no additional layer of authentication will be required.
  • Where the hosted application provides inadequate (e.g. NZBGet) or no authentication (e.g. Gollum), a further layer of authentication against an OAuth provider will be required.

High availability

Normal function

Assuming a 3-node configuration, under normal circumstances the following is illustrated:

  • All 3 nodes provide shared storage via Ceph, which is provided by a docker container on each node.
  • All 3 nodes participate in the Docker Swarm as managers.
  • The various containers belonging to the application "stacks" deployed within Docker Swarm are automatically distributed amongst the swarm nodes.
  • Persistent storage for the containers is provided via a CephFS mount.
  • The traefik service (in swarm mode) receives incoming requests (on HTTP and HTTPS), and forwards them to individual containers. Traefik knows the containers' names because it's able to read the docker socket.
  • All 3 nodes run keepalived, at varying priorities. Since traefik is running as a swarm service and listening on TCP 80/443, requests made to the keepalived VIP and arriving at any of the swarm nodes will be forwarded to the traefik container (no matter which node it's on), and then onto the target backend.
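As an illustration of that keepalived arrangement, a minimal /etc/keepalived/keepalived.conf might look like the sketch below (the VIP 192.168.31.10 and the interface name eth0 are assumptions; give each node a different priority):

vrrp_instance docker_swarm {
    state BACKUP                 # let priority, rather than state, decide the master
    interface eth0               # adjust to your node's actual interface name
    virtual_router_id 51
    priority 100                 # e.g. 100 / 90 / 80 across the three nodes
    virtual_ipaddress {
        192.168.31.10            # the shared VIP your DNS records would point at
    }
}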

(Diagram: HA function)

Node failure

In the case of a failure (or scheduled maintenance) of one of the nodes, the following is illustrated:

  • The failed node no longer participates in Ceph, but the remaining nodes provide enough fault-tolerance for the cluster to operate.
  • The remaining two nodes in Docker Swarm achieve a quorum and agree that the failed node is to be removed.
  • The (possibly new) leader manager node reschedules the containers known to be running on the failed node, onto other nodes.
  • The traefik service is either restarted or unaffected, and as the backend containers stop/start and change IP, traefik is aware and updates accordingly.
  • The keepalived VIP continues to function on the remaining nodes, and docker swarm continues to forward any traffic received on TCP 80/443 to the appropriate node.
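For the scheduled-maintenance case, you can trigger exactly this behaviour deliberately; a quick sketch using the example node names from the Nodes recipe:

# Gracefully move all swarm workloads off ds2 before maintenance:
docker node update --availability drain ds2

# ...perform the maintenance, then return the node to service:
docker node update --availability active ds2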

(Diagram: HA function)

Node restore

When the failed (or upgraded) host is restored to service, the following is illustrated:

  • Ceph regains full redundancy
  • Docker Swarm managers become aware of the recovered node, and will use it for scheduling new containers
  • Existing containers which were migrated off the node are not migrated back
  • Keepalived VIP regains full redundancy
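Since existing containers aren't automatically migrated back, one way to re-spread a service's tasks over the recovered node (not something the recipes require, just an option) is to force a no-op rolling update:

# Force the service's tasks to be redeployed, letting the scheduler spread them
# across all currently-available nodes ("traefik" is just an example service name):
docker service update --force traefik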

(Diagram: HA function)

Total cluster failure

A day after writing this, my environment suffered a fault whereby all 3 VMs were unexpectedly and simultaneously powered off.

Upon restore, docker failed to start on one of the VMs due to a local disk space issue [1]. However, the other two VMs started, established the swarm, mounted their shared storage, and started up all the containers (services) which were managed by the swarm.

In summary, although I suffered an unplanned power outage to all of my infrastructure, followed by a failure of a third of my hosts... all my platforms are 100% available [1] with absolutely no manual intervention.


  1. Since there's no impact to availability, I can fix (or just reinstall) the failed node whenever convenient.



Nodes

Let's start building our cluster. You can use either bare-metal machines or virtual machines - the configuration would be the same. To avoid confusion, I'll be referring to these as "nodes" from now on.

Note

In 2017, I initially chose the "Atomic" CentOS/Fedora image for the swarm hosts, but later found its outdated version of Docker to be problematic with advanced features like GPU transcoding (in Plex), Swarmprom, etc. In the end, I went mainstream and settled on a modern Ubuntu installation.

Ingredients

New in this recipe:

  • 3 x nodes (bare-metal or VMs), each with:
    • A mainstream Linux OS (tested on either CentOS 7+ or Ubuntu 16.04+)
    • At least 2GB RAM
    • At least 20GB disk space (but it'll be tight)
  • Connectivity to each other within the same subnet, and on a low-latency link (i.e., no WAN links)

Preparation

Permit connectivity

Most modern Linux distributions include firewall rules which only permit minimal required incoming connections (like SSH). We'll want to allow all traffic between our nodes. The steps to achieve this differ slightly between CentOS and Ubuntu...

CentOS

Add something like this to /etc/sysconfig/iptables:

# Allow all inter-node communication
-A INPUT -s 192.168.31.0/24 -j ACCEPT

And restart iptables with systemctl restart iptables

Ubuntu

Install the (non-default) persistent iptables tools by running apt-get install iptables-persistent, which establishes some default rules (dpkg will prompt you to save the current ruleset). Then add something like this to /etc/iptables/rules.v4:

# Allow all inter-node communication
-A INPUT -s 192.168.31.0/24 -j ACCEPT

And refresh your running iptables rules with iptables-restore < /etc/iptables/rules.v4
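On either distribution, you can confirm the rule is active before moving on; for example:

# List the INPUT chain and check the inter-node ACCEPT rule is present
iptables -L INPUT -n --line-numbers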

Enable hostname resolution

Depending on your hosting environment, you may have DNS automatically set up for your VMs. If not, it's useful to set up static entries in /etc/hosts for the nodes. For example, I set up the following:

192.168.31.11 ds1 ds1.funkypenguin.co.nz
192.168.31.12 ds2 ds2.funkypenguin.co.nz
192.168.31.13 ds3 ds3.funkypenguin.co.nz

Set timezone

Set your local timezone, by running:

ln -sf /usr/share/zoneinfo/<your timezone> /etc/localtime
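On systemd-based hosts (which includes modern CentOS and Ubuntu), timedatectl is an equivalent alternative; the timezone below is just an example:

# Set the timezone and confirm it took effect
timedatectl set-timezone Pacific/Auckland
timedatectl status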

Serving

After completing the above, you should have:

Summary

Deployed in this recipe:

  • 3 x nodes (bare-metal or VMs), each with:
    • A mainstream Linux OS (tested on either CentOS 7+ or Ubuntu 16.04+)
    • At least 2GB RAM
    • At least 20GB disk space (but it'll be tight)
  • Connectivity to each other within the same subnet, and on a low-latency link (i.e., no WAN links)


β›΄ Kubernetes

Preparation

Introduction

My first introduction to Kubernetes was a children's story:

Why Kubernetes?

Why would you want to use Kubernetes for your self-hosted recipes, over simple Docker Swarm? Here's my personal take...

Docker Swarm is dead

Sorry to say, but from where I sit, there's no innovation or development happening in docker swarm.

Yes, I know: after Docker Inc sold its platform business to Mirantis in Nov 2019, Mirantis back-tracked (in Feb 2020) on their original plan to sunset swarm after 2 years, and stated that they'd continue to invest in it. But seriously, look around. Nobody is interested in swarm right now...

... Not even Mirantis! As of Nov 2021, the Mirantis blog tag "kubernetes" had 8 posts within the past month. The tag "docker" had 8 posts in the past 2 years, the 8th being the original announcement of the Docker acquisition. The tag "docker swarm" had only 2 posts, ever.

Dead. Extinct. Like the dodo.

Once you go Kubernetes, you can't go back

For years now, I've provided Kubernetes design consulting to small clients and large enterprises. The implementation details in each case vary widely, but there are some primitives which I've come to take for granted, and I wouldn't easily do without. A few examples:

  • CLI drives API from anywhere. From my laptop, I can use my credentials to manage any number of Kubernetes clusters, simply by switching kubectl "context" (see the sketch after this list). Each interaction is an API call against an HTTPS endpoint. No SSHing to hosts and manually running docker commands as root!
  • GitOps is magic. There are multiple ways to achieve it, but having changes you commit to a repo automatically applied to a cluster, "Just Worksβ„’". The process removes so much friction from making changes that it makes you more productive, and a better "gitizen" ;P
  • Controllers are trustworthy. I've come to trust that when I tell Kubernetes to run 3 replicas on separate hosts, to scale up a set of replicas based on CPU load metrics, or to provision a blob of storage for a given workload, this will be done in a consistent and visible way. I'll be able to see logs / details for each action taken by the controller, and adjust my own instructions/configuration accordingly if necessary.
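To make the first point above concrete, here's roughly what that looks like day-to-day (the context name is hypothetical):

# List the clusters/contexts this kubeconfig knows about:
kubectl config get-contexts

# Point kubectl at a different cluster, then operate on it via its HTTPS API:
kubectl config use-context my-homelab
kubectl get nodes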

Uggh, it's so complicated!

Yes, it's more complex than Docker Swarm. And that complexity can definitely be a barrier, although with improved tooling, it's continually becoming less so. However, you don't need to be a mechanic to drive a car or to use a chainsaw. You just need a basic understanding of some core primitives, and then you get on with using the tool to achieve your goals, without needing to know every detail about how it works!

Your end-goal is probably "I want to reliably self-host services I care about", and not "I want to fully understand a complex, scalable, and highly sophisticated container orchestrator". [1]

So let's get on with learning how to use the tool...

Mm.. maaaaybe, how do I start?

Primarily you need 2 things:

  1. A cluster
  2. A way to deploy workloads into the cluster

Practically, you need some extras too, but you can mix-and-match these.
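By way of a hedged example (k3d is just one of many options, and isn't prescribed here), a throwaway local cluster plus a first workload might look like this:

# Create a small local cluster (assumes k3d is installed and Docker is running):
k3d cluster create geek-lab

# Deploy a workload into it, and watch it come up:
kubectl create deployment whoami --image=traefik/whoami
kubectl get pods -w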


  1. Of course, if you do enjoy understanding the intricacies of how your tools work, you're in good company!

