# Shared Storage (Ceph)

While Docker Swarm is great for keeping containers running (_and restarting those that fail_), it does nothing for persistent storage. This means if you actually want your containers to keep any data persistent across restarts (_hint: you do!_), you need to provide shared storage to every docker node.

## Design

### Why not GlusterFS?

I originally provided shared storage to my nodes using GlusterFS (see the next recipe for details), but found it difficult to deal with because:

1. GlusterFS requires (n) "bricks", where (n) **has** to be a multiple of your replica count. I.e., if you want 2 copies of everything on shared storage (the minimum to provide redundancy), you **must** have either 2, 4, 6 (etc.) bricks. The HA swarm design calls for a minimum of 3 nodes, and so under GlusterFS my third node can't participate in shared storage at all, unless I start doubling up on bricks-per-node (which then impacts redundancy).
2. GlusterFS turns out to be a giant PITA when you want to restore a failed node. There are at [least 14 steps to follow](https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3/html/Administration_Guide/sect-Replacing_Hosts.html) to replace a brick.
3. I'm pretty sure I messed up the 14-step process above anyway. My replaced brick synced with my "original" brick, but produced errors when querying status via the CLI, and hogged 100% of 1 CPU on the replaced node. Inexperienced with GlusterFS, and unable to diagnose the fault, I switched to a Ceph cluster instead.

### Why Ceph?

1. I'm more familiar with Ceph - I use it in the OpenStack designs I manage.
2. Replacing a failed node is **easy**, provided you can put up with the I/O load of rebalancing OSDs after the replacement.
3. CentOS Atomic includes the ceph client in the OS, so while the Ceph OSD/Mon/MDS daemons run in containers, I can keep an eye on (and later, automatically monitor) the status of Ceph from the base OS.

## Ingredients

!!! summary "Ingredients"
    3 x Virtual Machines (configured earlier), each with:

    * [X] CentOS/Fedora Atomic
    * [X] At least 1GB RAM
    * [X] At least 20GB disk space (_but it'll be tight_)
    * [X] Connectivity to each other within the same subnet, and on a low-latency link (_i.e., no WAN links_)
    * [ ] A second disk dedicated to the Ceph OSD

## Preparation

### SELinux

Since our Ceph components will be containerized, we need to ensure the SELinux context on the base OS's ceph files is set correctly:

```
mkdir /var/lib/ceph
chcon -Rt svirt_sandbox_file_t /etc/ceph
chcon -Rt svirt_sandbox_file_t /var/lib/ceph
```

### Setup Monitors

Pick a node, and run the following to stand up the first Ceph mon. Be sure to replace the values for **MON_IP** and **CEPH_PUBLIC_NETWORK** with those specific to your deployment:

```
docker run -d --net=host \
--restart always \
-v /etc/ceph:/etc/ceph \
-v /var/lib/ceph/:/var/lib/ceph/ \
-e MON_IP=192.168.31.11 \
-e CEPH_PUBLIC_NETWORK=192.168.31.0/24 \
--name="ceph-mon" \
ceph/daemon mon
```

Now **copy** the contents of /etc/ceph on this first node to the remaining nodes, and **then** run the docker command above (_customizing MON_IP as you go_) on each remaining node.

You'll end up with a cluster of 3 monitors (an odd number is required for quorum, same as Docker Swarm), and no OSDs (yet).
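Before adding OSDs, it's worth confirming that the 3 mons have actually formed a quorum. A quick sanity check might look like this (a sketch, assuming the ceph CLI bundled with Atomic and the /etc/ceph config copied above):

```
# Overall cluster status; all 3 monitors should appear in the quorum list
ceph -s

# Or query the monitor map specifically
ceph mon stat

# If a mon isn't joining, its container logs are the first place to look
docker logs -f ceph-mon
```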
### Setup OSDs

Since we currently have an OSD-less, mon-only cluster, prepare for OSD creation by dumping the auth credentials for the OSDs into the appropriate location on the base OS:

```
ceph auth get client.bootstrap-osd -o \
/var/lib/ceph/bootstrap-osd/ceph.keyring
```

On each node, you need a dedicated disk for the OSD. In the example below, I used _/dev/vdd_ (the entire disk, no partitions) for the OSD. Run the following command on every node:

```
docker run -d --net=host \
--privileged=true \
--pid=host \
-v /etc/ceph:/etc/ceph \
-v /var/lib/ceph/:/var/lib/ceph/ \
-v /dev/:/dev/ \
-e OSD_DEVICE=/dev/vdd \
-e OSD_TYPE=disk \
--name="ceph-osd" \
--restart=always \
ceph/daemon osd
```

Watch the output by running ```docker logs ceph-osd -f```, and confirm success.

!!! note "Zapping the device"
    The Ceph OSD container will refuse to destroy a partition containing existing data, so it may be necessary to "zap" the target disk, using:

    ```
    docker run -d --privileged=true \
    -v /dev/:/dev/ \
    -e OSD_DEVICE=/dev/vdd \
    ceph/daemon zap_device
    ```

### Setup MDSs

In order to mount our ceph pools as filesystems, we'll need Ceph MDS(s). Run the following on each node:

```
docker run -d --net=host \
--name ceph-mds \
--restart always \
-v /var/lib/ceph/:/var/lib/ceph/ \
-v /etc/ceph:/etc/ceph \
-e CEPHFS_CREATE=1 \
-e CEPHFS_DATA_POOL_PG=256 \
-e CEPHFS_METADATA_POOL_PG=256 \
ceph/daemon mds
```

### Apply tweaks

The ceph container seems to configure a pool default of 3 replicas (3 copies of each block are retained), which is one more than our cluster needs (we're only protecting against the failure of a single node). Run the following on any node to reduce the cephfs pools to 2 replicas:

```
ceph osd pool set cephfs_data size 2
ceph osd pool set cephfs_metadata size 2
```

Disable "scrubbing" (which can be IO-intensive, and is unnecessary on a VM) with:

```
ceph osd set noscrub
ceph osd set nodeep-scrub
```

### Create credentials for swarm

In order to mount the ceph volume onto our base host, we need to provide cephx authentication credentials. On **one** node, create a client for the docker swarm:

```
ceph auth get-or-create client.dockerswarm osd 'allow rw' \
mon 'allow r' mds 'allow' > /etc/ceph/keyring.dockerswarm
```

Grab the secret associated with the new user (you'll need it for the /etc/fstab entry below) by running:

```
ceph-authtool /etc/ceph/keyring.dockerswarm -p -n client.dockerswarm
```

### Mount MDS volume

On each node, create a mountpoint for the data by running ```mkdir /var/data```, add an entry to fstab to ensure the volume is auto-mounted on boot (paste in the secret you grabbed above after ```secret=```), and then mount the volume (the _netdev option ensures mounting waits until the network is up, covering any boot delay in reaching the ceph volume):

```
mkdir /var/data
MYHOST=`hostname -s`
echo -e "
# Mount cephfs volume \n
$MYHOST:6789:/ /var/data/ ceph \
name=dockerswarm\
,secret=\
,noatime,_netdev,context=system_u:object_r:svirt_sandbox_file_t:s0\
0 2" >> /etc/fstab
mount -a
```
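To confirm the storage really is shared, write a file on one node and read it back from another. A quick test (a sketch; the test filename is just an example):

```
# On the first node, drop a test file into the cephfs mount
echo "hello from $(hostname -s)" > /var/data/ceph-test.txt

# On either of the other nodes, the same file should be visible
cat /var/data/ceph-test.txt
df -h /var/data

# Clean up when done
rm /var/data/ceph-test.txt
```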
### Install docker-volume plugin

Upstream bug for docker-latest reported at https://bugs.centos.org/view.php?id=13609

And the alpine fault: https://github.com/gliderlabs/docker-alpine/issues/317

## Serving

After completing the above, you should have:

```
[X] Persistent storage available to every node
[X] Resiliency in the event of the failure of a single node
```

## Chef's Notes

Future enhancements to this recipe include:

1. Rather than pasting a secret key into /etc/fstab (which feels wrong), I'd prefer to be able to set "secretfile" in /etc/fstab (which just points ceph.mount to a file containing the secret), but under the current CentOS Atomic, we're stuck with "secret", per https://bugzilla.redhat.com/show_bug.cgi?id=1030402 (see the sketch below).
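For reference, a rough sketch of what that secretfile approach would look like once mount.ceph on Atomic supports it (the file path below is illustrative only):

```
# Dump the secret into a root-only file instead of embedding it in /etc/fstab
ceph-authtool /etc/ceph/keyring.dockerswarm -p -n client.dockerswarm \
> /etc/ceph/dockerswarm.secret
chmod 600 /etc/ceph/dockerswarm.secret

# The fstab entry would then reference the file via secretfile= rather than secret=:
# <host>:6789:/ /var/data/ ceph name=dockerswarm,secretfile=/etc/ceph/dockerswarm.secret,noatime,_netdev 0 2
```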