Deploying a Highly Available Kubernetes cluster to AWS using KOPS

In my previous posts I have talked a lot about deploying a Kubernetes cluster. For the most part I have used kube-aws from CoreOS, which has served me quite well. In the last few months, however, a lot has happened in the Kubernetes space, and a new tool has become very interesting: Kops, a subproject of the Kubernetes project.

Both kube-aws and Kops have started getting support for HA deployments (a requirement for production workloads) and cluster upgrades. One of the major advantages of Kops is its ability to manage and maintain multiple clusters, because it stores the ‘cluster state’ in an S3 bucket.

In this post I will do a walkthrough of how to deploy a highly available cluster using Kops. I will base this on the tutorial page here: https://github.com/kubernetes/kops/blob/master/docs/aws.md. The main difference is that I will describe an HA deployment instead of a regular deployment.

Installing the toolset

Pretty much all of this is covered in the tutorial here: https://github.com/kubernetes/kops/blob/master/docs/aws.md. I am using a Mac and will use ‘brew install’ to get the needed command line tools installed. You need the AWS command line client (aws-cli), the Kubernetes client (kubectl) and the Kops client installed.

Install the clients:
brew install awscli
brew install kubernetes-cli
brew install kops
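
After installing, you can quickly confirm that the tools are on your path; the exact version output will of course differ per installation:

aws --version
kubectl version --client
kops version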

Once these tools are installed, please make sure to configure aws-cli; you can see how here: https://docs.aws.amazon.com/sdk-for-go/v1/developer-guide/configuring-sdk.html#specifying-credentials
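
If you have not set up credentials before, running ‘aws configure’ is the quickest way; the values below are placeholders, not real keys:

aws configure
# AWS Access Key ID [None]: <your access key>
# AWS Secret Access Key [None]: <your secret key>
# Default region name [None]: eu-west-1
# Default output format [None]: json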

Creating the cluster

First we need to point Kops to the S3 bucket where it will keep the state of all the deployed clusters. We do this by setting an environment variable as follows:

export KOPS_STATE_STORE=s3://my-kops-bucket-that-is-a-secret

This bucket is needed because Kops maintains the cluster state in it. This means we can get an overview of all deployed clusters using the Kops tool, as it simply queries the file structure in the S3 bucket.
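
If the bucket does not exist yet, it can be created with the AWS CLI. This is just a sketch; the bucket name matches the example above, and versioning is optional but handy so you can roll back to older cluster state:

aws s3api create-bucket --bucket my-kops-bucket-that-is-a-secret --region eu-west-1 --create-bucket-configuration LocationConstraint=eu-west-1
aws s3api put-bucket-versioning --bucket my-kops-bucket-that-is-a-secret --versioning-configuration Status=Enabled

With the state store in place, listing all clusters that Kops knows about is a single command:

kops get clusters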

Once the S3 bucket is created and the variable is set, we can go ahead and create the cluster as follows:

kops create cluster --name=dev.robot.mydomain.com --master-zones=eu-west-1a,eu-west-1b,eu-west-1c --zones=eu-west-1a,eu-west-1b,eu-west-1c --node-size=t2.micro --node-count=5

The parameters are relatively self-explanatory; it is however important that the name includes the fully qualified domain name, because Kops will try to register the subdomain in the Route53 hosted zone. The parameter that makes the setup HA is the list of multiple availability zones in --master-zones: Kops deploys one master node per listed zone, and it spreads the worker nodes across the zones given in --zones as well.
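
If you want to double-check that the hosted zone for the domain actually exists before running the command, the AWS CLI can list it; the domain here is of course just the example used throughout this post:

aws route53 list-hosted-zones-by-name --dns-name mydomain.com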

The kops create command above has created a cluster configuration that is now stored in the S3 bucket; the actual cluster is not launched yet. You can further edit the cluster configuration as follows:

kops edit cluster dev.robot.mydomain.com
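
Besides the cluster spec, the node and master definitions live in instance groups, which can be edited in the same way. For example, to change something about the worker nodes (a sketch, using the cluster name from above):

kops get instancegroups --name=dev.robot.mydomain.com
kops edit ig nodes --name=dev.robot.mydomain.com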

Launching the cluster

After you have finished editing, we can launch the cluster:

kops update cluster dev.robot.mydomain.com --yes

This will take a bit to complete, but after a while you should see roughly the following list of running EC2 instances, where we can see the nodes running in different availability zones:

master-eu-west-1c.masters.dev.robot.mydomain.com	eu-west-1c	m3.medium	running
master-eu-west-1b.masters.dev.robot.mydomain.com	eu-west-1b	m3.medium	running
master-eu-west-1a.masters.dev.robot.mydomain.com	eu-west-1a	m3.medium	running
nodes.dev.robot.mydomain.com	eu-west-1a	t2.micro	running
nodes.dev.robot.mydomain.com	eu-west-1b	t2.micro	running
nodes.dev.robot.mydomain.com	eu-west-1b	t2.micro	running
nodes.dev.robot.mydomain.com	eu-west-1c	t2.micro	running
nodes.dev.robot.mydomain.com	eu-west-1c	t2.micro	running
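
Depending on your Kops version, you can also let Kops itself verify that all masters and nodes have registered correctly against the state store:

kops validate cluster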

Kops will also have set up your kubectl configuration, so we can ask for all available nodes as below:

kubectl get nodes
NAME                                           STATUS         AGE
ip-172-20-112-108.eu-west-1.compute.internal   Ready,master   8m
ip-172-20-114-138.eu-west-1.compute.internal   Ready          6m
ip-172-20-126-52.eu-west-1.compute.internal    Ready          7m
ip-172-20-56-106.eu-west-1.compute.internal    Ready,master   7m
ip-172-20-58-2.eu-west-1.compute.internal      Ready          7m
ip-172-20-69-113.eu-west-1.compute.internal    Ready          7m
ip-172-20-75-48.eu-west-1.compute.internal     Ready,master   8m
ip-172-20-86-155.eu-west-1.compute.internal    Ready          7m
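
To confirm that scheduling works across the zones, a quick smoke test is to start a small workload and check which nodes the pods land on; the name and image are arbitrary examples:

kubectl run hello-nginx --image=nginx
kubectl get pods -o wide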

Chaos monkey

I have tried actually killing some of the master nodes to see if I could still schedule a load. The problem I faced was that the cluster kept operating and the existing containers remained available, but I could not schedule new workloads. This was because the ‘etcd’ cluster is deployed as part of the master nodes, and with the terminated masters gone it no longer had the minimum number of members needed for a quorum. Most likely moving etcd out of the master nodes would increase the reliability further.

The good news is that once the master nodes recovered from the unexpected termination the cluster resumed regular operation.

Conclusion

I hope the above shows that it is now relatively easy to set up an HA Kubernetes cluster. In practice it is quite handy to have an HA cluster, and the next step is to move etcd out of the master nodes to make the solution even more resilient.