The Twelve Factor App is a description of how to build apps that run well on Heroku, but it’s also proven to be of general usefulness in describing good principles for modern web application design.
A twelve factor app is only useful if you have an infrastructure that supports it, however; that’s what we’ll be focusing on here.
Some of the twelve factors are more prescriptive than others, but most of them boil down to a few general principles: keep a clean contract between your app and the environment it runs in, get all configuration from that environment, and keep instances stateless and disposable so they can be added, removed, and replaced freely.
This post will walk you through building an example infrastructure that holds to these principles using Vault to manage MySQL credentials and Consul for service discovery. We’ll build the core, land a demo todo service in the cluster, and then take a look at how the different services interact and how they behave when there are failures.
We’ll be building out the example infrastructure in a cluster with Vagrant; everything needed to follow along is included on github here.
The demo cluster is essentially a bunch of infrastructure to support an example todo REST api written in Go. (You can find the source for it here, but feel free to ignore it; the cluster will install a copy built and hosted by Drone, and I'll explain how to use it later on.)

The example todo service is very simple and its only dependency is a MySQL database, but sharing database credentials while maintaining environment independence and security is easier said than done.
We can't keep credentials consistent across all environments because that wouldn't be secure. We could drop config files on the box for our app to read, but that isn't very secure either, and it's hard to manage (we would have to keep track of all those creds and where they go, as well as have a game plan for changing them).
Vault solves these problems for us: the MySQL Secret Backend manages the creation of and access to credentials, Access Control Policies define what a given app has access to, and pluggable auth backends control how apps prove who they are (our demo will use the App ID Auth Backend).
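To make this concrete, here is a rough sketch of the kind of vault commands involved (the connection string, role SQL, and policy file are assumptions for illustration; the demo's real configuration lives in its provisioning scripts):

# mount the MySQL secret backend and point it at our MySQL server (via the vaultadmin account)
vault mount mysql
vault write mysql/config/connection value="vaultadmin:<password>@tcp(mysql.service.consul:3306)/"

# define a "todo" role: the SQL vault runs to mint short-lived credentials
vault write mysql/roles/todo \
    sql="CREATE USER '{{name}}'@'%' IDENTIFIED BY '{{password}}'; GRANT ALL ON todo.* TO '{{name}}'@'%';"

# a policy limiting the todo app to reading creds for its own role
vault policy-write todo todo-policy.hcl

# enable app-id auth and map an app-id to the "todo" policy
vault auth-enable app-id
vault write auth/app-id/map/app-id/<app-id> value=todo display_name=todo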
A second problem is that your MySQL server probably lives at a different address in each environment. We use consul to provide a consistent way to discover it. This way we don't have to care about the actual address of MySQL; we can always just look it up via mysql.service.consul regardless of what environment we are in.
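For example, any node running a consul agent can resolve the address of a healthy MySQL instance through consul's DNS interface (8600 is the agent's default DNS port):

# ask the local consul agent where a healthy mysql instance lives
dig @127.0.0.1 -p 8600 mysql.service.consul SRV +short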
With these problems solved, we are left with a single todo artifact that provides our service and gets all of its configuration from the environment it is installed in. This way we can install the same thing in all environments and add as many instances as we want to scale it.
Before we start provisioning the cluster, here's a list of the VMs we will build and their roles in the cluster.
- consul provides our consul server (in a real environment, we would have several instances to provide high availability). Services in the cluster register their address with consul and discover other services through it, and vault makes use of the key/value store it exposes as a data backend.
- vault0 is one of two nodes running the vault server. Though vault can be used for much more, we are only using it to generate and expose MySQL credentials to our todo services. In addition to storing its data in consul, vault registers its address with consul with a health check configured to ensure only healthy, unsealed instances of vault are offered up for use.
- vault1 is the second vm running the vault server. Since it boots second, it will come online in standby mode. This means that it won't actually respond to requests from services, but will instead redirect them to the leader (vault0). In the event of a failure on vault0, it will take over as leader and start servicing requests directly.
- mysql provides a MySQL server. Vault is configured to manage creds for it using a vaultadmin account, and our todo services will persist todo entries to it.
- todo0 is an instance of our todo app, a REST api that manages todo entries. It uses MySQL (discovered with consul) for data persistence and vault (also discovered with consul) to acquire MySQL creds.
- todo1 is a second instance of the todo service. Since these services are stateless, they are also totally interchangeable.

Provisioning this cluster is a little more involved than a typical vagrant cluster. Configuring and managing vault shouldn't be done entirely by configuration management, since you're going to rely on it to keep your secrets safe. We typically want these operations to be initiated (or even performed directly) by a human, not by a system that remains perpetually authenticated within our datacenter. Otherwise, we haven't secured anything - we've only added another turtle to the stack.
For this reason, I've provided scripts to configure vault rather than doing it in puppet the way the rest of the cluster is configured. These scripts are in no way secure (you probably shouldn't write the vault keys and root token out to files accessible within your production cluster, for instance), but they are included to illustrate the separation of configuration concerns while keeping the demo cluster automated.
The scripts that configure vault revolve around the App ID auth backend's two tokens. The first, app-id, is created ahead of time and can be included in your VM with config management. The second, user-id, needs to be unique per instance and made available in some way other than config management to keep our eggs out of a single basket.

A single app-id is used by each instance of our todo service. When we add it to vault, we also associate it with the "todo" policy (which in turn allows access to the mysql "todo role" credential generator). In addition to setting the app-id here, we have made it available via hiera for puppet to configure the "todo" instances with.

We mint user-ids for our two todo instances and pass them to vagrant up todo0 and vagrant up todo1 as environment variables so they can be set on the new instances without going through our puppet configuration. In a real environment, this would ensure that only trusted instances were given a user-id, since a privileged user must be authenticated to create one.

The "todo" service's init script will pick up the user-id and app-id and inject them as environment variables into the todo service.
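The todo service does this programmatically, but the CLI equivalent is roughly as follows (standard app-id and mysql backend paths; the "todo" role name follows the policy above):

# authenticate with this instance's app-id/user-id pair
vault auth -method=app-id app_id=<app-id> user_id=<user-id>

# read a fresh, short-lived MySQL username/password for the "todo" role
vault read mysql/creds/todo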
# clone the demo repo
git clone https://github.com/benschw/vault-demo.git
cd vault-demo
# clone the puppet modules used to provision our cluster
./puppet-deps.sh
# provision the core infrastructure
vagrant up consul vault0 vault1 mysql
# initialize, unseal, and configure `vault`
./01-init.sh && ./02-unseal.sh && ./03-configure.sh
# mint `user-id`s and provision the `todo` instances configured with them
./04-provision-todo.sh
# confirm that everything went well
./test-todo-service.sh
If you want to watch the todo service come online like in my recording below, just run the test script with watch:
watch --color ./test-todo-service.sh
recording of the cluster being provisioned
Sure the test works, but maybe you want to see the service in action for yourself!
Here's an example of how to use the todo api with todo0's IP hardwired in:
curl -X POST http://172.20.20.14:8080/todo -d '{"status": "new", "content": "Hello World"}'
{"id":1,"status":"new","content":"Hello World"}
curl http://172.20.20.14:8080/todo/1
{"id":1,"status":"new","content":"Hello World"}
curl -X PUT http://172.20.20.14:8080/todo/1 -d '{"status": "open", "content": "Hello Galaxy"}'
{"id":1,"status":"open","content":"Hello Galaxy"}
# (we can use `todo1`'s IP too)
curl http://172.20.20.15:8080/todo/1
{"id":1,"status":"open","content":"Hello Galaxy"}
curl -i -X DELETE http://172.20.20.15:8080/todo/1
HTTP/1.1 204 No Content
Content-Type: application/json
Date: Thu, 09 Jul 2015 15:21:48 GMT
In the following sections I will talk about how the various components scale and interact, as well as how they (and subsequently the todo service) behave in the face of various failures. Each failure I introduce is accompanied by a recording that will hopefully help illustrate the behaviors.
Each recording has the same seven terminal panes: the test output (top left) showing the health of the vault and two todo vms (the top two addresses are vault, the bottom two are todo), a shell on my host OS (bottom left), and a pane for each of the vault0, vault1, mysql, todo0, and todo1 vms.

Consul uses the Raft Consensus Algorithm to manage a highly consistent and highly available key/value and service discovery cluster. It uses Serf and the Gossip Protocol to share state between the cluster nodes. This essentially allows each node to discover all other nodes (as well as the services registered on them) by simply joining the cluster.
Every VM has a consul client running on it that keeps the primary service of that VM registered with consul for discovery. This way, VMs can come and go, or change IP, and the services that rely on them don’t need to be reconfigured.
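As a rough sketch, registering a service with the local agent can be as simple as a PUT to its HTTP API; the check endpoint and interval here are made up for illustration (a service definition file on the agent works just as well):

# register the local todo service, with an HTTP health check for consul to poll
curl -X PUT http://localhost:8500/v1/agent/service/register -d '{
  "Name": "todo",
  "Port": 8080,
  "Check": { "HTTP": "http://localhost:8080/todo", "Interval": "10s" }
}'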
…But I only included a single consul server node in this demo, and if we take it away bad things will happen. I’ll skip the “fail” video since this is a solved problem that I only omitted in the demo because I was already up to 6 vms.
(Here’s a post I wrote previously on Provisioning Consul with Puppet, or just scan through all of my posts - I use consul in a lot of them.)
Each vault instance is stateless and a cluster of them is as HA as its backend. We are using consul’s key/value store as a backend, so we can make vault HA by standing up two servers backed by our consul cluster.
Vault provides high availability by electing a leader server and having additional servers standing by to take over in the event of a problem. These standby nodes also do request forwarding to the leader.
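A minimal sketch of the server config that produces this behavior, assuming a local consul agent and TLS left off for brevity (the file path, advertise address, and exact option names are assumptions; check the vault docs for your version):

# write the vault server config (illustrative path and values)
cat > /etc/vault/vault.hcl <<'EOF'
backend "consul" {
  address        = "127.0.0.1:8500"
  path           = "vault"
  advertise_addr = "http://<this-vault-server>:8200"
}

listener "tcp" {
  address     = "0.0.0.0:8200"
  tls_disable = 1
}
EOF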
Multiple failover instances offer essentially the same resiliency as multiple hot services, but they don't help to scale the application. The vault docs state that "in general, the bottleneck of Vault is the physical backend itself, not Vault core" and suggest scaling the backend (consul in our case) to increase vault's capacity.
In the following recording, you can see that both todo instances remain healthy unless all vault servers go down.
You can also see that the todo services don't start failing until a while after both vault servers are in a critical state. This is because the mysql creds that vault is exposing to the todo service are good for a minute, so the app doesn't realize vault is gone for up to a minute. We could actually avoid a "todo" failure altogether by hanging onto our old MySQL connection if vault isn't available, but I left that logic out of the todo service since this is largely a vault demo. Additionally, we wouldn't want to rely entirely on this, since new services would have nowhere to get their creds from.
Another thing to note is that after I start a vault server back up, I still need to unseal it (by running a script from my host OS in the bottom left pane) before it becomes healthy again. Every time we add a new vault server or restart an existing one, it must be manually unsealed.
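Unsealing just means feeding enough of the key shares from vault init back into the sealed server (by default, three of the five shares it generated):

# run against the sealed instance (e.g. with VAULT_ADDR pointed at it)
vault unseal <unseal-key-1>
vault unseal <unseal-key-2>
vault unseal <unseal-key-3>

# confirm the server is unsealed, and whether it is active or standing by
vault status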
At the time of writing, the most recent vault release is 0.1.2. This release has a bug that makes failing over with a consul backend very slow. The bug is fixed in master, however, so I went ahead and built the server used in this demo from that.
Next up, the uncomfortable single point of failure: MySQL.
There are, of course, strategies involving replicating to slaves, or even master-master replication, but those are all out of scope for this demo.
We are still, however, exposing the MySQL address to our “todo” service with consul so that if we decided to add in an HA mechanism, it could be done without rework in our application.
For completeness, here’s our service crashing hard when we take away MySQL:
Our todo service is stateless and relies on MySQL to store todo entries. We are always requesting an address to MySQL from consul, which ensures that we only get addresses to healthy instances. We are also registering the “todo” service with consul, so as long as others discover it through that interface, we can scale it by adding instances.
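The same lookup works for anything that wants to consume the todo api: either over DNS (todo.service.consul, just like mysql) or through the local agent's health endpoint, which filters out unhealthy instances:

# list only the todo instances currently passing their health checks
curl http://localhost:8500/v1/health/service/todo?passing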
There is no other configuration needed for our app, but if there was, we could add it to consul’s key/value store in order to maintain a clean contract with our system and zero divergence between environments.
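For example, a value could be set once per environment and read by every instance at startup using consul's key/value HTTP API (the key name here is made up):

# set an environment-specific value once...
curl -X PUT -d 'some-value' http://localhost:8500/v1/kv/todo/config/example

# ...and let any instance read it back at startup
curl http://localhost:8500/v1/kv/todo/config/example?raw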
This means that our todo service can be installed in any environment without modification, instances can come and go, new instances can be added, and they will all fold neatly into the existing ecosystem.
In the following recording, track the status of the todo instances in the top left pane. The “Health” column shows the instance’s status according to consul, and the “Test” column shows the instance’s status according to a test run from our host OS. For the most part, this test won’t fail because we aren’t testing services known to be unhealthy, but there is a narrow window (up to 5s) after the instance has been stopped but before consul has run its health check and noticed the problem.
Vault is a promising application, but there are a few areas where its immaturity is still cause for concern.
One, it’s hard to automate. Managing secrets securely is hard, and to do it right, a person often has to get involved. The pain (inconvenience) this causes is especially apparent in the “unseal” requirement when starting a vault server. It is mentioned in the docs that there are plans to address this, but for now we’re stuck doing it by hand.
Another potential problem is a weak ecosystem. It’s hard to expect too much since the product is so new, but it’s still difficult to adopt a bleeding edge application to use at the core of your infrastructure when it’s hard to find people talking about it. Additionally, there isn’t community support software (like a puppet module to install it) yet, so you’d be managing a lot yourself (you can hound solarkennedy for this; he did a great job with puppet-consul).
Lastly, it would be really nice if there were a mechanism to scale it. Scaling the backend might be a big part of this, but it's not the whole picture, and there are certain use cases (using vault to protect large amounts or a high throughput of data with the Transit Secret Backend) that probably can't be met with a single node.
All that being said, Hashicorp has a great track record of building solid, well-received apps (such as consul and vagrant, featured in this demo), and vault is still very young (its first release was April 28th, 2015), so I have high hopes that this will be another win for safe, resilient, and simple infrastructure.