Lately I’ve been using HashiCorp’s Terraform a lot to manage infrastructure. It certainly has some big things going for it. It supports a whole bunch of providers (including on-prem, non-cloud stuff like VMware and Docker), as well as some database engines and DNS providers, and can even manage GitHub teams. It can plan changes before committing them (which CloudFormation only very recently learned), and it can store the current state of your infrastructure in Consul. In another big step past CloudFormation, it has provisioners, including local execution, remote execution, file copying and Chef (strangely there is no built-in support for Puppet, but remote-exec can do that), that can reach out to your newly-created instances and take actions on them.
Terraform also has a template provider that’s used any time you need a templated file, such as EC2 instance user-data or dynamically generated scripts to place on hosts. Terraform uses a DSL for its configuration, either the JSON-like but slightly-more-human-readable HashiCorp Configuration Language (HCL) or the same information conveyed in pure JSON. The configuration language supports variables (passed in at the command line or in a file) and is based on string interpolation with a handful of functions defined. It’s also worth noting that Terraform is written in Go. It has a plugin system, but only for Providers and Provisioners, so there’s no way to add core functionality (I suppose I’ve been spoiled by Puppet having such good support for adding core functionality via Ruby, or HashiCorp’s Vagrant having a config file that itself is Ruby).
Now that I’ve been nice and said some great things about Terraform (and it really is great; at least for the way my current job is managing infrastructure, I’ve fallen in love with it, and it certainly does fix some shortcomings that I found in CloudFormation, specifically with pre-execution plans and the ability to interact with resources), on to my complaints of the day.
My first complaint is that by default, Terraform stores the state of your infrastructure in a file in your current working directory. It uses this to attempt to figure out the already-existing resources you’ve created, and only make the required changes. The first time I used Terraform, I completely destroyed one of our (luckily non-production) services; coworkers of mine have brought down production services because of this.
Let’s say that we have a Terraform configuration which takes one variable, `environment`. That variable determines the VPC and subnets we deploy into, our DNS names, and also gets passed to EC2 instances via user-data. We build our infrastructure with `environment = "prod"`, and everything works right - we now have a production cluster of our service. Then we want to test some changes, so we run again with `environment = "dev"`. The naive - and logical - assumption would be that we get a second “dev” cluster of our service. Nope. Terraform finds the `terraform.tfstate` file in our current directory, reads it, and takes it to be the current state of our infrastructure. It sees that we changed `environment` from “prod” to “dev”… so it destroys our EC2 instances and DNS record, and creates new ones for “dev” (applying the requested changes).
This teaches us two important points:
- Always run `terraform plan`. Even if you think your changes are trivial, examine what Terraform will do before running `terraform apply`.
- Always run `terraform` through a wrapper. We have a simple Rake task in an internal rubygem that ensures that Terraform will always store state in Consul, so it won’t be locked to one person’s local machine. The task also removes any local state files before running, so they won’t pollute the run or cause changes intended for one isolated instance of our Terraform configuration to be applied to another.
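A wrapper along those lines might look like the sketch below. It is written in Python rather than Rake purely for illustration, and it only covers the local half of the job: it deletes stale state files and builds a plan-then-apply pair of commands. The function names and the `tfplan` file name are made up for this example, and the Consul remote-state setup is deliberately left out; deleting local state is only safe once remote state is already configured.

```python
import os
import subprocess

# State files Terraform leaves in the working directory by default.
LOCAL_STATE_FILES = ("terraform.tfstate", "terraform.tfstate.backup")


def remove_local_state(workdir):
    """Delete stale local state files so they cannot leak into this run."""
    removed = []
    for name in LOCAL_STATE_FILES:
        path = os.path.join(workdir, name)
        if os.path.exists(path):
            os.remove(path)
            removed.append(name)
    return removed


def build_commands(environment, plan_file="tfplan"):
    """Return the plan and apply commands for one environment."""
    plan = ["terraform", "plan",
            "-var", "environment=%s" % environment,
            "-out=%s" % plan_file]
    # Applying the saved plan file means we apply exactly what was reviewed.
    apply = ["terraform", "apply", plan_file]
    return plan, apply


def run(environment, workdir="."):
    """Clean up, show the plan, and apply only after a human confirms it."""
    remove_local_state(workdir)
    plan, apply = build_commands(environment)
    subprocess.check_call(plan, cwd=workdir)
    if input("Apply this plan? [y/N] ").strip().lower() == "y":
        subprocess.check_call(apply, cwd=workdir)
```

Saving the plan with `-out` and then applying that file is one way to make sure nobody skips the review step: the apply can only do what the plan already showed.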
Terraform’s configuration interpolation has a bunch of built-in functions for working with variables. They’re a subset of what you’d expect in a language that is mainly based around strings, arrays and maps/hashes: split, join, concat, lookup (get a hash item by key), index (find the index of an item in a list), element (return the n’th element of a list), format (sprintf-like), etc. However, there’s no function to retrieve only unique elements from a list. This becomes a problem especially when dealing with multi-AZ/multi-subnet AWS resources, as some of them (e.g. managing a set number of individual EC2 instances outside of an ASG, such as when assigning static IPs) require a list of subnets matching the number of resources, and others (cross-AZ ELBs) require a list of unique subnets.
Terraform and its language have no way to add this functionality (see note below); the only option that I’ve found is to wrap Terraform in some sort of runner (I use Rake but you could use any scripting or Make-like language) that does whatever manipulation and calculation is needed, and passes in the results as distinct variables (i.e. the full subnet list, and the unique subnet list, as separate variables). To make this even more difficult, though Terraform supports loading build-time variables from a JSON or HCL file instead of the command line, it only supports taking in variables as strings (even in JSON). So in our subnet example, our wrapper script needs to join the list of subnets into a string (e.g. CSV), and then whenever we use the variable in Terraform, we need to `split()` it on our separator character (because Terraform doesn’t support variable setting or manipulation).
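As a rough sketch of that workaround (again in Python rather than Rake, with hypothetical variable names `subnet_ids` and `unique_subnet_ids`), the wrapper can compute both lists, flatten each into a comma-separated string, and write them to a JSON variables file:

```python
import json


def build_subnet_vars(subnet_ids, separator=","):
    """Return string-valued variables for the full and the unique subnet lists."""
    unique = []
    for subnet in subnet_ids:
        if subnet not in unique:  # keep first-seen order
            unique.append(subnet)
    return {
        # One entry per instance, duplicates allowed.
        "subnet_ids": separator.join(subnet_ids),
        # Each subnet once, for resources such as cross-AZ ELBs.
        "unique_subnet_ids": separator.join(unique),
    }


def write_var_file(path, variables):
    """Write the variables as JSON so Terraform can load them from a file."""
    with open(path, "w") as handle:
        json.dump(variables, handle, indent=2, sort_keys=True)


if __name__ == "__main__":
    subnets = ["subnet-aaa", "subnet-bbb", "subnet-aaa"]
    write_var_file("subnets.json", build_subnet_vars(subnets))
```

The resulting file can be handed to Terraform with `-var-file`, and inside the configuration each value is turned back into a list with something like `split(",", var.unique_subnet_ids)`.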
“Terraform [has] no way to add this functionality” - I’m aware that I could fork Terraform, learn Go, and submit pull requests for all of the features I think would be useful; and if I had maybe half a dozen fewer unfinished projects, I’d probably do that. However, this still means that HashiCorp would need to accept and merge my PRs and release a new version, or else I’d need to build and distribute my forked version. Terraform supports plugins, but only for Providers and Provisioners, not language internals. What I’d really like is a way to define plugin functions that could be distributed without having to rebuild all of Terraform.
Terraform variables can have default values defined for them. However, these default values have no way of using other variables. This rules out even relatively common use cases, like a service that has a name and a DNS record, both of which can be overridden but with the DNS record defaulting to “SERVICE_NAME.example.com”. The only options that I’ve been able to figure out are to either do it in your wrapper script (which means the Terraform configs can’t be run without the wrapper) or use the `coalesce` function: give your variable an empty default value, and then fall back to a second interpolated string if the variable is empty.
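The wrapper-script option could look something like the following sketch. The variable names `service_name` and `dns_name` and both helper functions are hypothetical; the point is only that the dependent default is computed before Terraform ever sees the variables.

```python
def resolve_variables(service_name, dns_name=None, domain="example.com"):
    """Fill in the DNS name from the service name unless it was overridden."""
    if not dns_name:
        dns_name = "%s.%s" % (service_name, domain)
    return {"service_name": service_name, "dns_name": dns_name}


def to_var_args(variables):
    """Turn a dict of variables into `-var key=value` command-line arguments."""
    args = []
    for key in sorted(variables):
        args.extend(["-var", "%s=%s" % (key, variables[key])])
    return args


if __name__ == "__main__":
    # Prints the full command line a wrapper would run for the "billing" service.
    print(["terraform", "plan"] + to_var_args(resolve_variables("billing")))
```

The cost is exactly the one described above: the configuration now has a required `dns_name` variable that only the wrapper knows how to fill in.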
Terraform’s configuration language also lacks conditional statements such as `if`. This poses a problem with all but the simplest applications, and is certainly likely to be an issue for anyone who wants to do the right thing and use the same tooling to deploy multiple environments. It seems that the only options are to either pass in the necessary information as variables from a wrapper script, or generate Terraform configurations with other tooling. The former works only if the desired result is a variable in your configuration; there’s simply no way that I’ve found to have a conditional around resource(s). The only obvious option for that is to take advantage of Terraform’s ability to read configurations as JSON, and simply generate your entire Terraform configuration with another tool.
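To make the generate-it-yourself approach concrete, here is a minimal sketch of a generator that wraps a resource in an ordinary Python `if`. Everything specific in it is a placeholder chosen for illustration: the resource names, the AMI ID, the instance types, the zone ID and the `main.tf.json` file name are not from any real setup.

```python
import json


def build_config(environment, create_dns_record):
    """Build a Terraform configuration as a plain dict, ready to dump as JSON."""
    resources = {
        "aws_instance": {
            "app": {
                "ami": "ami-00000000",  # placeholder AMI ID
                "instance_type": "t2.micro" if environment == "dev" else "m4.large",
            }
        }
    }
    # The conditional lives in the generator, not in Terraform.
    if create_dns_record:
        resources["aws_route53_record"] = {
            "app": {
                "zone_id": "ZONE_ID_PLACEHOLDER",
                "name": "app.%s.example.com" % environment,
                "type": "A",
                "ttl": "300",
                "records": ["${aws_instance.app.private_ip}"],
            }
        }
    return {"resource": resources}


if __name__ == "__main__":
    with open("main.tf.json", "w") as handle:
        json.dump(build_config("dev", create_dns_record=False), handle, indent=2)
```

Terraform picks up `.tf.json` files alongside regular `.tf` files, so the generated file can sit next to hand-written configuration. The downside is that the JSON on disk is now a build artifact, and the real source of truth is the generator.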