Every software system requires an extensive amount of configuration and maintenance. From provisioning servers, deploying code, and installing and upgrading services to storing data, managing DNS records and applying security protocols, there is a tremendous amount to consider. At Paxos, we run extensive microservice deployments in various environments on AWS. Maintaining them at a high level while keeping overhead minimal and visibility high is critical, yet challenging. Without an easy way to set up and orchestrate a system, it can quickly become difficult to understand and maintain. A series of ad hoc decisions and manual configuration steps can lead to a system that is inefficient, hard to support and highly unpredictable. Some of the challenges engineers face with such systems today are:
- Production systems can be hard to maintain if they are set up without documentation
- Setup can take a very long time, especially when it is error-prone
- Money can be spent inefficiently on servers that no one knows were set up, or whose specs are not tracked properly
- Even more money can be wasted on hiring and training dedicated personnel to set up and troubleshoot the system
- Ramping new employees up to maintain a system is a lengthy process; codifying the architecture makes it significantly easier for a new engineer to understand all components of the system
- It is hard to precisely replicate environments (Dev, QA and Prod)
- It is hard to restore a crashed / destroyed system
Terraform allows every engineer to simply “Write, Plan, and Create Infrastructure as Code”. Its code can be applied to most cloud providers, and it supports several on-prem capable systems. Terraform is fully extensible: anyone can write a provider that supports a new type of infrastructure, and almost any stateful system with an API can have a usable Terraform provider. HashiCorp publishes a list of all official providers, and the open source community maintains many unofficial ones.
Once written, Terraform modules can be invoked to provision servers (such as EC2 instances or RDS instances on AWS) and to set up routing, permissions, access controls and DNS records with little to no manual work. Such a setup is easily repeatable, which allows the same setup to be used for different environments while also providing an accurate recovery story. Terraform is modular and allows every service to be set up and torn down separately, which simplifies troubleshooting a service’s setup and configuration. Modules behave like functions with side effects (variables are parameters and outputs are return values) and can be called from other modules via a flexible source-reference mechanism.
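To make the "modules as functions" analogy concrete, here is a minimal sketch of a module and a caller. It assumes Terraform 0.12 or later syntax and the AWS provider; the module path, variable names, instance type and IDs are all hypothetical placeholders and are not taken from our repository.

```hcl
# --- modules/cassandra_node/main.tf (hypothetical module) ---

# Variables play the role of function parameters.
variable "environment" {
  description = "Environment name, e.g. dev, qa or prod"
  type        = string
}

variable "subnet_id" {
  description = "Subnet in which to launch the node"
  type        = string
}

variable "ami_id" {
  description = "AMI to boot the node from"
  type        = string
}

# The side effect: an EC2 instance is created.
resource "aws_instance" "node" {
  ami           = var.ami_id
  instance_type = "t3.medium" # placeholder size
  subnet_id     = var.subnet_id

  tags = {
    Name = "cassandra-${var.environment}"
  }
}

# Outputs play the role of return values.
output "private_ip" {
  value = aws_instance.node.private_ip
}

# --- main.tf in the calling configuration ---

module "cassandra_node" {
  source      = "./modules/cassandra_node"
  environment = "qa"
  subnet_id   = "subnet-0123456789abcdef0" # placeholder ID
  ami_id      = "ami-0123456789abcdef0"    # placeholder ID
}

# The caller reads the module's "return value".
output "cassandra_ip" {
  value = module.cassandra_node.private_ip
}
```

The source argument can also point at a Git repository or a registry address rather than a local path, which is what makes modules easy to share between configurations.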
Keeping Terraform code in source control allows the system to be reviewed, documented and tracked. This is highly valuable when making and presenting architectural decisions, because the system’s architecture can be reviewed like any other code change.
Beyond that, using Terraform enforces a unified process for updating systems, which prevents a single person from going rogue and making changes without the rest of the team knowing.
Along with Terraform, we use Terragrunt to pass parameters to Terraform. Terraform code can declare variables (such as environments, subnets, domain names, etc.), so the same code can be invoked multiple times by Terragrunt with environment-specific values for QA, Dev, Staging or Production.
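As a rough illustration, a per-environment Terragrunt file might look like the sketch below. It uses the terragrunt.hcl layout of newer Terragrunt releases (older releases kept equivalent settings in a different file format), and the directory layout, module path and input values are hypothetical.

```hcl
# live/qa/cassandra/terragrunt.hcl (hypothetical path)

# Which Terraform code to run for this environment.
terraform {
  source = "../../../modules/cassandra_node"
}

# Environment-specific values, passed to the module's variables.
inputs = {
  environment = "qa"
  subnet_id   = "subnet-0123456789abcdef0" # placeholder ID
  ami_id      = "ami-0123456789abcdef0"    # placeholder ID
}
```

A sibling directory for another environment would reuse the same source and change only the inputs, which keeps the Terraform code itself identical across environments.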
Terraform plays nicely with a variety of tools such as Packer, which is used to create cloud images (e.g. AWS AMIs), and Chef, which handles server-based deployment and configuration of software. In this post, I’m adding an example that shows how to use Packer to create a Cassandra image. I have found Packer to be very helpful in automating the creation of images for our deployments.
Terraform can save its setup state in JSON format, which allows for incremental (or staggered) deployment. Once a state has been saved, it can be read and updated incrementally or deleted.
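State can live in a local file or in a remote backend. Since this post keeps state in S3, here is a minimal sketch of an S3 backend block; the bucket name, key and region are hypothetical placeholders.

```hcl
terraform {
  backend "s3" {
    bucket  = "example-terraform-state"          # hypothetical bucket
    key     = "management/vpc/terraform.tfstate" # path of this state file in the bucket
    region  = "us-east-1"                        # placeholder region
    encrypt = true                               # encrypt the state object at rest
  }
}
```

Giving each component its own key is one way to get the staggered behavior described above, because every component can then be planned, applied or destroyed independently of the others.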
Here’s an example of how one can set up a typical microservices deployment with Terraform, and then replicate it across environments in AWS:
- Set up the management VPC, security groups and IAM roles, whose state automatically gets written to S3.
- Set up management tools (such as Jenkins, Prometheus, Grafana and security modules). The state of these new objects also gets written to S3.
- Set up the environment VPC, security groups and subnets (with peering to the management VPC). This environment depends on the Terragrunt parameters, such as environment names and IP ranges.
- Using the “terraform_remote_state” data source, we can read the remote state and set up EC2 instances, RDS, Cassandra, Kubernetes or any other objects that are based on the stored resources (VPCs, subnets and IAM roles, to name a few) that were written to S3.
- Set up a Kubernetes cluster for the applications based on the database IPs that were written to S3.
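The remote-state step in the list above can be sketched as follows. This assumes Terraform 0.12 or later syntax, and it assumes that the environment layer exports an output named private_subnet_id; that output name, the bucket and the key are hypothetical.

```hcl
variable "ami_id" {
  description = "AMI to boot the instance from"
  type        = string
}

# Read the state that an earlier layer wrote to S3.
data "terraform_remote_state" "environment" {
  backend = "s3"

  config = {
    bucket = "example-terraform-state"        # hypothetical bucket
    key    = "qa/environment/terraform.tfstate"
    region = "us-east-1"                      # placeholder region
  }
}

# Place a new instance in a subnet created by that earlier layer.
resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = "t3.medium" # placeholder size

  # Only values declared as outputs in the other state are readable here.
  subnet_id = data.terraform_remote_state.environment.outputs.private_subnet_id
}
```

Because only declared outputs are visible, each layer ends up with an explicit interface toward the layers built on top of it.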
Use case: Cassandra with Packer and Terraform
To demonstrate our use of Terraform, I have published an example that uses Terraform to set up Cassandra.
A repository with Terragrunt and Terraform code is live here: https://github.com/paxos-bankchain/terraform-examples
In this example, we set up three Cassandra nodes in a private VPC and subnet, as well as a bastion host to allow SSHing into them. Our Terraform code builds an AMI using Packer, creates the Cassandra nodes from that AMI, and then sets the Cassandra boxes’ IPs and other settings on the nodes so that the cluster can come up.
At a high level, going to the terraform-examples/terragrunt/test directory and running terragrunt apply-all sets up a Cassandra cluster in a private VPC along with a bastion host.
This code demonstrates the following key capabilities of Terraform and Terragrunt:
- The backend is set to S3 with connection details specified in Terragrunt.
- Variables are passed in from Terragrunt.
- Packer build: this section creates an AMI based on a hash of changes (if nothing has changed, no new AMI is created). Packer builds are defined in image.json.
- ENI: a network interface object is created so that IPs are consistent and known during and across instance creations.
- Creating EC2 instances as Cassandra nodes
- Running custom config (IPs, etc.) with user_data once an instance is created.
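The ENI, EC2 and user_data items above can be sketched roughly as follows. This is not the code from the repository: the IP addresses, instance type and the configuration script path are hypothetical, the syntax assumes Terraform 0.12 or later, and the exact block used to attach an interface to an instance varies across AWS provider versions.

```hcl
variable "subnet_id" {
  type = string
}

variable "ami_id" {
  type = string
}

# Fixed private IPs, so every node's address is known before it boots.
variable "node_ips" {
  type    = list(string)
  default = ["10.0.1.10", "10.0.1.11", "10.0.1.12"] # placeholder addresses
}

# One network interface per node, each pinned to a known IP.
resource "aws_network_interface" "cassandra" {
  count       = length(var.node_ips)
  subnet_id   = var.subnet_id
  private_ips = [var.node_ips[count.index]]
}

# One instance per interface; replacing an instance keeps the same IP.
resource "aws_instance" "cassandra" {
  count         = length(var.node_ips)
  ami           = var.ami_id
  instance_type = "m5.large" # placeholder size

  network_interface {
    network_interface_id = aws_network_interface.cassandra[count.index].id
    device_index         = 0
  }

  # Runs on first boot and passes the cluster's IPs to the node.
  user_data = <<-EOF
    #!/bin/bash
    # Hypothetical script assumed to be baked into the AMI.
    /opt/cassandra/configure.sh --seeds "${join(",", var.node_ips)}" --listen "${var.node_ips[count.index]}"
  EOF
}
```

Pinning the IPs on the interfaces rather than on the instances is what lets the seed list be written into user_data before any node exists.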
We have found that this combination of steps works very well and allows us to replicate the same logic to most other types of applications (Kafka, Zookeeper and Nginx to name a few).
Terraform has many provisioners that allow you to address all aspects of your environment setup and customization.
We set up various objects in a staggered manner and run Terraform code via Terragrunt to set up management, VPCs, IAM roles, Security Groups and application hosts. Terraform state is backed by S3.
Terraform comes with its own set of challenges: often, one has to think creatively about how to write Terraform code that sets up a specific system. The “null_resource” resource allows you to define custom actions that are not officially supported, and we had to rely on it significantly.
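As one illustration of the null_resource approach, the sketch below re-runs a Packer build only when the template file changes, in the spirit of the hash-based Packer step listed earlier. It assumes Terraform 0.12 or later, the null provider, and a packer binary on the PATH of the machine running Terraform; the code in the repository may do this differently.

```hcl
resource "null_resource" "cassandra_image" {
  # Re-create this resource (and so re-run the provisioner) only when
  # the Packer template's hash changes.
  triggers = {
    template_hash = filemd5("${path.module}/image.json")
  }

  # Custom action: run Packer locally against the template.
  provisioner "local-exec" {
    command = "packer build ${path.module}/image.json"
  }
}
```

The trade-off is that Terraform only tracks the trigger value, not what the command actually produced, so cleanup and failure handling for such actions are left to the author.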
When developing in Terraform, it is recommended to first perform the actual installation on a bare EC2 instance, and then re-create the same steps in Packer for the image, Chef for certain applications and recipes, bash for configuration (“user_data”), and Terraform to set up the IAM, VPC and hosts.
Running Terraform allows for absolute visibility into our systems and for the ability to replicate and reproduce environments; due to these important benefits, we choose to use Terraform to set up our environments at Paxos.