Infrastructure as Code, or IaC for short, is a fundamental shift in software engineering and in the way Ops think about the provisioning and maintenance of infrastructure. Although IaC has been a de facto industry standard for the past few years, many still seem to disagree on its definition, best practices, and limitations.
This article will walk through the evolution of this approach to infrastructure workflows and the related technologies that were born out of it. We will explain where IaC came from and where it is likely going, looking at both its benefits and key limitations.
From Iron to Clouds
Remember the Iron age of IT, when you actually bought your own servers and machines? Me neither. Seems quite crazy right now that infrastructure growth was limited by the hardware purchasing cycle. And since it would take weeks for a new server to arrive, there was little pressure to rapidly install and configure an operating system on it. People would simply slot a disc into the server and follow a checklist. A few days later it was available for developers to use. Again, crazy.
With the launch and widespread adoption of AWS EC2 in 2006, shortly after Ruby on Rails 1.0 arrived in late 2005, many enterprise teams found themselves dealing with scaling problems previously experienced only at massive multinational organizations. Cloud computing and the ability to effortlessly spin up new VM instances brought a great deal of benefits for engineers and businesses, but it also meant they now had to babysit an ever-growing portfolio of servers.
The infrastructure footprint of the average engineering organization became much bigger, as a handful of large machines were replaced by many smaller instances. Suddenly, there were a lot more things Ops needed to provision and maintain, and this infrastructure tended to be cyclic. We might scale up to handle load during a peak day, and then scale down at night to save on cost, because capacity is no longer a fixed item. Unlike owning depreciating hardware, we're now paying for resources by the hour. So it made sense to use only the infrastructure you needed, in order to fully benefit from a cloud setup.
To leverage this flexibility, a new paradigm is required. Filing a thousand tickets every morning to spin up to our peak capacity and another thousand at night to spin back down, while manually managing all of this, clearly starts to become quite challenging. The question is then, how do we begin to operationalize this setup in a way that's reliable and robust, and not prone to human error?
Infrastructure as Code
Infrastructure as Code was born to answer these challenges in a codified way. IaC is the process of managing and provisioning data centers and servers through machine-readable definition files, rather than physical hardware configuration or human-configured tools. Now, instead of manually working through a hundred different configuration steps, IaC allows us to simply run a script that brings up a thousand machines every morning and automatically brings the infrastructure back down in the evening to whatever the appropriate size should be.
Ever since the launch of AWS CloudFormation in 2011, IaC has quickly become an essential DevOps practice, indispensable to a competitively paced software delivery lifecycle. It enables engineering teams to rapidly create and version infrastructure in the same way they version source code and to track these versions to avoid inconsistency among IT environments. Typically, teams implement it as follows:
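The morning and evening scaling described above can be sketched in a few lines. This is a minimal illustration, not any particular tool's API: the peak window, the instance counts, and the function names are all assumptions made up for the example.

```python
from datetime import time

# Hypothetical values, chosen only for illustration.
PEAK_WINDOW = (time(8, 0), time(20, 0))
PEAK_INSTANCES = 1000
OFF_PEAK_INSTANCES = 100


def desired_instances(now: time) -> int:
    """Return how many instances should be running at the given time of day."""
    start, end = PEAK_WINDOW
    return PEAK_INSTANCES if start <= now < end else OFF_PEAK_INSTANCES


def scaling_action(current: int, now: time) -> str:
    """Describe the change needed to move from `current` to the desired count."""
    delta = desired_instances(now) - current
    if delta > 0:
        return f"scale up by {delta}"
    if delta < 0:
        return f"scale down by {-delta}"
    return "no change"


if __name__ == "__main__":
    print(scaling_action(current=100, now=time(9, 0)))
    print(scaling_action(current=1000, now=time(22, 0)))
```

The point of the sketch is that the schedule lives in code, so it can be reviewed, versioned, and run unattended instead of being filed as tickets.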
- Developers define and write the infrastructure specs in a language that is domain-specific
- The files that are created are sent to a management API, master server, or code repository
- An IaC tool such as Pulumi then takes all the necessary actions to create and configure the necessary computing resources
And voilà, your infrastructure is suddenly working for you again instead of the other way around.
There are traditionally two approaches to IaC, declarative and imperative, and two possible methods, push and pull. The declarative approach describes the eventual target by defining the desired state of your resources. It answers the question of what needs to be created, e.g. “I need two virtual machines”. The imperative approach answers the question of how the infrastructure needs to be changed to achieve a specific goal, usually through a sequence of commands. Ansible playbooks, which run tasks in the order they are written, are often cited as an example of the imperative approach, although many individual Ansible modules are declarative. The difference between the push and pull methods lies in how servers receive their configuration. In the pull method, the server pulls its configuration from the controlling server, while in the push method the controlling server pushes the configuration to the destination system.
The IaC tooling landscape has been in constant evolution over the past ten years and it would probably take up a whole other article to give a comprehensive overview of all the different options one has to implement this approach for their specific infrastructure. We have however compiled a quick timeline of the main tools, sorted by GA release date:
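To make the declarative idea more concrete, here is a minimal sketch of how a tool might compare a desired state with the current state and derive the actions itself. The `plan` function and the resource names are hypothetical and do not correspond to any real tool's interface.

```python
from typing import Dict, List, Tuple


def plan(desired: Dict[str, dict], current: Dict[str, dict]) -> List[Tuple[str, str]]:
    """Compute the actions that would turn `current` into `desired`."""
    actions: List[Tuple[str, str]] = []
    # Resources that are declared but do not exist yet.
    for name in sorted(desired.keys() - current.keys()):
        actions.append(("create", name))
    # Resources that exist but differ from their declaration.
    for name in sorted(desired.keys() & current.keys()):
        if desired[name] != current[name]:
            actions.append(("update", name))
    # Resources that exist but are no longer declared.
    for name in sorted(current.keys() - desired.keys()):
        actions.append(("delete", name))
    return actions


if __name__ == "__main__":
    # "I need two small virtual machines" expressed as desired state.
    desired = {"vm-1": {"size": "small"}, "vm-2": {"size": "small"}}
    current = {"vm-1": {"size": "large"}, "vm-3": {"size": "small"}}
    for action in plan(desired, current):
        print(action)
```

In an imperative setup the author would write the create, update, and delete steps by hand and in order. In the declarative sketch they fall out of the comparison, and running `plan` against an already matching state yields no actions at all.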
- AWS CloudFormation (Feb 2011)
- Ansible (Feb 2012)
- Azure Resource Manager (Apr 2014)
- Terraform (Jun 2014)
- GCP Cloud Deployment Manager (Jul 2015)
- Serverless Framework (Oct 2015)
- AWS Amplify (Nov 2018)
- Pulumi (Sep 2019)
- AWS Copilot (Jul 2020)
This is an extremely dynamic vertical of the DevOps industry, with new tools and competitors popping up every year and old incumbents constantly innovating; CloudFormation, for instance, got a nice new feature just last year: CloudFormation modules.
The good, the bad
Thanks to such a strong competitive push to improve, IaC tools have time and again innovated to generate more value for the end-user. The largest benefits for teams using IaC can be clustered in a few key areas:
- Speed and cost reduction: IaC allows faster execution when configuring infrastructure and aims at providing visibility to help other teams across the enterprise work quickly and more efficiently. It frees up expensive resources to work on other value-adding activities.
- Scalability and standardization: IaC delivers stable environments, rapidly and at scale. Teams avoid manual configuration of environments and enforce consistency by representing the desired state of their environments via code. Infrastructure deployments with IaC are repeatable and prevent runtime issues caused by configuration drift or missing dependencies. IaC completely standardizes the setup of infrastructure so there is a reduced possibility of any errors or deviations.
- Security and documentation: If all compute, storage and networking services are provisioned with code, they also get deployed the same way every time. This means security standards can be easily and consistently enforced across the company. IaC also serves as a form of documentation of the proper way to instantiate infrastructure, and as insurance in case employees leave your company with important knowledge. Because code can be version-controlled, IaC allows every change to your server configuration to be documented, logged and tracked.
- Disaster recovery: As the term suggests, this one is pretty important. IaC is an extremely efficient way to track your infrastructure and redeploy the last healthy state after a disruption or disaster of any kind happens. As anyone who has been woken up at 4am because their site was down will tell you, the importance of quickly recovering after your infrastructure got messed up cannot be overstated.
There are more specific advantages to particular setups, but these are in general where we see IaC having the biggest impact on engineering teams’ workflows. And it’s far from trivial: introducing IaC as an approach to manage your infrastructure can be a crucial competitive edge. What many miss when discussing IaC, however, are some of the important limitations that it still brings with it. If you have already implemented IaC at your organization or are in the process of doing so, you’ll know it’s not all roses like most blog posts about it will have you believe. For an illustrative (and hilarious) example of the hardships of implementing an IaC solution like Terraform, I highly recommend checking out The terrors and joys of terraform by Regis Wilson.
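Two of the benefits above, detecting configuration drift and redeploying the last healthy state, can be sketched roughly as follows. The `Revision` type, its `healthy` flag, and both function names are assumptions for the sake of the example. Real tools track state and health in their own ways.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Revision:
    """One version-controlled snapshot of the infrastructure configuration."""
    version: int
    config: Dict[str, str]
    healthy: bool  # Hypothetical flag, e.g. set after post-deploy checks pass.


def detect_drift(declared: Dict[str, str], actual: Dict[str, str]) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return every key whose actual value differs from the declared one."""
    keys = sorted(declared.keys() | actual.keys())
    return {
        key: (declared.get(key), actual.get(key))
        for key in keys
        if declared.get(key) != actual.get(key)
    }


def last_healthy(history: List[Revision]) -> Optional[Revision]:
    """Return the most recent revision marked healthy, or None if there is none."""
    for revision in reversed(history):
        if revision.healthy:
            return revision
    return None
```

Because every revision is stored as code, recovering after an incident becomes a matter of picking the revision returned by `last_healthy` and applying it again, while `detect_drift` shows where the running system has wandered away from what was declared.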
In general, introducing IaC also implies four key limitations one should be aware of:
- Logic and conventions: Your developers still need to understand IaC scripts, and whether those are written in HashiCorp Configuration Language (HCL) or plain Python or Ruby, the problem is not so much the language as the specific logic and conventions they need to be confident applying. If even a relatively small part of your engineering team is not familiar with the declarative approach (we see this often in large enterprises with legacy systems, e.g. .NET) or any other core IaC concepts, you will likely end up in a situation where Ops plus whoever does understand them become a bottleneck. If your setup requires everyone to understand these scripts in order to deploy their code, onboarding and rapid scaling will create problems.
- Maintainability and traceability: While IaC provides a great way for tracking changes to infrastructure and monitoring things such as infra drift, maintaining your IaC setup tends to itself become an issue after a certain scale (approx. over 100 developers in our experience). When IaC is used extensively throughout an organization with multiple teams, traceability and versioning of the configurations are not as straightforward as they initially seem.
- RBAC: Building on that, Access Management quickly becomes challenging too. Setting roles and permissions across the different parts of your organization that suddenly have access to scripts to easily spin up clusters and environments can prove quite demanding.
- Feature lag: Vendor-agnostic IaC tooling (e.g. Terraform) often lags behind vendor feature releases. This is because tool vendors need to update providers to fully cover the new cloud features being released at an ever-growing rate. The impact is that sometimes you cannot leverage a new cloud feature unless you (1) extend the functionality yourself, (2) wait for the vendor to provide coverage, or (3) introduce new dependencies.
Once again, these are not the only drawbacks of rolling out IaC across your company but are some of the more acute pain points we witness when talking to engineering teams.
As mentioned, the IaC market is in a state of constant evolution and new solutions to these challenges are being experimented with already. As an example, Open Policy Agent (OPA) currently provides a good answer to the lack of a defined RBAC model in Terraform and is also supported by Pulumi's policy-as-code tooling.
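The kind of check such a policy layer performs can be sketched in plain Python. Note that this is not OPA and not its policy language. The roles, environments, and the `is_allowed` function are hypothetical and only illustrate the idea of evaluating a rule before an IaC action runs.

```python
from typing import Dict, Set, Tuple

# Hypothetical role model: which (action, environment) pairs each role may use.
ROLE_PERMISSIONS: Dict[str, Set[Tuple[str, str]]] = {
    "developer": {("plan", "dev"), ("apply", "dev"), ("plan", "staging")},
    "ops": {
        ("plan", "dev"), ("apply", "dev"),
        ("plan", "staging"), ("apply", "staging"),
        ("plan", "production"), ("apply", "production"),
    },
}


def is_allowed(role: str, action: str, environment: str) -> bool:
    """Return True only if the role is explicitly granted the action."""
    # Unknown roles get an empty permission set, so the default is to deny.
    return (action, environment) in ROLE_PERMISSIONS.get(role, set())
```

Keeping the rules as data, separate from the IaC scripts themselves, is what makes them reviewable and enforceable in one place. The deny-by-default behavior for unknown roles is a design choice of this sketch.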
The biggest question though remains the need for everyone in the engineering organization to understand IaC (language, concepts, etc.) to fully operationalize the approach. In the words of our CTO Chris Stephenson “If you don’t understand how it works, IaC is the biggest black box of them all”. This creates a mostly unsolved divide between Ops, who are trying to optimize their setup as much as possible, and developers, who are often afraid of touching IaC scripts for fear of messing something up. This leads to all sorts of frustrations and waiting times.
There are two main routes that engineering teams currently take to address this gap:
- Everyone executes IaC on a case-by-case basis. A developer needs a new DB and executes the correct Terraform. This approach works if everybody is familiar with IaC in detail. Otherwise you execute and pray that nothing goes wrong. Which works, sometimes.
- Alternatively, the execution of the IaC setup is baked into a pipeline. As part of the CD flow, the infrastructure will be fired up by the respective pipeline. This approach has the upside that it conveniently happens in the background, without the need to manually intervene from deploy to deploy. The downside however is that these pipeline-based approaches are hard to maintain and govern. You can see the ugliest Jenkins beasts evolving over time. It’s also not particularly dynamic, as the resources are bound to the specifics of the pipeline. If you just need a plain DB, you’ll need a dedicated pipeline.
Neither of these approaches really solves for the gap between Ops and devs. Both are still shaky or inflexible. Looking ahead, Internal Developer Platforms (IDPs) can bridge this divide and provide an additional layer between developers and IaC scripts. By allowing Ops to set clear rules and golden paths for the rest of the engineering team, IDPs enable developers to conveniently self-serve infrastructure through a UI or CLI, which is provisioned under the hood by IaC scripts. Developers only need to worry about what resources (DB, DNS, storage) they need to deploy and run their applications, while the IDP takes care of calling IaC scripts through dedicated drivers to serve the desired infrastructure back to the engineers.
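The layering described here can be sketched as a simple dispatch from a resource request to a driver. Everything in this example is hypothetical: the driver functions are stubs, the resource kinds are taken from the list above, and none of it reflects the actual interface of Humanitec or any other platform.

```python
from typing import Callable, Dict


def provision_db(name: str) -> str:
    # A real driver would invoke an IaC tool here; this stub only reports what it would do.
    return f"db '{name}' provisioned via IaC script"


def provision_dns(name: str) -> str:
    return f"dns '{name}' provisioned via IaC script"


def provision_storage(name: str) -> str:
    return f"storage '{name}' provisioned via IaC script"


# Golden paths defined by Ops: one driver per resource kind developers may request.
DRIVERS: Dict[str, Callable[[str], str]] = {
    "db": provision_db,
    "dns": provision_dns,
    "storage": provision_storage,
}


def request_resource(kind: str, name: str) -> str:
    """Resolve a developer's request to the driver that Ops registered for it."""
    driver = DRIVERS.get(kind)
    if driver is None:
        raise ValueError(f"no golden path defined for resource type '{kind}'")
    return driver(name)
```

A developer in this sketch only states what they need, for example `request_resource("db", "orders")`, and never touches the underlying scripts. Ops stay in control by deciding which drivers appear in the registry.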
We believe IDPs are the next logical step in the evolution of Infrastructure as Code. Humanitec is a framework to build your own Internal Developer Platform. We are soon publishing a library of open-source drivers that every team can use to automate their IaC setup. Stay tuned and find out more at https://github.com/Humanitec.