In this article I argue that the current static approach to application and infrastructure configuration management is the root evil in most cloud-native delivery setups. Static here refers to the way microservices talk to each other and to their dependent resources: these connections are manually scripted by humans for each environment, against a static set of infrastructure components. 99% of all setups today follow this approach, which I refer to as “static configuration management”. Your “classic” Helm chart and Terraform setup is almost certainly designed precisely this way.
This static approach causes problems every time a team wants to do anything that goes beyond the simple update of an image (e.g., rolling back, changing the config or the application architecture, etc.). Every change in static setups requires alignment across silos inside the team, adds unnecessary cognitive load, and is hard to maintain or hand over. The problem gets worse as a function of team size and number of services.
The solution I am highlighting is one we’ve seen adopted well at several high-performing organizations in the last 2-3 years, especially those building dynamic Internal Developer Platforms (approaching configuration dynamically correlates heavily with the platform engineering movement). We call this approach “dynamic configuration management”. Following this method, configurations are split into environment-agnostic and environment-specific elements. The developer describes the workload and its relationship to the rest of the architecture (several workloads with dependent resources such as DBs, file storage, DNS, etc.) in one file. The actual configuration and representation of the application is dynamically created and executed with every deployment. Using an always apt cooking analogy: rather than delivering a baked cake ready for consumption, where almost nothing (filling, toppings, etc.) can be swapped easily, we instead deliver a recipe and bake a fresh cake with every deployment.
All sorts of things might be going through your head now. If you come from the Kubernetes world, you will be thinking: “Aren’t those just Helm charts?” No! “Or the YAML files I wrote today?” No! “Is this Ansible? Chef?” No! “Are we talking about Infrastructure as Code? Terraform? How is that supposed to work?” You’ll need a second to digest all this, but it’s worth it. From all I’m seeing, we will look back in five years and wonder why on earth we ever followed the static path we’re currently on. Brew yourself a strong coffee, sit back, and let me walk you through this step by step.
Static configuration management
Let’s look at a real example to understand the drawbacks of the current static approach. Let’s assume a simple application. It has one service we call the sample-service. This service is exposed to the public internet using DNS with Route53 and stores data in a Postgres database. The service is containerized: we build the image with GitHub Actions, push it to an image registry of your choice, and it ends up in a namespace on EKS, from where it connects to a managed Postgres on AWS RDS. So far, so trivial. Let’s assume we have a bare minimum level of sophistication, so we’re not hardcoding anything in the service; instead we use configuration as code to tell the workload how to run on K8s and how to connect to its dependent resources (DNS and Postgres). To add a little bit of complexity, we have an API key as an additional environment variable. Our configuration will be managed in Helm charts, simply because this is the most commonly used approach.
Our Helm chart may look as follows:
<p> CODE: https://gist.github.com/Kasparvongruenberg/f4312a034ca8c254b6385b0ca17dd518.js</p>
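As a rough sketch of what such a chart contains (illustrative names and values, not the exact contents of the gist), the deployment template might look like this:

```yaml
# templates/deployment.yaml - illustrative sketch only
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-service
spec:
  replicas: {{ .Values.replicas }}
  selector:
    matchLabels:
      app: sample-service
  template:
    metadata:
      labels:
        app: sample-service
    spec:
      containers:
        - name: sample-service
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          env:
            - name: DB_HOST
              value: {{ .Values.db.host | quote }}
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: sample-service-secrets
                  key: db-password
            - name: API_KEY
              valueFrom:
                secretKeyRef:
                  name: sample-service-secrets
                  key: api-key
```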
We also have this values file:
<p> CODE: https://gist.github.com/Kasparvongruenberg/52a041513569e475b9febf5b431cbef7.js</p>
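Again as a hedged sketch (hypothetical values, not the gist’s exact contents), each environment carries its own copy of such a file:

```yaml
# values-dev.yaml - illustrative sketch; staging and prod each get their own copy
replicas: 2
image:
  repository: ghcr.io/acme/sample-service
  tag: "1.4.2"
db:
  host: sample-dev.abc123.eu-west-1.rds.amazonaws.com
dns:
  hostname: sample.dev.example.com
```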
Our cloud resources are set up using Terraform. Our goal is to get an entire environment with a single command to a pipeline. The pipeline will execute Terraform, which will in turn spin up the resources. It will then pass the credentials to the chart, create manifests, build the container, and deploy everything in the fresh namespace. That operation is non-trivial and requires quite a bit of scripting, but even following the static approach you are using today it should be doable. You might think: that’s roughly what I have today, and all is well, correct? Not quite, bear with me! What we got is our dev environment. We’ll do it again and call it staging, and another time and call it prod.
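The scripted pipeline described above might look roughly like this as a GitHub Actions workflow (a hedged sketch; all names, paths, and variables are hypothetical):

```yaml
# .github/workflows/deploy.yaml - illustrative sketch of the scripted pipeline
name: deploy-environment
on:
  workflow_dispatch:
    inputs:
      environment:
        description: "dev | staging | prod"
        required: true
jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      IMAGE: ghcr.io/acme/sample-service:${{ github.sha }}
    steps:
      - uses: actions/checkout@v4
      # Spin up (or update) the cloud resources with Terraform
      - name: Provision infrastructure
        run: |
          terraform -chdir=infra init
          terraform -chdir=infra apply -auto-approve -var "env=${{ inputs.environment }}"
      # Build the workload and push it to the registry
      - name: Build and push image
        run: |
          docker build -t "$IMAGE" .
          docker push "$IMAGE"
      # Render the chart with the fresh credentials and deploy into the namespace
      - name: Deploy chart
        run: |
          helm upgrade --install sample-service ./chart \
            --namespace "${{ inputs.environment }}" --create-namespace \
            -f "values-${{ inputs.environment }}.yaml" \
            --set image.tag="${{ github.sha }}" \
            --set db.host="$(terraform -chdir=infra output -raw db_host)"
```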
We end up with three environments, each with individual Helm charts, connecting to individual resources.
The problem with static configuration management
The setup described above sounds pretty dynamic, right? You execute one pipeline and get a new environment. That alone is already relatively rare, in a world where most setups would do nothing but swap the image path and deploy against existing infrastructure. But actually, no, the setup described above is a prime example of static configuration management. Although spinning stuff up felt quite dynamic, the static nature of this setup becomes obvious as soon as we want to make the slightest change that goes beyond the simple update of an image. Such a “change” may be any of the following:
- Change configurations by adding an environment variable
- Promote configs from one environment to the next
- Create a diff between deployments to understand where errors were introduced
- Any change to the architecture (add a workload, an S3 bucket)
- Refactor the architecture
- Migrate the current state of one environment into a new one
None of these actions is complex individually. Adding a new environment variable to a values file is something you’d probably manage if I woke you up at 3am. The thing is, you will have to perform this action across all files and all environments. And you will need to make sure that you reference exactly the right API key for the right environment. This is where config drift between environments kicks in, which is often one of the main root causes of a high change failure rate.
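For illustration, here is how the “same” variable often ends up looking across hand-maintained values files (hypothetical snippets):

```yaml
# values-dev.yaml
env:
  API_KEY: dev-key-123

# values-staging.yaml
env:
  API_KEY: staging-key-456

# values-prod.yaml - added later, by someone else, with a subtly different key name
env:
  APIKEY: prod-key-789
```

That last typo is exactly the kind of drift that surfaces three deployments later as a production incident.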
Adding an environment variable is a trivial example. But try to add a workload to an existing environment. You need to touch a whole lot of YAML. Then take the change in dev and promote this to staging. Then something goes wrong, three deployments down the line. Now run a diff in Git and find the root cause… that’s what’s wrong with static configuration management. What is slightly annoying in even a small architecture such as the one in our example gets nasty as soon as we scale even a little bit. The complexity (and risk of problems) within static systems grows exponentially with:
- The number of workloads
- The number of dependent resources
- The number of individual contributors
- The number of deployments
Have you ever done a partial release from a staging environment with 20 microservices into production? Or taken over from a colleague who left and tried to understand the values file of an app that’s 2 years old and has dozens of dependencies? How many weeks did it cost you to get that one under control? With an individual set of configurations that can each branch out into dozens of different directions and connect to dozens of different services and resources in the periphery, this is and will remain a nasty problem. Soon the setup gets too complex to be handled or digested by individual contributors. Add specialized teams into the mix (operations, specialized senior hires, etc.) and developers are usually left with a choice between two evils: do I make the change myself, digesting all of this complexity at the risk of making costly errors and being derailed from coding, or do I hand off to a central team of colleagues at the risk of blocking them and waiting days for a simple change?
Static configuration management is bad architectural design: it is error-prone, very hard to maintain, and makes it almost impossible to standardize your setup.
But is this really such a problem? Don’t developers normally just update the image and off they go? Am I describing an edge case here? And why on earth should an entire industry have chosen an approach so unsustainable?
The answer to all of these questions: we’re terrible at understanding how the hours spent on small, seemingly negligible actions compound. We seriously believe that the most common use case in software development is the simple update of an image. And indeed, for that simple update, the static approach works perfectly well. Because of this misconception, an entire generation of vendors has optimized for this one approach (at the time of writing, 99% of tools out there sync some static files using GitOps operators). The thing is: you are dramatically underestimating how much compounded time you are wasting on things that go beyond the simple update of an image.
The cost of static configuration management
I guess we can all agree that it’s insanely frustrating to go through file after file written by somebody else, trying to understand why things don’t fit together and why production has thrown an error. But these are not the only costs of static configuration management.
The example calculation below is based on hundreds of conversations with engineering teams of all sizes. We asked how often, per 100 deployments, they do things that go beyond the simple update of an image. We then asked how much time this eats up (including waiting and errors) for devs and operations.
That still looks negligible, even shown like this. Optimizing for a case like a config change that happens once every 20 deployments? Or for spinning up a new environment, which happens once every 300 deployments? But if you add all of this up, you end up with high numbers very soon. An average team of 7 developers doing 40 deployments spends around 17 hours on such tasks every single week. I want to encourage you to do this calculation based on your numbers and multiply the result by the number of dev teams in your engineering organization. You will see how many full-time equivalents get wasted and that this is a problem too big to ignore.
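To make the arithmetic easy to redo with your own numbers, here is a minimal sketch of the calculation. The task frequencies and durations below are hypothetical placeholders (they are not the survey’s data and won’t reproduce the 17-hour figure); substitute your own:

```python
# Back-of-the-envelope calculator for time spent on tasks that go beyond
# the simple update of an image. All frequencies and durations below are
# hypothetical placeholders - plug in your own team's numbers.

TASKS = {
    # task: (occurrences per 100 deployments, hours per occurrence incl. waiting/errors)
    "add or change an env variable": (5.0, 1.0),
    "promote configs between environments": (4.0, 2.0),
    "debug config drift": (2.0, 4.0),
    "add a workload or resource": (1.0, 6.0),
    "spin up a new environment": (0.3, 8.0),
}

def weekly_overhead(deployments_per_week: float) -> float:
    """Hours per week the team loses to static-configuration tasks."""
    return sum(
        deployments_per_week * (per_100 / 100.0) * hours
        for per_100, hours in TASKS.values()
    )

print(f"{weekly_overhead(40):.1f} hours per week")  # with these placeholders: 11.8
```

Multiply the result by the number of dev teams in your organization to see the full-time equivalents involved.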
So to recap, static configuration management means individual contributors script app and infrastructure configurations manually for every environment. This puts a lot of additional cognitive load on the individual contributor when they need to apply any change that goes beyond the simple update of an image. It also makes it almost impossible to standardize configs across the organization and the whole setup becomes hard to maintain and scale. All this is costing teams a lot of time and resources.
The cake recipe: dynamic configuration management
Enough doomsday scenarios, what’s the fix? The fix comes in the form of dynamic configuration management. In all brevity, this means pulling apart the environment-agnostic from the environment-specific elements of configuration, describing the architecture in one format that works across all environments, and introducing baseline templates to resolve against the respective environment. If you are applying dynamic configuration management, you are using what we call a Declarative Application Model: the entirety of files used to declaratively describe an application following the approach of dynamic configuration management. In its full-blown form, the Declarative Application Model consists of five components:
- The workload (that’s the one coming out of your CI pipelines),
- The workload specification,
- Shared secrets/values,
- The workload profile,
- And resource definitions.
I’m sure that’s a lot to take in; let’s understand this step by step.
Separating environment-agnostic from environment-specific elements
Let’s start by understanding what exactly is meant by “separating environment-agnostic from environment-specific elements of app configuration”. In plain English: isolate all those things that remain constant, irrespective of the environment. The sample-service in our little example connects to a database of type “Postgres”. It does that in dev, staging, production and in any ephemeral environment you might want to utilize. So that’s definitely an agnostic part of the configuration. So are the parameterized environment variables that tell our workload how to connect to that database. Less agnostic is the actual Cloud SQL database in an instance running on Google Cloud Platform (GCP) with its specific credentials. So when we say we “separate things out”, what we mean is that we identify all those things that remain constant and put them on one side, while everything that might change on a per-environment basis goes on the other side. We even go a step further and differentiate the elements that are workload-specific (and can be owned by the developer working on the service) from the elements we can share across workloads and teams (and might be owned by the platform team). The developer might “govern” the fact that her workload connects to a database. The platform team might want to share certain labels and annotations across all environments.
Let’s get into some code and analyze our aforementioned Helm chart to make this clearer.
Looking at the template, we can see that the only environment-specific element of the app configuration is the image. The environment variables, for instance, could apply across all environments.
Template (pseudo YAML)
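A sketch of what such a template could look like, with the one environment-specific element marked (field names are illustrative, not our exact chart):

```yaml
# Pseudo YAML - illustrative template sketch
containers:
  - name: sample-service
    image: {{ .Values.image }}   # environment-specific
    env:                         # environment-agnostic, parameterized
      - name: CONNECTION_STRING
        value: postgres://{{ .Values.db.user }}:{{ .Values.db.password }}@{{ .Values.db.host }}/{{ .Values.db.name }}
      - name: API_KEY
        value: {{ .Values.apiKey }}
```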
Let’s look at the values file next.
Values (pseudo YAML)
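A sketch of what such a values file could look like (illustrative values; the credentials here are the environment-specific part):

```yaml
# Pseudo YAML - illustrative values sketch for one environment
image: registry.example.com/sample-service:1.4.2   # environment-specific
db:                                                # environment-specific credentials
  host: sample-dev.abc123.eu-west-1.rds.amazonaws.com
  user: dev_user
  password: <from-secret-store>
  name: sample
apiKey: <from-secret-store>
```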
Resource credentials are definitely environment-specific (a workload connects to a specific resource using the resource credentials).
Now taking this example, a purely environment-agnostic description of how the workload connects to its dependencies might look as follows:
<p> CODE: https://gist.github.com/Kasparvongruenberg/3cde53d8c1a7aa1eff2e660ba447b0eb.js</p>
This file uses the Platform Agnostic Workload Specification, a specification that is currently being open sourced by an industry consortium; variations of it can already be found in use at many different organizations. It’s obviously impossible to execute this file and get anything meaningful. We have to a) add the environment-specific elements of the configuration and b) add a whole lot more configuration to make this resolve against any runtime.
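For illustration, a workload specification in this spirit might look roughly like the following (a hedged sketch; exact field names vary between implementations of the specification):

```yaml
# Environment-agnostic workload specification - illustrative sketch
apiVersion: score.dev/v1b1
metadata:
  name: sample-service
containers:
  sample-service:
    image: .   # injected at deploy time from CI
    variables:
      CONNECTION_STRING: postgres://${resources.db.user}:${resources.db.password}@${resources.db.host}/${resources.db.name}
      API_KEY: ${resources.api-key.value}
resources:
  db:
    type: postgres
  dns:
    type: dns
  api-key:
    type: secret
```

Note that nothing here names a cluster, an RDS or Cloud SQL instance, or a set of credentials: the file is valid for dev, staging, prod and any ephemeral environment alike.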
This is where we add workload profiles: baseline templates that we can apply the abstract workload model against. For Kubernetes, for instance, this might be something like an empty base Helm chart containing things like a CPU minimum allocation, labels and annotations that should be used across environments, or certain sidecars. The values YAML of such a workload profile might look like this:
values-YAML of a workload profile
<p> CODE: https://gist.github.com/Kasparvongruenberg/8e7f2add79057a0b28ec8faec1f55051.js</p>
In combination, the workload specification and this workload profile yield all the environment-agnostic elements of the application configuration for the respective runtime environment.
In our Kubernetes example, the manifests would be created dynamically. And here we are; this is what dynamic configuration management means: we dynamically create the manifests depending on the context we are deploying into.
Weaving in resources
So far we have only really taken app configurations into account. And here you could say “well fine, one more abstraction like Helm charts so I can drive more standardization”. But the beauty of the dynamic approach becomes apparent once we add infrastructure configuration into the picture.
If we only look at app configs and the infrastructure is already there and doesn’t change, we could just add a lookup file to match the static infrastructure and resolve the parameterized environment variables. But we can take dynamic configuration management to the next level by weaving infrastructure into the Declarative Application Model. Let’s understand this step by step.
In our above example we are missing two things:
- Information on where to find the specific resource credentials (in our example, for Cloud SQL and DNS). Let’s call those infrastructure profiles.
- A way to look up what resources (DB, file storage, cluster, DNS, API keys) to use for which environment. Let’s call this resource definitions.
Resource definitions can be a register of either plain resources (“I am this Postgres and here are my credentials for you to retrieve”) or services that execute some IaC (Terraform, CloudFormation, Crossplane, Pulumi, etc.) and retrieve the credentials that result from executing said IaC.
Resource definitions are just a way to look up which resource (in our example, which Postgres, one that exists or has to be created) should be used in which context. They might contain things like:
- For staging, use an existing Cloud SQL instance with database name A and user credentials B
- For PR environments, create a new Cloud SQL instance by executing the following Terraform file and add a sidecar proxy
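A hedged sketch of what such resource definitions could look like (the format is purely illustrative; every Platform Orchestrator has its own, and the repository URL is hypothetical):

```yaml
# Illustrative resource definitions: which Postgres to use in which context
resource_definitions:
  - type: postgres
    match:
      env_type: staging
    driver: static              # point at an existing instance
    data:
      instance: acme-staging:europe-west3:cloudsql-01
      database: sample
      credentials_secret: staging-db-credentials
  - type: postgres
    match:
      env_type: pr
    driver: terraform           # create a fresh instance per environment
    data:
      module: git::https://github.com/acme/tf-cloudsql   # hypothetical module
      sidecar: cloud-sql-proxy
```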
This leaves us with the environment-agnostic elements:
- Workload specification
- Workload profile
And the environment-specific elements:
- The workload itself in the form of an image
- The resources as resolved through the resource definitions
- Shared values and secrets that are environment-specific
And finally, a way of telling us which specific elements to use in which context: the resource matching.
Together they form what we call a Declarative Application Model.
Teams leveraging this approach usually consume a system called a Platform Orchestrator to execute the model: it creates the application, creates all dependent resources, wires them up and deploys them. This results in the following:
- The image is injected into the workload specification.
- The workload specification is applied to the workload profile and manifests are created and executed.
- The resource definitions are called to check the context and understand what infrastructure needs to be wired up (in case it already exists) or needs to be created.
- The resource definitions are requested to retrieve the resource credentials.
- The credentials are injected into the container at run-time by resolving parameterized environment variables.
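The resolution steps above can be sketched in a few lines of Python (a toy model, not a real Platform Orchestrator; all names, hosts, and credentials are illustrative):

```python
import re

# Toy Declarative Application Model: the workload spec stays the same in
# every environment; only the matched resources differ.
WORKLOAD_SPEC = {
    "name": "sample-service",
    "env": {"CONNECTION_STRING": "postgres://${resources.db.user}@${resources.db.host}/${resources.db.name}"},
    "resources": {"db": {"type": "postgres"}},
}

# Toy resource definitions: which concrete resource to match per environment type.
RESOURCE_DEFINITIONS = {
    ("postgres", "dev"): {"user": "svc_dev", "host": "10.0.1.5", "name": "sample"},
    ("postgres", "staging"): {"user": "svc_staging", "host": "10.0.0.5", "name": "sample"},
}

def deploy(spec: dict, image: str, env_type: str) -> dict:
    """Dynamically create the final configuration for one deployment."""
    # 1. Inject the image coming out of CI.
    resolved = {"name": spec["name"], "image": image, "env": {}}
    # 2./3. Match each declared resource for this environment and fetch credentials.
    creds = {
        res_name: RESOURCE_DEFINITIONS[(res["type"], env_type)]
        for res_name, res in spec["resources"].items()
    }
    # 4./5. Resolve the parameterized environment variables against the credentials.
    def substitute(match: re.Match) -> str:
        _, res_name, key = match.group(1).split(".")
        return creds[res_name][key]
    for var, template in spec["env"].items():
        resolved["env"][var] = re.sub(r"\$\{(resources\.\w+\.\w+)\}", substitute, template)
    return resolved

config = deploy(WORKLOAD_SPEC, image="sample:abc123", env_type="staging")
print(config["env"]["CONNECTION_STRING"])
# -> postgres://svc_staging@10.0.0.5/sample
```

Running the same spec with `env_type="dev"` yields a config wired to the dev database, with no change to any file.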
Long story short, dynamic configuration management is just that: dynamically creating the final configuration files by applying the environment-agnostic elements of the configuration to the environment-specific elements. A representation of the app is dynamically created with every deployment. In dynamic configuration management we follow the approach of “every day is day zero”. There is no “legacy” in the classical sense of scripts that are wired together. I could delete any application, execute the files of the Declarative Application Model, and the result would be the running application.
The Declarative Application Model gives us a great way of structuring our thoughts around this. Workloads and workload specifications are specific to a workload or app. But workload profiles and resource definitions could be shared across workloads, apps, teams and even organizations. This is really powerful because it drives standardization by design, with low cognitive load and without removing context from developers!
To make it tangible, in an organization that wants to drive a high degree of standardization you would fence the repositories containing the workload profile and the resource definition and let only ops/infra teams adjust them. In those cases developers could request a new resource but the creation/matching would happen without them being able to influence how and why which infrastructure component would be matched. These setups are characterized by a high degree of abstraction and thus standardization.
In organizations that want to provide golden paths but be less prescriptive we would leave developers the choice on whether to just take the default infrastructure or go down to the Terraform level to tweak things in detail.
As a rule of thumb we can state that the “lower” you go in the model the more cognitive load you’re adding but the less abstraction you’ll be confronted with.
Taking the analogy of baking a cake again: in the static approach we get the ready-made cake. If you want to change the topping, it gets ugly: you scrape it off, make a mess, and the result is hard to replicate. Dynamic configuration management is the art of just writing a recipe (the Declarative Application Model) and letting a baking machine (the Platform Orchestrator) bake the cake for you with every deployment.
First, the dynamic approach solves all the downsides we see in the static world. Any change that goes beyond the simple update of an image (rolling back, spinning up a new environment, changing the architecture, etc.) is now significantly easier to do, as it only requires us to change the abstract workload configuration and redeploy. But there are lots of other positives:
- Standardization by design: Using dynamic configuration management, we not only differentiate the environment-agnostic from the environment-specific elements of configuration, we also share workload profiles across multiple workloads/apps or teams. This limits the variance between configurations significantly. Individual contributors focus on the abstract workload specification (only one per workload), which is the same across all environments. Platform teams can govern workload profiles and resource definitions. This way of working leads to standardization by design across all configuration components. Even security reviews become faster, as you only need to do them once to be able to get a new resource from a pre-vetted template.
- Reduced maintenance overhead: Similarly, by introducing a standardized way of creating configuration you get rid of the randomness of manual “change by change” configurations. This significantly reduces the overhead of maintaining and documenting existing setups. Something you will be grateful for as the application lifetime increases.
- Reduced change failure rate by eliminating config drift: What connects to what resource is now pulled into one place per app, the resource definitions. The workload specification remains the exact same across any environment. This makes it really hard to have your workload running in prod connect to a test DB (although arguably not impossible).
- Abstract without abstracting: Rather than having to deal with and dissect every single file that composes the application, developers can choose to stay “high-level” on the workload specification. At the same time, they can dive down to the level of the workload profile and resource definitions at any time. This allows them to move fast without losing any context.
- Reduced cognitive load for the developer: The approach of letting developers handle the full depth of configurations from image to resource has led to significantly slower delivery and shadow ops. The recent DevOps benchmarking report paints a good picture of that. Dynamic configuration management gives devs full flexibility with minimal load. Even the config break between local and the cloud can be removed by resolving the workload configuration against something like docker compose dynamically.
- More self-service for devs, without more responsibility: In a dynamic model, adding an S3 bucket to your architecture is literally as simple as describing the new resource and adding a parameterized environment variable. As long as S3 buckets are matched by your resource definitions to your environment type, they can be immediately created and wired up. This eliminates the need to file tickets in JIRA that some poor operations team has to bash through while developers wait.
- New ways of working and new features: There is a wealth of functionality that dynamic configuration management enables that was simply not possible before. Like taking the state of any environment and launching it as a new environment with the exact same resource components. Or getting an end-to-end audit log of everything that was ever deployed, by whom and where, for easy debugging.
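To make the self-service point above tangible: in a workload specification along the lines sketched earlier, requesting the bucket could be as small as this change (resource and variable names are hypothetical):

```yaml
# Workload specification - before and after requesting an S3 bucket
resources:
  db:
    type: postgres
  uploads:              # new: request an S3 bucket
    type: s3
containers:
  sample-service:
    variables:
      CONNECTION_STRING: postgres://${resources.db.user}@${resources.db.host}/${resources.db.name}
      UPLOADS_BUCKET: ${resources.uploads.bucket}   # new parameterized variable
```

On the next deployment, the resource definitions match the `s3` type for the target environment, create or locate the bucket, and resolve the variable.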
Where are we with “dynamic configuration management”?
For now, most organizations using this approach have defined the specifications themselves and built implementations that apply the environment-agnostic elements of the configuration to the environment-specific elements. A project called score.dev attempts to standardize this further. The accelerating trend of platforming and the increased usage of Platform Orchestrators is making this methodology spread faster and faster.
Dynamic configuration management as the key to modern Internal Developer Platforms
Increasingly, the trend toward higher standardization and lower cognitive load on developers is leading organizations to adopt platform engineering. Dynamic configuration management and Declarative Application Models are essential to making platforming attempts successful. The above-mentioned advantages, like standardization by design, reduced cognitive load, abstract without abstracting, etc., are the design principles this movement follows. Internal Developer Platforms adopting the approach of dynamic configuration management are referred to as “dynamic Internal Developer Platforms”. The core of dynamic IDPs is the Platform Orchestrator, which interprets the Declarative Application Model and dynamically creates the representation of the application with every deployment.
Wrapping it up
This was a lot to digest. It’s likely different from everything you have experienced so far. But just because all of us are following one path, it doesn’t mean it’s the right one. Dynamic configuration management might not be necessary for every setup, especially for small teams or small architectures. But the gains in maintainability, standardization, ease of use and advanced functionality, and the drop in change failure rate, all contribute to the rapid adoption of this approach. There is good reason to believe that within the next 10 years this is going to be the predominant approach.