KubeCon North America was quite an experience this year. Whether you looked at the number of booths with platform engineering messaging or listened to the many talks and conversations at parties, it was clear that platform engineering has taken over a big part of the industry mindshare. It was inspiring to talk to engineering organizations of all sizes about their IDP initiatives and plans. I think most would agree at this point that if platform engineering is not in your budget for 2024 (or at the very least on your roadmap), you are falling behind.
While all of this points in the right direction for the industry, it was interesting to see how many teams still think (wrongly) that the main goal of platform engineering is to provide a layer of visibility on top of your setup. Visualizing things is at best a secondary priority for a platform team, and making it the focus drastically reduces the impact your platform can have on your organization's performance.
I understand why many architects and executives go this route. It's the path of least resistance, and it's tempting because providing visibility seems like a quick fix to all problems: management feels like they can "touch" things and are in control. But this is rarely the right starting point for your organization. There are several great articles on the topic by people with far more experience than me, like Aaron's (who built the IDP at Salesforce) "build your house first, not the front door" or Lee's (who built the IDP at Apple) "if you put a pane of glass on a pile of sh*t, you observe a pile of sh*t". So I'll let you explore those, if you think they might apply to you too.
This leaves us with a burning question: how do you ship an IDP with high impact and high ROI? One of the crucial things that isn't talked about enough is your repository structure. The way you organize your repos is essential to shipping a platform design that's intuitive for developers to operate and easy for Ops teams to maintain and secure. In this article, I will discuss what's wrong with most repo structures today and what fixes you can put in place to make your platform and repo setup truly enterprise-ready.
What’s wrong with the status quo
The key issue in most repository structures today is the relationship between workloads and their configuration files. In 99% of the setups I have seen, this is a one-to-one relationship: every workload and every resource is configured individually, for each environment.
There are setups with hundreds of resource definitions for the same resource type, e.g. an RDS Postgres database. That's incredibly inefficient. Why do you need all these variations of a DB? The reality is you don't. At most you'll need five different flavors of your Postgres, depending on the use cases. But your setup is designed in a way that favors an explosion of config variations and config files for each resource, instead of relying on vetted templates.
Here’s an example of what a structure would look like:
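As an illustrative sketch (app names, environments, and file names are placeholders), such a repository tends to look like this:

```
app-a/
├── dev/
│   ├── deployment.yaml      # app config, hand-tuned for dev
│   ├── values.yaml
│   └── rds-postgres.tf      # a full, unique copy of the DB config
├── staging/
│   ├── deployment.yaml      # near-identical copies, drifting apart
│   ├── values.yaml
│   └── rds-postgres.tf
└── prod/
    └── ...                  # and again for every other environment
app-b/
└── ...                      # the same pattern for every workload
```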
The problem is obvious when you look at it from this simple perspective. You maintain a full config representation for every application (A to N) across every environment (A to N). Do the math: 10 services, each with its resource dependencies, across the usual 4 static environments (not even counting dynamic preview environments) already puts you at 300 to 600 config files, with thousands of config versions per month.
That’s clearly a nightmare to maintain and can lead to all sorts of problems:
- Auditability and ownership become extremely challenging. How are you supposed to figure out who made which change, when, and where? How do you roll back to a previous config state?
- Your Ops team breaks into a sweat just thinking about maintaining this mess. Your senior engineers have to jump in to help out individual product teams and start doing shadow operations (i.e. your most expensive resources are not shipping product functionality).
- I don't have to tell you this is a security nightmare. It's impossible to enforce any kind of policy or governance when everyone is copy-pasting config files across all environments for all sorts of resources.
There’s a more professional version of this repository structure where resource config files are moved to dedicated repos, as shown below.
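A hedged sketch of that second layout (repo and file names are illustrative):

```
app-a/                          # workload repo: source plus app configs
├── src/
├── helm/
│   ├── values-dev.yaml
│   ├── values-staging.yaml
│   └── values-prod.yaml
└── Dockerfile
infrastructure-live/            # dedicated IaC repo, owned by Ops
├── dev/
│   └── rds-postgres-app-a.tf
├── staging/
│   └── rds-postgres-app-a.tf
└── prod/
    └── rds-postgres-app-a.tf   # still one hand-maintained file per
                                # resource, per app, per environment
```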
While this is certainly an improvement over the first design, it's still far from ideal. Take, for example, the workflow of a developer adding a Postgres instance to a workload. They take the latest IaC file in use and customize it to reflect the changes they need. They then run it by the Ops team, who help them debug it, and finally have it vetted by the security team. Once that's done, they can run `terraform apply`, add the respective environment variables, register the secrets, and deploy to dev. If they want to go all the way to prod, they have to repeat this process for each environment.
In a mid-size or large enterprise, the back and forth between devs, Ops, and security teams can take up to two weeks, just to get a DB provisioned.
A similarly unpretty picture emerges from the Ops perspective. Say you want to upgrade all your Postgres instances from version X to X+1. First, you need to work out which instances are running where, and which workloads depend on them. You then need to work with the dev teams to update the configs, and probably help them do it. Finally, you have to run the changes by the security team. And you repeat this for every single Postgres instance.
That’s what’s wrong with this approach to repository structure and configuration management. To solve this, you need to abstract by building golden paths (not cages!) for your developers.
Golden paths and enterprise-grade repositories
The driving principle here is that you want your developers to describe their workload and its dependencies in abstract terms, using a workload specification like Score. The spec might express "my workload depends on a DB of type Postgres" and is sent through the CI pipeline with every git push. A Platform Orchestrator then takes the abstract request plus the relevant context (e.g. the dev is deploying to env=staging) and matches it to baseline resource definitions to create or update the DB of type Postgres. It then generates the config files and wires everything up.
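A minimal Score file expressing such a dependency might look like this (the workload name, image, and variable wiring are illustrative):

```yaml
apiVersion: score.dev/v1b1
metadata:
  name: my-workload
containers:
  main:
    image: my-workload:latest
    variables:
      # Resolved by the platform at deploy time; the developer never
      # handles credentials or environment-specific connection details.
      DB_HOST: ${resources.db.host}
      DB_NAME: ${resources.db.name}
resources:
  db:
    type: postgres   # the abstract request: "a DB of type Postgres"
```

Note that nothing here says which Postgres: no instance size, no version, no environment. Those decisions are matched in by the orchestrator from the platform team's baseline definitions.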
The advantage for the developer is that the experience is smooth and easy: with a single file, they describe what they need and run git push. The platform takes care of the rest.
This is a game-changer for security and operations teams. Rather than having to maintain 200 slightly different configs of the same resource, they simply have to maintain one baseline config set. Everything that’s “global” is a golden path and it’s all supported by the platform team.
Here is what this translates to for your repository structure:
At the workload level, the repository is stripped of all app and infrastructure configs: no Terraform files or Helm charts are needed to provision resources. All a workload repository contains is:
- Workload source code
- Workload Spec (Score)
- Dockerfile
- Pipeline configs (assuming you’re not managing them globally)
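Put together, a workload repository under this model can be as small as (names illustrative):

```
my-service/
├── src/                 # workload source code
├── score.yaml           # abstract workload spec
├── Dockerfile
└── .github/workflows/   # pipeline configs, if not managed globally
```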
On the platform level the situation looks as follows:
- Base resource configs / IaC: Terraform, Crossplane, and friends that configure resources in a general way. You might, for instance, have 10 different supported Terraform modules for DBs of type Postgres.
- Resource definitions: tell the Platform Orchestrator what, when, and how to use the base resource configs. For instance: if the abstract request in the spec indicates that the workload requires a resource of type postgres and the env-type is staging, forward the following inputs to a driver and execute this exact Terraform module.
- Automations/compliance: configs for the different toolchain components of the platform. In a good platform, the platform itself is code and is treated as a product.
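To make the resource-definition idea concrete, here is a loose sketch; the exact schema depends on your orchestrator (this one roughly follows the shape of Humanitec resource definitions, with illustrative IDs, module path, and driver wiring):

```yaml
# Illustrative: when any workload in a staging environment requests a
# resource of type postgres, provision it with this vetted module.
id: postgres-staging
type: postgres
driver_type: humanitec/terraform
driver_inputs:
  values:
    source:
      path: modules/postgres-small   # one of the vetted baseline modules
criteria:
  - env_type: staging
```

One such definition replaces every hand-copied Postgres config for that environment type, which is exactly where the maintenance and governance win comes from.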
What this means for your organization
Developers follow clear golden paths, describing what they need in abstract terms and letting the Platform Orchestrator take care of creating or updating the respective resources. Your setup is fully standardized and enforces clear RBAC and governance settings. Your developers can see how a resource gets configured in production but cannot change it, and all your prod resources are configured exactly the same way.
Going "off the golden path" in this case simply means forking the respective resource definition and tweaking it to enable an edge case. And should your platform team observe many similar custom definitions, it can pull them in and offer them globally with an SLA: you've essentially created a new golden path. This lets you genuinely follow a Platform as a Product approach. Rather than guessing which resources to keep versus cut, you observe what your customers request and how they behave, and react by supporting things globally.
Designing your repository structure this way eliminates the need for manual configuration and can reduce the number of config files floating around your org by up to 95% (those 300 to 600 files we mentioned earlier are replaced by a handful of vetted templates). Not only is this a huge boost to your DORA metrics (MTTR drops, deployment frequency increases, etc.), it can also improve your overall time to market by up to 30%.
If you want to dive deeper, book a demo with our Platform Architects for an individual consultation or check out our docs on how you can enable dynamic configuration management with Score and the Platform Orchestrator.