The first wave of Internal Developer Platforms (IDPs) is settling. And as the dust settles, it’s good practice to shed some light on what worked and what didn’t. While the early wave of platforms was primarily considered a “frontend-devex” world with a focus on portals, we are observing that the platform engineering movement has for too long neglected a vital piece of the puzzle: the contribution of the infrastructure and operations teams. As we are off to the races with the second generation of platforms, infrastructure platform engineers are gearing up to address some of the current flaws in platform design.
Modern infrastructure and operations teams must find answers to questions like “How do we sustainably manage, secure, and maintain all those resources that developers create in self-service across public and private cloud estates?” “How do we observe them?” “How do we ensure standards?”
As the title of this article suggests, I believe the answer to these questions is a focus on backends for IDPs by these “infrastructure platform engineers,” creating common interfaces that their infrastructure and operations colleagues and their DevEx counterparts can develop against. This development, and I want to make this quite clear, does not diminish the role of the DevEx groups; it complements them.
So, in this article, I will explain step by step why your platform needs a backend.
The TL;DR is straightforward: platform architecture is much like any other software architecture, and the principles of software architecture apply equally to platform design. This includes – among others – separation of concerns. While at a smaller scale you can absolutely put more complex logic into the frontend layer, this doesn’t scale up particularly well. Separation of concerns at some point calls for separating the workflows meant for user interaction from those meant for automation. They have entirely different requirements and as such are better separated into frontend and backend, making it easier to create and maintain them.
By doing this, you enable two key things that drive value in your platform:
1. Automation through complex yet maintainable logic.
2. Standardization by getting away from “fire and forget” designs for resource creation and into the world of platform backends that take over end-to-end lifecycle management of resources.
Before we get to it, allow me to preface this: anything outlined here only applies at scale. You can start with a developer team of fewer than 50 people, but you might only see your desired ROI much later, once you start to scale out. Platform engineering can return an outsized investment if you get it right, but the effects likely only kick in once you truly reach scale.
Value drivers in your platform
The value (actual ROI) in platforms is driven primarily by two things: automation and standardization. Together, these reduce the time to market (TTM) and the cost for the business. If you cannot prove this, your initiative will crumble within months.
The value funnel of platform engineering below is a nice way to display how perception morphs into impact, on individual personas as well as on the organization as a whole:
Let’s zoom in on the two key value drivers. The automation bit is straightforward: if today it takes the enterprise 13 weeks and 17 tickets to create a service and all dependent resources across three environments all the way to production, you have a problem. Developers wait while operations and security work through repetitive tickets. If you can automate manual tasks and tickets, you drive value.
The standardization bit is less straightforward because our industry has developed a terrible habit of “special snowflake syndrome.” If you look at the random configurations of all your Postgres databases in staging, you will find they’re all different. Some are a bit different, some very. But different is different. And if you have 300 RDS instances and each one of them is configured differently, you need to maintain 300 versions, secure them, and update them. A perfect example of zero standardization. Does it help if all those instances are configured using Infrastructure as Code (IaC)? Not really. Now, is there value in having 300 different configs for RDS? Do we get more performance for our applications? Freedom? No, there are zero advantages and 300 disadvantages. How many ways are there to configure RDS in staging? Four maybe? Five if we stretch it? Definitely not 300. So you have 295 configurations that just create problems. If you can drive standardization, you drive value, and when successful, you make infrastructure a commodity.
Why do you need a backend to drive automation?
Here’s a good way of thinking about automation: Low automation -> low return, high automation -> high return.
Let’s be more tangible and consider another use case. Assume you are a developer and you’ve been developing your service for a while. Now you want to add a new S3 bucket to this service, and you’re using a platform with a low degree of automation. In most platforms, this would concretely mean the following flow:
- Go to Backstage
- Click on “new S3”
- Backstage calls the GitHub API and executes a Terraform module
You have your S3 bucket, sure. But where’s the value-add of the platform? Why not just execute the Terraform module directly? The relative return of this action is minimal because the bulk of the work comes now and is still manual (reconfigure the workload configs, set secrets, create buckets for the other environments, run policy checks, deploy).
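To make the “fire and forget” pattern concrete, here is a minimal Python sketch of what such a portal action typically boils down to. The repository name, event type, and payload fields are hypothetical; the only real thing here is the shape of the flow: one API call kicks off a Terraform run, and the platform loses track of the resource the moment the call returns.

```python
import os
import requests

# Hypothetical portal handler behind the "new S3" button.
# Repo name, event type, and payload fields are illustrative only.
GITHUB_TOKEN = os.environ.get("GITHUB_TOKEN", "")
REPO = "acme/platform-terraform"  # assumed repo holding the Terraform module


def request_s3_bucket(service: str, env: str = "staging") -> None:
    """Fire a repository_dispatch event that triggers a Terraform apply.

    This is the whole "platform": once the event is sent, the portal has
    no further relationship with the bucket it just asked for.
    """
    resp = requests.post(
        f"https://api.github.com/repos/{REPO}/dispatches",
        headers={
            "Authorization": f"Bearer {GITHUB_TOKEN}",
            "Accept": "application/vnd.github+json",
        },
        json={
            "event_type": "provision-s3",
            "client_payload": {"service": service, "environment": env},
        },
        timeout=10,
    )
    resp.raise_for_status()
    # Fire and forget: no record of the resource, no lifecycle, no drift control.


request_s3_bucket("checkout-service")
```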
What would a highly automated workflow for the same request look like?
- You open a workload spec (or Backstage, it doesn’t matter) and indicate that your workload needs an S3 bucket. Then you git push.
- The frontend will hand this abstract request to the backend, the Platform Orchestrator.
- The Orchestrator analyzes the request and the context by reading the metadata of the deployment.
- It finds that you need an S3 bucket that should be consumed by a specific workload, and that you need that in all environments all the way up to production.
- It now identifies the correct configuration for each environment by reading Resource Definitions provided by the platform team. It then creates the Resources for each environment.
- It creates workload configurations that now contain references to the S3 buckets.
- It fetches the credentials of each resource and injects them into the container through secrets at runtime.
- It policy-checks everything, runs a sign-off flow, and deploys (sketched below).
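Here is a minimal Python sketch of those steps. All spec fields, configuration values, and function names are hypothetical, and a real Platform Orchestrator exposes this very differently; the point is only the division of labor: the developer declares an abstract dependency, and the backend resolves it against environment-specific Resource Definitions, wires up secrets, policy-checks, and deploys.

```python
from dataclasses import dataclass

# --- Platform team's side: one standard config per resource type and environment.
@dataclass
class ResourceDefinition:
    res_type: str
    env: str
    config: dict

RESOURCE_DEFINITIONS = [
    ResourceDefinition("s3", "dev",     {"versioning": False, "encryption": "aws:kms"}),
    ResourceDefinition("s3", "staging", {"versioning": True,  "encryption": "aws:kms"}),
    ResourceDefinition("s3", "prod",    {"versioning": True,  "encryption": "aws:kms", "replication": True}),
]

# --- Developer's side: an abstract request, "my workload needs an S3 bucket".
WORKLOAD_SPEC = {"name": "checkout-service", "resources": {"uploads": {"type": "s3"}}}

# Stubs standing in for real provisioning, secret stores, and policy engines.
def provision(res_type, name, env, config):
    print(f"provisioning {res_type} '{name}' in {env} with {config}")
    return {"id": f"{env}-{name}", "credentials": "***"}

def inject_secret(workload, env, key, value):
    print(f"injecting secret {key} into {workload} ({env})")

def policy_check(workload, env):
    return True  # a real setup would call a policy engine here

def deploy_workload(workload, env):
    print(f"deploying {workload['name']} to {env}")

def orchestrate(workload, environments=("dev", "staging", "prod")):
    """Walk the abstract spec and turn it into concrete, standardized resources."""
    for env in environments:
        for name, res in workload["resources"].items():
            # 1. Match the request to the platform team's Resource Definition.
            definition = next(d for d in RESOURCE_DEFINITIONS
                              if d.res_type == res["type"] and d.env == env)
            # 2. Create the resource from the standard configuration.
            resource = provision(res["type"], name, env, definition.config)
            # 3. Reference it in the workload config; credentials land as runtime secrets.
            inject_secret(workload["name"], env, f"{name.upper()}_ID", resource["id"])
        # 4. Policy-check and deploy.
        if policy_check(workload, env):
            deploy_workload(workload, env)

orchestrate(WORKLOAD_SPEC)
```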
That’s a lot of automation, and it provides a lot of value! It’s also complex, in fact very complex. Which brings us to the next conclusion: low-complexity logic drives little value on average, while high-complexity logic consistently drives high value.
This high-complexity workflow logic needs to be processed somewhere. So where do we put it? It’s generally a good idea to think about platform architecture the way you think about software architecture. And one of the driving principles in software architecture is “separation of concerns.” Applied to the platform world, this means that from some level of complexity onwards, you want to separate the workflows meant for user interaction from those meant for automation. They have completely different requirements and as such are better separated into frontend and backend, making it easier to create and maintain them.
Let’s make it more tangible. If we spend lots of time on the frontend/UI layer but pay no attention to the logic underneath, we’ll create platforms that might look nice but do not drive ROI. They will become impossible to maintain and see limited adoption in the user base. A backend, however, particularly a Platform Orchestrator, is all about executing these complex sequences dynamically and keeping that complex logic maintainable.
What do we learn? High degree of automation, high value, high complexity -> you need a backend to execute high-complexity workflow logic.
Why do you need a backend to drive standardization?
Think of the example outlined above. Zooming into your estate, you’ll likely find hundreds of instances of a certain resource type (Postgres, for instance) running just in staging environments. And more often than not, they are all configured differently. With modern self-service portals in place, it’s now even easier for developers to spin up the n-th resource with the click of a button. But what’s our strategy to maintain, secure, and update this fleet of resources? How do we make sure we observe them, and how do we tear them down if they’re no longer needed? The answer for most infrastructure and operations teams today is simply manual labor. But in times of scarce resources, linearly scaling your operations team with your development team doesn’t cut it and makes the teams reactive rather than proactive. So what to do?
Conceptually, that’s pretty easy to grasp. In a perfect world, all our resources of a certain type (for instance RDS) should be configured the exact same way at all times. But how the heck do you design this? A good way to understand it is to look at how not to design it: by taking a frontend and letting it call GitHub to execute a Terraform file. That method is “fire and forget.” The second the request is made, the resource is “on its own.” In other words, the platform is not managing the resource lifecycle.
If you want to keep all your resources in sync with a global standard, the platform absolutely needs to manage the entire lifecycle of each resource, because it needs to revert changes to the individual resource back to the global standard. Should the standard receive an update, all resources depending on this standard should be updated as well.
It’s almost impossible to achieve this sustainably in frontend logic. What you need is a backend, and the best platform backends are graph-based. This approach is superior because the graph captures the dependencies between all parts of an application. Changes to any node in the graph can flow up and down and influence other nodes as needed. For example, the updated connection string of your Postgres instance can be injected into all dependent services so they can all still connect after the change. In practice, developers declare abstract dependencies. The system then creates a resource graph and continuously updates and enforces it. The graph maps every workload to the resources it depends on, so you know which resources you have to manage the lifecycle for and which global rules to apply, in which context.
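As an illustration, here is a minimal Python sketch of such a resource graph. The node names and the propagation mechanics are hypothetical and far simpler than what a real orchestrator does; it only shows the core idea that a change to one node flows to every node depending on it.

```python
from collections import defaultdict

class ResourceGraph:
    """Toy resource graph: nodes are workloads and resources, edges say "depends on"."""

    def __init__(self):
        self.outputs = {}                   # node -> exported values (e.g. connection strings)
        self.dependents = defaultdict(set)  # node -> nodes that depend on it

    def add_dependency(self, workload: str, resource: str) -> None:
        self.dependents[resource].add(workload)

    def update_output(self, node: str, key: str, value: str) -> None:
        self.outputs.setdefault(node, {})[key] = value
        self._propagate(node)

    def _propagate(self, node: str) -> None:
        for dependent in self.dependents[node]:
            # A real backend would re-render configs and roll the workloads;
            # here we only show the new value reaching each dependent.
            print(f"re-wiring {dependent} with {self.outputs[node]}")


graph = ResourceGraph()
graph.add_dependency("checkout-service", "postgres-staging")
graph.add_dependency("billing-service", "postgres-staging")

# The Postgres instance is rotated; its new connection string flows to both services.
graph.update_output("postgres-staging", "connection_string", "postgres://new-host:5432/app")
```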
What this does is change the relationship between workloads and individual configurations, by context, from one-to-one to one-to-many. In other words, in highly standardized setups all your workloads in staging depend on Postgres instances that are configured against a common template. So if this one template gets updated, all resources depending on it get fleet-updated too.
The ROI of such a system design is enormous. If all your Postgres DBs in staging share exactly the same template from their creation throughout their lifecycle, all of a sudden you just have to maintain those templates and not individual configurations. To enforce this standardization at scale, you need to orchestrate resources continuously. This is best done with a platform backend.
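A minimal sketch of that one-to-many relationship, again with hypothetical names and fields: three staging databases reference a single template, so one change to the template plus a reconciliation run updates the whole fleet.

```python
# One template per resource type and context, maintained by the platform team.
TEMPLATES = {
    "postgres-staging": {"version": "15", "storage_gb": 20, "backups": True},
}

# Resources only record which template they follow plus a bit of identity.
RESOURCES = [
    {"id": "pg-checkout", "template": "postgres-staging"},
    {"id": "pg-billing",  "template": "postgres-staging"},
    {"id": "pg-search",   "template": "postgres-staging"},
]

def apply_config(resource_id: str, config: dict) -> None:
    # Stub for the actual provisioning call (Terraform run, API call, etc.).
    print(f"applying {config} to {resource_id}")

def reconcile() -> None:
    """Continuously (re)apply the template to every resource that references it."""
    for res in RESOURCES:
        desired = TEMPLATES[res["template"]]
        apply_config(res["id"], desired)

# One change to the template...
TEMPLATES["postgres-staging"]["version"] = "16"
# ...and the next reconciliation run fleet-updates every instance.
reconcile()
```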
And now?
Now, you probably agree that your platform needs a backend. That's all fine and good, but how do you figure out how to design a backend and what your options are? And from here, how do you concretely get started?
A good way to get started hands-on is to set up a platform reference architecture that comes with a pre-configured frontend and backend, enabling you to understand both more deeply.
Looking for help? Reach out to one of our platform architects and get started now.