In our piece “Why every Internal Developer Platform needs a backend” we established the need for platform backends. To recap, the two main value drivers of platform engineering are a.) standardization and b.) automation. Automation means taking a developer’s request and delivering something in response. Standardization means keeping the number of special snowflakes to a minimum, which leads to commoditization. In highly commoditized setups, for example, all S3 buckets for workloads in staging environments look exactly the same.
Frontend-first platforms often have a harder time demonstrating clear ROI and therefore justifying their cost. Stuffing complex business logic into a frontend layer is unmaintainable, leading to simple request types that don’t actually automate much. Frontend layers (portals) are also usually “fire and forget” when it comes to resource lifecycle management. In other words, they allow developers to use the frontend to create lots of resources, but the portal does not take on the lifecycle management of those resources. Instead, it delegates this to its human users, leading to non-standardized configuration drift and bespoke maintenance efforts.
More and more infrastructure teams are realizing this, placing individuals from their ranks in the platform engineering team to develop common interfaces that both infrastructure and devex teams can work against. These functions, part of the discipline Gartner refers to as “Infrastructure Platform Engineering”, are tasked with architecting and building platform backends. In this article I want to discuss the two common patterns in the design of platform backends and the following questions:
- What are the characteristics of good backend design for platforms?
- What options do you have?
- How do you approach this task to build one?
Before we go into detail here, I would like to point out that throughout this piece you should assume I am heavily biased. While I am the CEO of Humanitec, I firmly stand behind these views and architectural opinions as an engineer. I would like to invite you to challenge me directly and debate the pros and cons of the approaches outlined.
What is a “backend for an Internal Developer Platform”?
Let's start with a quick definition: A backend for a platform is a set of one or many components that receive user requests from the frontend or other parts of the platform (CI, security systems, etc). The backend executes business logic and changes the state of other parts of the platform accordingly. The backend usually provides interfaces for developer experience (DevEx) teams to build developer self-service flows against. It also provides interfaces for infrastructure and operations teams to codify business logic and build conventions for resource configurations against.
Let’s pick two request types a backend might receive to demonstrate the “job” it performs (in the platform community these are essentially referred to as “golden paths”). Let’s start with one coming in from the user group “application developer”:
- User request: “I need an S3 bucket for my workload ABC across all environments” (request comes through CLI/Portal, API, Code)
- Backend logic execution:
- Identify the context by analyzing deployment metadata coming from CI (for what types of environment does the user request the resource?).
- Identify the correct definition of the resource as provided by the infrastructure and operations team.
- Create the S3 buckets in the right configuration for each environment.
- Regenerate the workload configurations for each environment.
- Pull the credentials of the created resources and inject them through secrets at runtime into the workload.
- Run policy checks.
- Run sign-offs.
- Update the entry in the portal (platform frontend).
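The steps above can be sketched as a sequence of backend functions. This is a minimal, hypothetical illustration: all names, steps, and data shapes are invented for this article, not a real API.

```python
# Hypothetical sketch of the backend's step sequence for the request
# "I need an S3 bucket for my workload ABC across all environments".
# Every step is a stub; the point is the ordering, not the implementation.

def create_s3_for_workload(workload: str, environments: list[str]) -> list[str]:
    log = []
    for env in environments:
        log.append(f"create s3 ({env})")       # create bucket per env definition
        log.append(f"regen config ({env})")    # regenerate the workload config
        log.append(f"inject secrets ({env})")  # pull credentials, inject at runtime
    log.append("policy checks")                # run policy checks
    log.append("sign-offs")                    # run sign-offs
    log.append("update portal")                # update the frontend entry
    return log

steps = create_s3_for_workload("ABC", ["staging", "production"])
```

The per-environment steps repeat for each environment in scope, while policy checks, sign-offs, and the portal update run once for the whole request.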
Next let’s look at a request from the perspective of the infrastructure and operations team:
- User request: “Our global definition for the configuration of S3 buckets in the context “environment = staging” has changed. Here’s the new definition in the form of a Terraform file.”
- Backend logic execution:
- Identify all workloads that depend on the old definition of the S3 bucket.
- Point them at the new definition.
- Start by updating the S3 configurations in all lower-level environments to test the impact of the change.
- Pull the credentials of the updated resources and inject them through secrets.
- Roll it out across all environments.
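This second flow can be sketched in the same hypothetical style: find every workload resource that depends on the old definition and repoint it at the new one, updating lower environments first.

```python
# Hypothetical sketch: roll out an updated global S3 definition.
# The environment order encodes "update lower environments first".
ENV_ORDER = ["dev", "staging", "production"]

def rollout(resources: list[dict], resource_type: str,
            new_definition: dict) -> list[str]:
    """Repoint every resource of the given type at the new definition,
    lower environments first; return the order of updates."""
    order = []
    for env in ENV_ORDER:
        for res in resources:
            if res["type"] == resource_type and res["env"] == env:
                res["definition"] = new_definition  # point at the new definition
                order.append(f'{res["workload"]}/{env}')
    return order

resources = [
    {"workload": "ABC", "type": "s3", "env": "production", "definition": "v1"},
    {"workload": "ABC", "type": "s3", "env": "dev", "definition": "v1"},
    {"workload": "XYZ", "type": "s3", "env": "staging", "definition": "v1"},
]
update_order = rollout(resources, "s3", {"versioning": True})
```

Note that the update order follows the environment hierarchy regardless of how the resources are stored, so the impact of the change is tested in lower environments before it reaches production.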
At the moment we de facto differentiate two key design approaches for platform backends: a.) pipeline-based (with CI/CD pipelines) and b.) graph-based (with a Platform Orchestrator).
Principles in backend design
Assuming we have established the need for a backend, we can next cover how to design one. We know by now that a backend should drive standardization and orchestrate infrastructure resources, and that it should be able to handle complex workflow logic to manage progression and procedures as code makes its way through the various environments. That’s not quite enough, however. Integrability, security, and auditability are equally important.
So here’s my traveler’s guide to backend architecture design. Let’s start with principles:
Drive standardization
Good backends promote standards. This concretely means they lower unnecessary variance in configuration. A good backend acts as a single source of truth and enforces that all resources of a given type (Postgres, DNS, S3) in a given context (such as environment type) are configured exactly the same way at all times. Updates to the global Resource Definition lead to updates in all resources configured by this definition. Should a user have the great idea to change things manually, the system should override back to the global standard.
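The override behavior can be illustrated with a tiny reconcile loop. This is a hedged sketch, not a real implementation: the function name and data shapes are invented for this article.

```python
# Hypothetical reconcile step: manual changes are overridden back to the
# global standard defined by the Resource Definition.

def reconcile(actual: dict, desired: dict) -> dict:
    """Return the keys that drifted, then reset them to the desired standard."""
    drift = {k: actual[k] for k in desired if actual.get(k) != desired[k]}
    actual.update(desired)  # revert to the single source of truth
    return drift

desired = {"versioning": True, "encryption": "aws:kms"}
bucket = {"versioning": False, "encryption": "aws:kms"}  # changed manually
drift = reconcile(bucket, desired)
```

Running such a reconcile continuously (or on every deployment) is what keeps all resources of a given type and context identical at all times.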
Manage the entire lifecycle of a resource
Well-designed backends are not “fire and forget.” In other words, they don’t just let the user create tons of resources and then leave it to human beings to maintain and clean up. Good backends manage the entire lifecycle of a resource, from creation to update to sunset.
API-first
An API-first design allows for extensibility and provides the very foundation on which good platform experience and golden paths are built. Having an API call to add an S3 bucket to a workload in staging is powerful. It also allows us to put any interface on top, leaving true interface choice, which is one of the fundamental principles of platform design. And of course, an API-first design helps us describe the desired state of the system in a declarative fashion. A “platform as code” design with all its advantages for disaster recoverability and testing is fundamentally only possible if your backend is API-first.
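To make this concrete, an API-first interaction might look like the sketch below. The endpoint path and payload shape are invented for illustration; they are not Humanitec’s actual API.

```python
import json

# Hypothetical declarative request: "add an S3 bucket to workload ABC
# in staging". Any interface (CLI, portal, CI step, Score file) can
# produce this exact same call, which is what enables interface choice.
payload = {
    "workload": "ABC",
    "environment": "staging",
    "resources": [{"type": "s3", "id": "reports-bucket"}],
}
body = json.dumps(payload)

# The request would then be sent to a (hypothetical) endpoint such as:
#   POST /orgs/{org}/workloads/ABC/resources   with `body` as the request body
```

Because the payload describes desired state rather than imperative steps, the same document can be stored in git, replayed for disaster recovery, and tested, which is the essence of “platform as code.”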
What I explicitly don’t mean is simply putting a portal on top of existing cloud APIs and sticking to the “fire and forget” world.
Interpret abstract requests from users to enable golden paths
One of the most frequently described pain points of platform engineering is high developer cognitive load. Reducing that load is only possible if the complexity goes somewhere. As Chris Stephenson likes to say, “You cannot eliminate complexity, you can only shift it.” Shifting it means that the request needs to become more abstract. By making the request more abstract, complexity is removed from the frontend the developer uses and deferred to the backend, where it can be better handled and processed by automation. This automation is fueled by the conventions set by platform engineers, leading to the desired standardization, while the abstract request is transformed into calls to the lower-level infrastructure APIs. Whether it arrives through a frontend or as code using formats like Score (in over 98% of all interactions between users and platforms we record the use of code-based interaction), an abstract request is helpful to the user but cannot be interpreted by the lower-level infrastructure APIs. Because of that, platform backends have to transform the user’s abstract request into executable app and infrastructure configurations.
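The transformation itself can be sketched as follows. A Score-like abstract request (shown here as a Python dict; Score itself is a YAML format) only states the dependency, while the backend resolves the concrete configuration per context using conventions set by platform engineers. All names and conventions below are hypothetical.

```python
# Abstract, developer-facing request (Score-like, shown as a dict):
# the developer only declares "my workload needs an S3 bucket".
abstract = {"workload": "XYZ", "resources": {"storage": {"type": "s3"}}}

# Conventions set by platform engineers, keyed by (type, environment type).
conventions = {
    ("s3", "staging"): {"bucket_prefix": "stg-", "versioning": False},
    ("s3", "production"): {"bucket_prefix": "prod-", "versioning": True},
}

def resolve(abstract: dict, env: str) -> dict:
    """Transform the abstract request into an executable configuration."""
    out = {}
    for name, res in abstract["resources"].items():
        conv = conventions[(res["type"], env)]
        out[name] = {
            "bucket": conv["bucket_prefix"] + abstract["workload"].lower(),
            "versioning": conv["versioning"],
        }
    return out

staging_cfg = resolve(abstract, "staging")
```

The complexity (naming schemes, versioning policy, encryption settings, and so on) lives in the conventions table maintained by the platform team, not in the developer’s request.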
Avoid black-boxing and allow users to leave the path
While it’s helpful for the user to be able to use an abstract request format, this might also box them in. It is imperative to design the backend in a way that allows users to leave the path at any time. And even if the user stays on the path, the system should surface timely information on why it’s resolving an abstract request in a particular way. Let’s play through an example in which a user adds an S3 bucket to an existing workload. While the abstract request could be a click in a user interface or a few lines added to a workload config stating you want a “resource of type S3 for workload XYZ”, the backend should a.) clearly surface which Infrastructure as Code (IaC) modules are used to create the respective resource, and b.) where the security posture allows it, let the user fork the base standard, take control of the IaC layer, and thereby de facto leave the golden path. Many users paving a new path can be the perfect signal for the platform team to create a new one and make it “golden.”
Design for easy integration (be multi-everything)
The modern toolchain is multi-everything, and so your backend should be too. While it’s good practice in platform engineering to start with the lowest common denominator, you will eventually need to support several CI tools, several types of IaC, several secrets managers, etc.
Design for security
Backends are powerful–and vulnerable. A single source of truth, able to enforce a whole lot of things and influence how your most precious systems are configured, has to meet the highest security standards. Tight RBAC at all levels, the integration of secret managers, and keeping cross-network flow to the absolute minimum are imperative.
Have a workflow engine built in
As discussed, being able to handle complex workflow automation pays into the key value drivers of every platform engineering initiative. A well-designed backend should be capable of managing the progression between environments. It should enforce policy checks, run sign-offs, update portals, and progress to the next environment. Any workflow should be access-protected. For example, some developers can execute automation workflows all the way to pre-production, but before deploying to production there is an enforced sign-off and policy check.
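A gated environment progression of this kind can be sketched in a few lines. This is a hypothetical illustration of the access-protection idea, not a real workflow engine.

```python
# Hypothetical sketch: an access-protected environment-progression workflow.
# Deploys proceed automatically up to pre-production; gated environments
# require an enforced sign-off and a passing policy check.

GATED = {"production"}  # environments that require sign-off

def progress(envs: list[str], signed_off: set[str],
             policy_ok: bool = True) -> list[str]:
    deployed = []
    for env in envs:
        if env in GATED and (env not in signed_off or not policy_ok):
            break  # stop progression until sign-off and policy checks pass
        deployed.append(env)
    return deployed

# Without sign-off the workflow stops before production:
without = progress(["dev", "staging", "production"], signed_off=set())
# With sign-off it progresses all the way:
with_signoff = progress(["dev", "staging", "production"],
                        signed_off={"production"})
```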
Keep it maintainable
Backends and orchestration systems can, if ill-designed, become unmaintainable beasts. There is always a trade-off to be made between the granularity and depth of logic a backend can process and the cost of maintenance. This is essentially the key argument for graph-based and against pipeline-based designs. Pipeline-based system designs grow exponentially in complexity as the number of features or paths covered increases. I have rarely seen pipeline-based systems scale beyond serving 300 developers with reasonable maintenance effort. Blown-up platform teams are usually a good indication.
Exploring your options
With design principles under our belt let’s look at how to design your backend.
Imperative, pipeline-based backends (CI+IaC)
It’s probably difficult to call these “backends” in the practical sense. What we’re referring to are backends that use the combination of CI pipelines and IaC to try and fit the criteria of good backend design. This is often a legacy of former DevOps and infrastructure teams morphed into platform engineering teams trying to apply the “old” toolset to the new set of problems. DevOps teams are used to working with CI/CD pipelines, systems that are predominantly focused on the “build” part of the software development lifecycle (SDLC). However, what works well in this section of the lifecycle doesn’t really work as a backend.
In its rudimentary form, the user works on an environment-by-environment basis, changing workload and infrastructure configurations in files that are unique per context. CI pipelines run and execute changes to those files with every git push.
There are some more advanced teams that have gone the extra mile, letting the developer describe requests in an abstract way and then transforming these requests through nested pipelines into individual pipeline runs.
Advantages
The advantages are mostly that teams are familiar with this type of tool, so you can likely find a decent amount of talent on the market to configure and maintain pipeline-based systems.
Disadvantages
Pipelines are start/stop-based workload systems. The same argument as for the frontend applies: putting advanced logic into a pipeline system is possible but goes against backend design best practices. While simple logic such as environment progression and sign-offs is well suited to pipeline-based systems, complex business logic in the infrastructure world is not. Let’s say you want the abstract user request “I need an S3 bucket for my workload” to lead to the system creating individually configured buckets per environment, updating workload configs, fetching credentials, running policy checks, injecting secrets into the container at runtime, and then deploying. A pipeline-based system would usually hand this task over to another system that can execute loops and branching logic, because pipelines are more or less always linear in logic and execution. Not doing so would lead to an explosion in pipeline complexity.
Pipeline-based systems are thus often kept to a minimum in logic and lead to poor outcomes. They do not necessarily drive standardization, do not usually manage the full lifecycle of the resource, and they tend to require an extensive number of pipelines that you in turn need to maintain. Further, they are not API-first, are hard to audit, and are difficult to consume with multi-interface approaches. This makes it hard to design golden paths and hard to keep the system maintainable.
A good way to demonstrate this is to look at how we would design the user example of adding an S3 bucket in a pipeline-based backend.
Just for this trivial request, we need a fleet of individual pipeline components:
- One to read the request from the user and look up the correct template
- One to clone the template and execute it
- One to fetch the credentials
- One to inject them as secrets
You need this for each environment; otherwise, you start building very sophisticated if/then logic. And this is only for the “create” request. This design is (again) “fire and forget.” After the template is cloned, it starts drifting, is subject to individual change, and has to be maintained and secured.
Declarative, graph-based backends (Platform Orchestrator)
A Platform Orchestrator is a platform backend that sits post-CI system. It doesn’t replace pipelines, yet it confines them to the build part, which is where they shine. It takes a declarative input of dependencies (frontend input, workload spec, etc). It then interprets the context of the request, matches Resource Definitions provided by the platform, and constructs an execution-ready Resource Graph.
Should the global Resource Definition for a resource of a certain type in a given context change (postgres definition gets updated from V14 to V15) the graph gets rebuilt. Should the user (the developer) declare a new dependency (adding a Redis cache to a workload) the graph will automatically be extended.
There’s one graph per workload and context (one for staging, one for prod, etc.).
This architectural approach allows the use of the exact same definition of a Resource Type per context. In other words, all resources of type “Postgres” in the context of “staging” are configured exactly the same way, are lifecycle managed, and alterations outside of the platform are reverted back to the standard.
The Resource Graph can be executed directly by the Orchestrator or outputted as code into a repository and synced by a GitOps operator such as ArgoCD.
An Orchestrator usually also has deployment pipeline functionality, allowing for highly complex automation logic to manage environment progression.
Advantages
It’s important to note that you can roughly achieve the same outcome with both graph and pipeline-based designs. Yet if we think back to our first semester of computer science fundamentals, we can recall the theory of directed acyclic graphs and their advantages. The simple explanation is that a graph describes the actual dependencies between nodes (resources) based on their edges (dependencies). We can therefore figure out if any dependencies have changed and so avoid recomputing nodes that are unaffected. A simple topological sort on the graph gives us an ordered list of things to provision, excluding things that don't need to change. A linear execution via a pipeline does not provide true dependency information, so you must always compute everything. Furthermore, if you change resource dependencies, someone needs to update the pipeline to reflect the changes.
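The topological-sort argument can be demonstrated with Python's standard library. The graph below is a toy Resource Graph, invented for illustration; the point is that a dependency change only triggers recomputation of the affected subgraph.

```python
from graphlib import TopologicalSorter

# Toy Resource Graph: each node maps to the set of nodes it depends on.
graph = {
    "workload": {"s3", "postgres"},
    "s3": {"iam-role"},
    "postgres": {"iam-role"},
    "iam-role": set(),
}

# A topological sort gives a valid provisioning order (dependencies first).
order = list(TopologicalSorter(graph).static_order())

def affected(graph: dict, changed: str) -> set[str]:
    """Nodes that transitively depend on the changed node, plus itself."""
    out = {changed}
    grew = True
    while grew:
        grew = False
        for node, deps in graph.items():
            if node not in out and deps & out:
                out.add(node)
                grew = True
    return out

# Changing the postgres definition only requires reprovisioning postgres
# and the workload that depends on it; s3 and iam-role are untouched.
todo = [n for n in order if n in affected(graph, "postgres")]
```

A linear pipeline, by contrast, would re-run every step on every change, because the dependency information is implicit in the step ordering rather than explicit in the data.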
In other words, the graph approach is dynamic because it describes your intentions, and you can use those intentions to figure out what the end result should be. The pipeline approach is static because you need to know your desired end result before you start. So building and updating the graph is easy. Building the pipeline is easy too, but updating the pipeline based on changes from the developers is hard.
Graph-based backends drive an unparalleled degree of standardization. They are able to convert abstract requests from users into executable configurations. If configured well, they aren’t a black box and allow users to leave the golden path, should they choose to do so. They are API-first by design, allowing declarative configuration. They can be RBAC’d, and their workflow logic allows for the injection of secrets, sign-offs, policy checks, and more.
Disadvantages
Unlike pipeline-based approaches, Platform Orchestrators are a new concept, and it takes infrastructure platform engineers a moment to understand the logic and approach. Graph-based backends only make sense at a certain scale. For smaller teams with fewer than 50 developers, this approach is probably less recommended.
What now?
One does not need to be terribly original to imagine what I am going to propose next: that you test Humanitec’s Platform Orchestrator to experience the power of a graph-based backend that ticks all the boxes highlighted above.
The easiest way to try it is by using our reference architectures that provide a complete Internal Developer Platform – frontend to backend. If you are interested in going further, and ensuring your organization is doing platform engineering right, talk to one of our platform architects or join the Humanitec Minimum Viable Platform program.