Platform engineering is taking off. 🚀 It appeared on Gartner’s 2022 Hype Cycle for Software Engineering and was identified as a top strategic technology trend for 2023. It’s also taking center stage at KubeCon and State of Open Con. The platform engineering community has over 11,000 active Meetup members around the world, and its Slack channel also counts over 11,000 platform practitioners and enthusiasts.
The community growth is impressive. However, many folks are still confused about where this new discipline comes from and, more importantly, how it differs from more established practices like DevOps and SRE. In this article, I’ll provide key historical context and explain how DevOps, SRE, and platform engineering relate.
Evolution of DevOps
“You build it, you run it.” In 2006, Amazon’s CTO Werner Vogels changed everything when he described the company’s new approach to software engineering. Previously, developers handed their code off to operations to run in production. Now, Amazon required developers to deploy and run their applications and services end to end. This was the beginning of DevOps.
Thought leaders stepped in and provided new metrics for organizations to gauge the success of their DevOps efforts. The DevOps bible, “Accelerate,” established new standard metrics: lead time, deployment frequency, change failure rate, and mean time to recovery (MTTR). Reports like Puppet’s State of DevOps Report and Humanitec’s DevOps Benchmarking Study used these metrics to compare practices of top-performing and low-performing organizations and provide insights to the community. Leading engineering organizations leveraged the DevOps philosophy to develop, deliver, and ship software faster and better than ever. Some organizations were now able to deploy hundreds or thousands of times a day, delivering value to their customers at a speed that was unthinkable in the old throw-code-over-the-fence days.
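To make the four metrics concrete, here is a minimal Python sketch of how they could be computed from a list of deployment records. The `Deployment` record, its field names, and the `dora_metrics` function are hypothetical and exist only for illustration; real teams usually derive these numbers from their CI/CD and incident tooling, and definitions vary slightly between reports (this sketch uses the median for lead time and measures recovery from the moment of deployment).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a failure in production?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments: list[Deployment], window_days: int) -> dict:
    """Compute the four metrics over a reporting window of `window_days` days."""
    if not deployments:
        raise ValueError("need at least one deployment")
    if window_days <= 0:
        raise ValueError("window_days must be positive")

    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    # Only failures that have already been resolved contribute to MTTR.
    recoveries = [d.restored_at - d.deployed_at
                  for d in failures if d.restored_at is not None]

    return {
        "lead_time": median(lead_times),
        "deployments_per_day": len(deployments) / window_days,
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_recovery": (
            sum(recoveries, timedelta()) / len(recoveries) if recoveries else None
        ),
    }


if __name__ == "__main__":
    start = datetime(2023, 1, 2, 9, 0)
    history = [
        Deployment(start, start + timedelta(hours=2)),
        Deployment(start, start + timedelta(hours=4), failed=True,
                   restored_at=start + timedelta(hours=5)),
        Deployment(start, start + timedelta(hours=6)),
        Deployment(start, start + timedelta(hours=8)),
    ]
    for name, value in dora_metrics(history, window_days=2).items():
        print(f"{name}: {value}")
```

Tracked over time, the same four numbers make it visible whether a change in tooling or process actually moved delivery speed and stability.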
Sounds great, right? And it is. But only for a select few.
The problem with DevOps
DevOps unlocked new levels of productivity and efficiency for some software engineering teams. But for many others, DevOps adoption fell short of their high expectations.
Source: 2023 State of Platform Engineering Report by Puppet
While many organizations are advancing on their DevOps journey, studies like the State of DevOps Report by Puppet or Humanitec’s Benchmarking Study show that too many teams are still stuck in the middle and can’t cross what Humanitec calls the DevOps Mountain of Tears.
DevOps Mountain of Tears: DevOps score 0-100 among respondents.
Source: DevOps Benchmarking Study 2023
Why do most organizations get stuck? Why can’t they enter the brave new world of DevOps? There are often a number of reasons, but there is one common theme: cognitive load. Teams often embark on their cloud journeys thinking a microservice architecture running on Kubernetes will fix all of their problems. However, they underestimate the amount of cognitive load complex tools and setups add for their developers. With literally thousands of tools and frameworks for developers to learn and use, it can become impossible to keep up. All of this complexity gets in the way of developers’ most important job: delivering features.
Source: inspired by Daniel Bryant’s talk at PlatformCon 2022
Manuel Pais and Matthew Skelton document the anti-patterns that arise from poor DevOps adoptions in their “DevOps Topologies” catalog. In one scenario, an organization shifts left and eliminates dedicated Ops roles. Developers become responsible for infrastructure, managing environments, monitoring, etc., in addition to their existing workload. In these setups, senior developers often bear the brunt of this shift. They must do the work themselves or spend time and resources assisting their junior colleagues.
This “shadow operations” anti-pattern misallocates the organization’s most expensive and talented resources, hurting its overall productivity.
While DevOps culture proved that developer self-service can increase productivity and efficiency, it also demonstrated that cognitive (over)load is a major problem that needs to be mitigated. This can be accomplished by providing developers with more structure, standardization, and the right level of abstraction.
The problem with SRE
The story of site reliability engineering (SRE) is similar. Established and popularized by Google, this concept was sold to many engineering organizations as the dream culture everyone should aspire to have. Like DevOps, it was a cultural shift that got a lot of hype.
Benjamin Treynor Sloss defines SREs as being responsible for the “availability, latency, performance, efficiency, change management, monitoring, emergency response and capacity planning of their service(s).” SREs use service-level objectives (SLOs) to establish shared expectations for performance, and error budgets to balance reliability against innovation.
Just like with DevOps, there’s nothing wrong with SRE in theory. SLOs and error budgets are useful metrics, and there’s nothing more valuable than having a reliable production environment with an uptime of 99.9%. However, adopted incorrectly, SRE can cause a lot of problems. This is oftentimes the reality, especially for organizations that lack the amount of resources and talent that Google has.
What happens when the quarterly error budget is eaten up after just two weeks? What happens when an organization’s SREs are constantly overworked and close to burnout because of too many unplanned night shifts? When obstacles like this arise, SRE can become a pretty restrictive function. Your SRE will think twice before accepting any further deployments.
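The arithmetic behind these questions is simple enough to sketch. The following Python example assumes a purely time-based availability SLO and a 90-day quarter; the function names are hypothetical, and real SRE teams often define budgets in terms of failed requests rather than minutes of downtime. Under these assumptions, a 99.9% SLO allows roughly 129.6 minutes of downtime per quarter.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Total downtime an availability SLO allows over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    return window * (1.0 - slo)


def budget_remaining(slo: float, window: timedelta, downtime: timedelta) -> timedelta:
    """Budget left after the downtime observed so far (negative if overspent)."""
    return error_budget(slo, window) - downtime


def burn_rate(slo: float, window: timedelta,
              elapsed: timedelta, downtime: timedelta) -> float:
    """Budget consumption relative to an even spend across the window.

    1.0 means the budget would last exactly until the end of the window;
    anything above 1.0 means it runs out early.
    """
    budget_spent = downtime / error_budget(slo, window)  # fraction of budget used
    time_spent = elapsed / window                        # fraction of window elapsed
    return budget_spent / time_spent


if __name__ == "__main__":
    quarter = timedelta(days=90)
    budget = error_budget(0.999, quarter)
    print("quarterly budget:", budget)

    # The scenario from the text: the whole budget is gone after two weeks.
    rate = burn_rate(0.999, quarter, elapsed=timedelta(days=14), downtime=budget)
    print(f"burn rate: {rate:.2f}x")
    print("remaining:", budget_remaining(0.999, quarter, downtime=budget))
```

A burn rate far above 1.0 this early in the quarter is exactly the situation in which an SRE team has strong reasons to slow down or block further deployments.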
Outside of Google, SREs in most organizations lack the capacity to constantly think about ways to enable better developer self-service or improve architecture and infrastructure tooling while also establishing an observability and tracing setup. Most SRE teams are just trying to survive.
This often results in a very conservative mindset. For good reason, many SREs see themselves as gatekeepers to prevent the next disaster.
Source: DevOps Topologies
The root of the problem is that too many teams try to implement practices from elite engineering organizations without fully taking into account the key differences between the respective setups and resources. DevOps Topologies explains this anti-pattern quite well:
“The Ops engineers now get to call themselves SREs but little else has changed. Devs still throw software that is only 'feature-complete' over the wall to SREs. Software operability still suffers because Devs are no closer to actually running the software that they build, and the SREs still don't have time to engage with Devs to fix problems when they arise.”
The rise of platform engineering
Platform engineering is the “discipline of designing and building toolchains and workflows that enable self-service capabilities for software engineering organizations in the cloud-native era. Platform engineers provide an integrated product most often referred to as an ‘Internal Developer Platform’ covering the operational necessities of the entire lifecycle of an application.”
IDPs are important because they help organizations avoid the aforementioned anti-patterns. In Team Topologies, a book that has since become a platform engineering standard, Skelton and Pais detail how having a dedicated platform team helps organizations overcome the fake SRE or shadow ops anti-patterns.
More broadly, good platforms address the problems that arise from poorly adopted DevOps and SRE. Humanitec’s Aeris Stewart explains that “[w]here DevOps creates too much cognitive load for developers, platform engineering seeks to alleviate it by finding the right level of abstraction and paving golden paths. Where fake SRE tends to create bottlenecks for developers, platform engineering prioritizes developer self-service and freedom.”
This is backed up by the research and data in Puppet’s State of DevOps reports for both 2020 and 2021, which found a strong correlation between the increased performance of an engineering organization and the usage of internal platforms.
The core of platform engineering: product mindset!
IDPs are great in theory, but how can organizations make sure that they’re awesome in practice? The answer is to treat the platform as a product.
In his PlatformCon 2022 talk, “Team Topologies” co-author Manuel Pais explains that platforms, like products, rely on voluntary adoption, are designed for ease of use, and change as technology changes. As such, the principles and processes that apply to products should also be applied to platforms.
In practice, this involves conducting user research, creating a product roadmap, soliciting regular feedback from developers, and getting buy-in from various stakeholder groups across the organization. This process should be owned by a dedicated product manager.
Making sure you enforce a product mindset when setting up your platform engineering team is key for long-term success. It prevents your platform team from becoming a glorified helpdesk, spending time and resources on building tools developers don’t want, and failing to sustain developer adoption of the platform.
Thoughtworks Tech Radar explains the benefits of the product approach very well:
“Using a product-thinking approach can help you clarify what each of your internal platforms should provide, depending on its customers. Companies that put their platform teams behind a ticketing system like an old-school operations silo find the same disadvantages of misaligned prioritization: slow feedback and response, resource allocation contention and other well-known problems caused by the silo. We’ve also seen several new tools and integration patterns for teams and technologies emerge, allowing more effective partitioning of both.”
Your goal should be to build an Internal Developer Platform that serves the needs of your customers: the engineers. In the process, you’ll also gain more stability, less config drift, and a true “you build it, you run it” culture.
Here are some other guiding principles we gathered from some of the top-performing engineering organizations:
- Optimize for speed not cost (Courtney Kissler)
- Improve DevEx by reducing cognitive load (Paula Kennedy)
- Research what a great DevEx means to your dev team; otherwise you risk building tools your developers don’t want
- Find the right degree of abstraction for your team (Humanitec’s 2023 DevOps Benchmarking Study)
- Provide golden paths not cages (Mathieu Frenette)
- Evangelize your platform internally (Galo Navarro)
Platform engineering is an exciting opportunity for your organization if you can execute it well. Thankfully, the platform engineering community is here to help. A platform product manager I recently chatted with was glad to find a space and community that shares this product mindset when building platforms. In her experience, most other SRE or traditional DevOps communities tend to have a much narrower focus.
I am particularly excited about the latest community initiative: PlatformCon. PlatformCon 2022 was the first-ever virtual conference by and for platform engineers. More than 6,000 platform practitioners and DevOps experts came together over two days to share insights, best practices, and platform stories. The conference boasted an incredible speaker lineup, with thought leaders like Cloud Strategy author Gregor Hohpe, OpenCredo CEO/CTO Nicki Watt, and former Puppet Field CTO Nigel Kersten. We also saw real-life implementation examples from the folks at Netflix, nesto, and Frontside.
PlatformCon 2023 is just around the corner! This year, we’re looking forward to hearing from Defense Unicorns Value Stream Architect Bryan Finster, Honeycomb CTO Charity Majors, Upbound Developer Advocate Viktor Farcic, and Open Sauced CEO Brian Douglas, just to name a few. Don’t miss out: reserve your spot. In the meantime, check out the community’s weekly email newsletter for the best updates in platform engineering and cloud native.